Shengjia Chen, MS
Icahn School of Medicine at Mount Sinai
Artificial intelligence models are increasingly used in healthcare, yet concerns remain that medical imaging data may encode demographic signals that models can exploit, potentially contributing to biased clinical predictions. In computational pathology, it remains unclear whether histological images contain signals related to patient demographics such as race, age, and sex. We investigated whether deep learning models can infer demographic attributes from hematoxylin and eosin (H&E)-stained whole slide images using a large pathology dataset from the Mount Sinai Health System, comprising 274k patients, 731k pathology cases, and 1.3 million slides across 25 tissue types. Tile embeddings were generated using pretrained pathology foundation models and aggregated using attention-based learning to predict three demographic attributes: self-reported race, age group, and sex. Across organs, models achieved moderate predictive performance, with a mean area under the receiver operating characteristic curve (AUROC) of approximately 0.61 and values ranging from near chance to 0.85 depending on tissue type. Several organs, including lymph nodes, cervix, breast, uterus, and skin, showed stronger demographic signals. These findings show that demographic attributes can be partially inferred from histological images and highlight a pathway for unintended shortcut learning in medical AI systems. Understanding these signals is important for developing fair, interpretable, and reliable AI models for healthcare.
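
The aggregation step, in which many tile embeddings are combined into a single slide-level representation, can be illustrated with a minimal sketch. It assumes a tanh-then-softmax attention form that is common in attention-based pooling; the abstract does not specify the architecture actually used. The function name `attention_pool`, the dimensions, and the random parameters `V` and `w` are hypothetical stand-ins for learned weights and real foundation-model embeddings.

```python
import numpy as np

def attention_pool(tiles, V, w):
    """Combine tile embeddings (n_tiles, d) into one slide embedding (d,).

    V (d, h) and w (h,) are attention parameters. In a trained model they
    would be learned; here they are random placeholders (an assumption).
    """
    scores = np.tanh(tiles @ V) @ w        # one attention score per tile
    scores = scores - scores.max()         # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    slide_embedding = weights @ tiles      # attention-weighted average
    return slide_embedding, weights

# Hypothetical example: 500 tiles, 64-dimensional embeddings.
rng = np.random.default_rng(0)
tiles = rng.normal(size=(500, 64))
V = rng.normal(size=(64, 16)) * 0.1
w = rng.normal(size=16)

slide_embedding, weights = attention_pool(tiles, V, w)
```

The slide embedding would then feed a classifier for each demographic attribute, and the per-tile weights indicate which regions contributed most to the prediction, which is one reason attention-based aggregation is often used when interpretability matters.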
