AI shows bias based on race and sex
In a recent study, an AI model used to read chest X-rays showed bias based on both race and sex in its analysis of scans, producing uneven performance across patient groups and calling into question the use of “foundation models” in healthcare settings. The findings are published in Radiology: Artificial Intelligence. [1]
Foundation models offer the potential to perform analytical tasks without the need for copious amounts of training data. According to the study authors, this makes them popular in healthcare, given the hurdles associated with amassing substantial, high-quality data from diverse patient populations.
“There’s been a lot of work developing AI models to help doctors detect disease in medical scans,” lead researcher Ben Glocker, PhD, Imperial College London, said in an interview with RSNA. “However, it can be quite difficult to get enough training data for a specific disease that is representative of all patient groups.”
For their research, Glocker and his colleagues conducted a comparative assessment between a recently introduced chest X-ray foundation model and a reference model they had constructed. The foundation model had been pre-trained using an extensive dataset of over 800,000 chest X-rays sourced from patients in both the United States and India.
In preparation for the study, both models analyzed 127,118 chest X-ray images, each accompanied by diagnostic labels. The research team then had the models examine 42,884 patient X-ray images, categorized into subgroups based on patient race, in this case Asian, Black and white. The patients had a mean age of 63, with males constituting 55% of the total. The objective was to assess the performance of both models and determine whether their readings were inconsistent across race and sex.
After reviewing the results, the researchers found that the foundation AI showed bias related to both race and sex when compared with the reference model. Specifically, classification accuracy for the "no finding" label declined across subgroups, with a drop of 6.8% to 7.8% for female patients. Similarly, the model's ability to detect pleural effusion, characterized by an accumulation of fluid around the lungs, fell by 10.7% to 11.6% in Black patients.
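To make the nature of this comparison concrete, the sketch below shows, in Python, how per-subgroup accuracy gaps between two models might be computed. Everything in it, including the data, column names and simulated accuracy levels, is a hypothetical placeholder rather than the study's actual evaluation pipeline or results.

```python
# Minimal sketch of comparing per-subgroup accuracy between two models.
# All data and column names below are synthetic placeholders.
import numpy as np
import pandas as pd

def subgroup_accuracy(df: pd.DataFrame, pred_col: str, group_col: str) -> pd.Series:
    """Accuracy of one model's predictions, computed separately per subgroup."""
    correct = df[pred_col] == df["label"]
    return correct.groupby(df[group_col]).mean()

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "label": rng.integers(0, 2, n),                      # 1 = hypothetical "no finding"
    "sex": rng.choice(["female", "male"], n),
    "race": rng.choice(["Asian", "Black", "white"], n),
})

# Simulated predictions: the "foundation" model is made slightly less
# accurate for one subgroup to mimic the kind of gap the study reports.
df["ref_pred"] = np.where(rng.random(n) < 0.90, df["label"], 1 - df["label"])
fm_acc = np.where(df["sex"] == "female", 0.82, 0.90)
df["fm_pred"] = np.where(rng.random(n) < fm_acc, df["label"], 1 - df["label"])

# Per-subgroup accuracy gap (foundation minus reference); a consistently
# negative value for one subgroup is the signature of the reported bias.
gap = subgroup_accuracy(df, "fm_pred", "sex") - subgroup_accuracy(df, "ref_pred", "sex")
print(gap.round(3))
```

In this toy setup the simulated foundation model is deliberately made less accurate for one subgroup, so the printed gap is negative for that group and close to zero for the other.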
As a result, the researchers conclude foundation AI models “may be unsafe for clinical applications.” Glocker said the findings emphasize the need for researchers to have more access to AI models, especially if those applications are being used in a real-world healthcare setting. He believes if these models are to be used in a clinical setting, there needs to be full transparency, which means more research and likely some changes to the technology.
“AI is often seen as a black box, but that’s not entirely true,” Glocker added. “We can open the box and inspect the features. Model inspection is one way of continuously monitoring and flagging issues that need a second look. As we collect the next dataset, we need to, from day one, make sure AI is being used in a way that will benefit everyone.”
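As a loose illustration of what such model inspection could involve, the sketch below projects a model's image embeddings to two dimensions and measures how far apart two patient subgroups sit in that space. The embeddings and subgroup labels are random placeholders, not outputs of the chest X-ray foundation model, and this is only one of many possible checks.

```python
# Hypothetical model-inspection check: do a model's image embeddings
# separate by patient subgroup? Embeddings and labels below are random
# placeholders, not outputs of the chest X-ray foundation model.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 128))        # 500 images, 128-dim features (arbitrary)
subgroup = rng.choice(["female", "male"], size=500)

# Project to 2D and compare subgroup centroids. A large separation would
# suggest the features encode patient attributes rather than disease alone,
# the kind of issue continuous monitoring is meant to flag.
coords = PCA(n_components=2).fit_transform(embeddings)
centroids = {g: coords[subgroup == g].mean(axis=0) for g in np.unique(subgroup)}
separation = np.linalg.norm(centroids["female"] - centroids["male"])
print(f"Subgroup centroid separation in PCA space: {separation:.3f}")
```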
Glocker said the study also underscores the need for large datasets when developing AI applications, which cuts against the primary selling point that has made foundation models so popular.
“Dataset size alone does not guarantee a better or fairer model,” he said. “We need to be very careful about data collection to ensure diversity and representativeness.”