AI proves a qualified second reader for screening mammography
A breast screening AI has matched the averaged performance of 552 human readers at detecting or ruling out cancer in a study sample of 120 cases.
Further, when the algorithm’s recall threshold was calibrated to mean human reader performance, the AI notched another tie (90% sensitivity for the human readers vs. 91% for the AI, and 76% vs. 77% specificity).
Across the 120 cases (two breasts each), the test sets contained 161 normal breasts, 70 malignant breasts and nine benign breasts. The 552 readers included 315 board-certified radiologists, 206 U.K. radiographers and 31 breast clinicians.
The study was conducted at the University of Nottingham in the U.K. and published in Radiology Sep. 5 [1].
For the head-to-head matchup, Yan Chen and colleagues used two test sets of 60 challenging cases each. They drew the sets from a performance-assessment scheme called “Performs,” for Personal Performance in Mammographic Screening, which has been used to test mammographers for more than 30 years in the U.K.
The researchers had the human readers take the tests in 2021 and the AI in 2022.
The readers, human and machine alike, assigned a suspicion-of-malignancy score to anomalous features in each breast.
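To make the mechanics concrete, here is a minimal illustrative sketch in Python, using entirely made-up scores, labels and a hypothetical threshold of 50 (not the study’s data or operating point), of how a recall threshold turns suspicion-of-malignancy scores into recall decisions, from which sensitivity and specificity follow:

```python
# Illustrative only: toy suspicion-of-malignancy scores, not study data.
# Sensitivity = recalled cancers / all cancers
# Specificity = correctly passed non-cancers / all non-cancers

truth  = [1, 1, 1, 0, 0, 0, 0, 0]         # 1 = cancer present in the breast
scores = [92, 78, 45, 60, 30, 12, 55, 8]  # per-breast suspicion scores, 0-100

RECALL_THRESHOLD = 50  # hypothetical operating point

recalled = [s >= RECALL_THRESHOLD for s in scores]

tp = sum(1 for t, r in zip(truth, recalled) if t == 1 and r)      # cancers recalled
fn = sum(1 for t, r in zip(truth, recalled) if t == 1 and not r)  # cancers missed
tn = sum(1 for t, r in zip(truth, recalled) if t == 0 and not r)  # healthy, not recalled
fp = sum(1 for t, r in zip(truth, recalled) if t == 0 and r)      # healthy, recalled anyway

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"sensitivity={sensitivity:.0%}, specificity={specificity:.0%}")
```

Raising the threshold recalls fewer breasts, trading sensitivity for specificity; lowering it does the reverse. That trade-off is why the choice of operating point matters when comparing an algorithm against human readers.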
Concept proven, clinical validation to come
Reviewing the results, Chen and colleagues found no statistically significant difference at the breast level between the AUC for the AI and the AUC for the human readers (0.93 and 0.88, respectively).
When applying a set threshold for recall scores, they again found no significant difference in sensitivity (84% for the AI vs. 90% for the humans). At that threshold, however, the AI showed notably higher specificity than the humans: 89% vs. 76%.
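For readers curious how the AUC figure relates to these threshold-based numbers: AUC summarizes ranking performance across all possible recall thresholds, so calibrating the threshold (as in the matched-threshold analysis above) shifts the sensitivity/specificity trade-off without changing the AUC. A minimal sketch, again with invented scores and assuming scikit-learn is available:

```python
# Illustrative only: invented per-breast labels and suspicion scores,
# not the study's data.
from sklearn.metrics import roc_auc_score

truth  = [1, 1, 1, 0, 0, 0, 0, 0]         # 1 = malignant breast, 0 = not
scores = [92, 78, 45, 60, 30, 12, 55, 8]  # AI suspicion-of-malignancy scores

# AUC is threshold-free: it measures how well the scores rank
# malignant breasts above non-malignant ones.
print(f"AUC = {roc_auc_score(truth, scores):.2f}")  # 0.87 on this toy data
```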
In their discussion, Chen and co-authors acknowledge several limitations in study design. First among these are the small size and atypical composition of the Performs datasets.
By design, then, the cases used were not representative of typical screening populations: They are “enriched” with purposely difficult cases, probably don’t reflect a racially or ethnically diverse set of patients and, for these reasons and more, are of questionable generalizability.
Nevertheless, Chen and co-authors maintain, the use of external quality assessment schemes like Performs “may provide a model for regularly assessing the performance of AI in a way similar to the monitoring of human readers. … [F]urther work is needed to ensure this assessment model could work for other AI algorithms, screening populations and readers.”
So much for second reads?
In an accompanying opinion piece, Yale radiologist Liane Philpotts, MD, suggests the success of AI in the Nottingham study means the days of double reading are numbered [2].
While European guidelines recommend double reading of all mammograms, Philpotts remarks, shortages of qualified readers around the world “make double reading an unsustainable burden for which AI is a logical solution.”
More:
“While double reading has generally not been used in the United States, many U.S. radiologists interpreting mammograms are nonspecialized and do not read high volumes of mammograms. Thus, the AI system evaluated by Chen et al. could be used as a supplemental tool to aid the performance of readers in the United States or in other countries where screening programs use a single reading.”