Even AI struggles to work under stress, study suggests
The ability to adapt to a changing environment is integral to the practice of medicine, for humans and algorithms alike. Even the best of both struggle from time to time.
This was the case recently for an award-winning deep learning model touted for its excellent performance in pediatric bone age prediction. When experts subjected the model, which won the 2017 RSNA Pediatric Bone Age Machine Learning Challenge, to a “stress test” of sorts, its degraded performance led researchers to question how it might fare in the real world.
The research was detailed recently in a paper published in Radiology: Artificial Intelligence.
“Despite radiologist-level performance for medical imaging diagnosis, DL models’ robustness to both extreme and clinically encountered image variation has not been thoroughly evaluated,” corresponding author Paul H. Yi, from the Department of Diagnostic Radiology and Nuclear Medicine at the University of Maryland School of Medicine, and co-authors explained. “In clinical practice, image acquisition is variable, and there is no standard of orientation or postprocessing, which could be an overlooked source of error for DL models in radiology.”
Experts initially tested the model on two different datasets: the RSNA validation set, which contains more than 1,400 pediatric hand radiographs, and the Digital Hand Atlas (DHA), which includes more than 1,200 pediatric hand X-rays. As expected, the model performed well on both, “indicating good model generalization to external data,” the authors wrote.
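External validation of this kind typically amounts to scoring the fixed model against reference bone ages on each held-out set. The short Python sketch below illustrates the idea; it is not the study’s code, the load_dataset() and predict_bone_age() helpers are hypothetical placeholders, and mean absolute error in months is assumed as the metric.

```python
# Minimal sketch of baseline external validation for a bone-age model.
# load_dataset() and predict_bone_age() are hypothetical placeholders.
import numpy as np

def mae_months(dataset_name):
    images, reference_ages = load_dataset(dataset_name)          # hypothetical loader
    preds = np.array([predict_bone_age(img) for img in images])  # hypothetical model call
    return float(np.mean(np.abs(preds - np.asarray(reference_ages))))

for name in ("RSNA_validation", "Digital_Hand_Atlas"):
    print(f"{name}: mean absolute error = {mae_months(name):.1f} months")
```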
However, when the test was repeated on images altered to reflect real-world variation (rotation, flipping, grayscale inversion, moving the marker, and adjusting contrast, brightness, and resolution), the model’s predictions shifted substantially from baseline, resulting in clinically significant errors for 57% of its interpretations of images from the DHA dataset.
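To make those perturbations concrete, the sketch below shows how such alterations might be scripted with Pillow before re-running the model on each altered image and comparing against its baseline prediction. The file name, perturbation settings, and predict_bone_age() call are illustrative assumptions, not the researchers’ actual pipeline.

```python
# Minimal sketch of the kinds of image perturbations described above.
from PIL import Image, ImageOps, ImageEnhance

def perturbations(img):
    """Yield (name, altered image) pairs mimicking acquisition variation."""
    w, h = img.size
    yield "rotated_10deg", img.rotate(10, fillcolor=0)
    yield "flipped", ImageOps.mirror(img)                         # horizontal flip
    yield "inverted", ImageOps.invert(img)                        # grayscale inversion
    yield "low_contrast", ImageEnhance.Contrast(img).enhance(0.5)
    yield "brighter", ImageEnhance.Brightness(img).enhance(1.5)
    yield "low_res", img.resize((w // 4, h // 4)).resize((w, h))  # degrade resolution

original = Image.open("hand_xray.png").convert("L")               # hypothetical file
baseline = predict_bone_age(original)                             # hypothetical model call
for name, altered in perturbations(original):
    delta = predict_bone_age(altered) - baseline
    print(f"{name}: change vs. baseline = {delta:+.1f} months")
```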
Many of the prediction errors would have resulted in a change in diagnosis, and potentially in treatment as well. This brings to light the “potential pitfalls in using these models in true clinical practice without physician oversight,” the authors noted, adding that great caution and physician oversight are imperative when deploying similar models into clinical practice.
"This stress testing is crucial for ensuring the clinical readiness of models and for accounting for potential variations in acquisition protocols, vendors, and so on. Rigorous stress testing at several checkpoints in the deployment pipeline can help facilitate improved model development, ultimately leading to the creation of more robust and widely applicable models, thus positively impacting patient care and safety."