5 findings that could spur imaging AI researchers to ‘avoid hype, diminish waste and protect patients’
As the research mounts on AI matching or besting radiologists in discrete interpretive tasks, so does the excitement in the popular press. Now comes a study showing that many of those studies do little more than invite the hype.
Myura Nagendran of Imperial College London and colleagues started their literature review by searching out more than 8,000 documents, mostly study records but also nearly 1,000 trial registrations.
Trial reports met the researchers’ criteria for inclusion if they were peer-reviewed scientific articles on original research, assessed deep-learning algorithms applied to a clinical problem in medical imaging, and compared an algorithm’s performance against that of a group of human interpreters that included at least one expert.
Further, the human group could not have been involved in establishing the ground-truth diagnosis; true target disease status had to have been verified by best clinical practice.
Reporting their findings in The BMJ, Nagendran and team whittle their conclusions down to five key observations.
1. The literature review yielded few relevant randomized clinical trials, whether ongoing or completed, of deep learning in medical imaging. Acknowledging that deep learning only went mainstream six or so years ago, the authors note that many more randomized trials are likely to be published over the next decade. Still, they found that just one randomized trial registered in the U.S. met their study criteria, despite the FDA having approved at least 16 deep learning algorithms for medical imaging for marketing. “While time is required to move from development to validation to prospective feasibility testing before conducting a trial,” they comment, “this means that claims about performance against clinicians should be tempered accordingly.”
2. Of the non-randomized studies meeting the criteria, only nine were prospective and just six were tested in a real-world clinical environment. Finding that fewer than a quarter of the assessed studies compared time for task completion between humans and algorithms, Nagendran and colleagues underscore that randomized—or at least prospective—trials are best for fairly comparing AI against humans. However, even in a randomized clinical trial setting, “ensuring that functional robustness tests are present is crucial,” the authors write. “For example, does the algorithm produce the correct decision for normal anatomical variants and is the decision independent of the camera or imaging software used?”
3. Limited availability of datasets and code makes it difficult to assess the reproducibility of deep learning research. When included at all, descriptions of the hardware used in the trials tended to be brief, vague or both. The authors suggest these shortcomings could affect implementation and external validation. “Reproducible research has become a pressing issue across many scientific disciplines and efforts to encourage data and code sharing are crucial,” Nagendran and team state. “Even when commercial concerns exist about intellectual property, strong arguments exist for ensuring that algorithms are non-proprietary and available for scrutiny. Commercial companies could collaborate with non-profit third parties for independent prospective validation.”
4. The number of humans in the comparator group was typically small, with a median of only four experts. Even among expert clinician interpreters, wide variation in diagnosis is not uncommon. For that reason, trial reliability depends mightily on having “an appropriately large human sample” against which to compare any AI system. “Inclusion of non-experts can dilute the average human performance and potentially make the AI algorithm look better than it otherwise might,” the authors write. “If the algorithm is designed specifically to aid performance of more junior clinicians or non-specialists rather than experts, then this should be made clear.”
5. Descriptive phrases suggesting that an algorithm’s diagnostic performance was at least comparable to (or better than) a clinician’s appeared in most abstracts, despite the studies having overt limitations in design, reporting, transparency and risk of bias. Qualifying statements about the need for further prospective testing were rarely offered in study abstracts and weren’t mentioned at all in some 23 studies that claimed superior performance to a clinician, the authors report. “Accepting that abstracts are usually word limited, even in the discussion sections of the main text, nearly two thirds of studies failed to make an explicit recommendation for further prospective studies or trials,” the authors write. “Although it is clearly beyond the power of authors to control how the media and public interpret their findings, judicious and responsible use of language in studies and press releases that factor in the strength and quality of the evidence can help.”
Expounding on the latter point in their concluding section, Nagendran et al. reiterate that using “overpromising” language in studies involving AI-human comparisons “might inadvertently mislead the media and the public, and potentially lead to the provision of inappropriate care that does not align with patients’ best interests.”
“The development of a higher quality and more transparently reported evidence base moving forward,” they add, “will help to avoid hype, diminish research waste and protect patients.”
The study is available in full for free.