Most AI imaging studies don’t adequately validate their methods
Before an AI algorithm can be put into clinical practice, it must be adequately validated on external datasets from multiple institutions. But a new study analyzing the published literature found that most algorithms aren’t properly tested for real-world use.
Authors of the research, published in the Korean Journal of Radiology, analyzed 516 published studies and found only 6 percent (31 studies) externally validated their AI. Of those 31 studies, none took the necessary steps to determine whether their method was ready for clinical use.
“Nearly all of the studies published in the study period that evaluated the performance of AI algorithms for diagnostic analysis of medical images were designed as proof-of-concept technical feasibility studies and did not have the design features that are recommended for robust validation of the real-world clinical performance of AI algorithms,” wrote Seong Ho Park, MD, PhD, of the department of radiology and research institute of radiology at the University of Ulsan College of Medicine in Seoul, Korea, and colleagues.
To properly test an algorithm for image analysis, a study must include three design features, according to the researchers: a diagnostic cohort design, inclusion of multiple institutions and prospective data collection for external validation. They recommend using sizable datasets collected from newly recruited patients or from institutions that did not provide training data. The data should reflect relevant variations in patient demographics and diseases in the setting in which the AI will be deployed.
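One way to picture those three criteria is as a simple checklist applied to each study. The sketch below is purely illustrative; the field names and the has_robust_external_validation helper are assumptions for this example, not part of the researchers’ published methodology.

```python
from dataclasses import dataclass


@dataclass
class StudyDesign:
    """Hypothetical record of the three recommended design features."""
    diagnostic_cohort_design: bool  # cohort design rather than diagnostic case-control
    multiple_institutions: bool     # external data drawn from more than one institution
    prospective_collection: bool    # data gathered prospectively for external validation


def has_robust_external_validation(design: StudyDesign) -> bool:
    # A study would need all three features to count as robust clinical validation.
    return (design.diagnostic_cohort_design
            and design.multiple_institutions
            and design.prospective_collection)


# Example: external test data that is retrospective and from a single institution
example = StudyDesign(diagnostic_cohort_design=True,
                      multiple_institutions=False,
                      prospective_collection=False)
print(has_robust_external_validation(example))  # False
```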
Using data from multiple health systems is also central to becoming clinically ready, Park et al. noted.
The researchers identified studies published in PubMed MEDLINE and Embase between January 2018 and August 17, 2018. They then determined whether each study used external or internal validation. For studies using external validation, they noted whether the data were collected with a diagnostic cohort design rather than a diagnostic case-control design, whether the data came from multiple institutions and whether they were gathered prospectively.
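A minimal sketch of that screening logic, assuming each reviewed study is represented by simple boolean flags (the variable names and the sample records are illustrative, not the authors’ actual data):

```python
# Hypothetical screening tally mirroring the review procedure described above.
studies = [
    {"external_validation": False},
    {"external_validation": True, "diagnostic_cohort": False,
     "multi_institution": True, "prospective": False},
    # ... one entry per reviewed study (516 in the actual review)
]

externally_validated = [s for s in studies if s.get("external_validation")]

robust = [
    s for s in externally_validated
    if s.get("diagnostic_cohort") and s.get("multi_institution") and s.get("prospective")
]

print(f"{len(externally_validated)}/{len(studies)} externally validated")
print(f"{len(robust)} met all three recommended design features")
```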
“Our results reveal that most recently published studies reporting the performance of AI algorithms for diagnostic analysis of medical images did not have design features that are recommended for robust validation of the clinical performance of AI algorithms, confirming the worries that premier journals have recently raised,” the authors wrote.
Park and colleagues noted that some studies did not intend to test the real-world readiness of AI algorithms but were instead designed to determine the technical feasibility of a method. Studies merely testing technical feasibility, therefore, were not necessarily poorly designed.
In the future, radiologists and researchers should take it upon themselves to distinguish between proof-of-concept studies and those meant to validate the clinical performance of an AI platform, the researchers wrote.