Researchers cite safety concerns after uncovering 'harmful behavior' of fracture-detecting AI model
Researchers are cautioning overly optimistic AI enthusiasts after a recent algorithmic audit revealed potentially “harmful behavior” in a validated model designed to detect femoral fractures.
A study in The Lancet Digital Health reports that a previously validated, high-performing AI model made troubling errors when confronted with atypical anatomy while searching for subtle proximal femur fractures. Despite the model’s exceptional performance on external validation, the researchers’ preclinical evaluation revealed barriers that would prevent the algorithm from being deployed safely in clinical practice. The experts involved acknowledged that this is a common obstacle when transitioning artificial intelligence systems into everyday, real-world clinical practice.
“Historically, computer-aided diagnosis systems have often performed unexpectedly poorly in the clinical setting despite promising preclinical evaluations, a concept known as the implementation gap,” corresponding author Lauren Oakden-Rayner, MBBS, with the Australian Institute for Machine Learning in Adelaide, Australia, and co-authors explained. “Few preclinical artificial intelligence research studies have addressed these concerns; for example, external validation—an assessment of the ability of a model to generalize to new environments—has only been done in around a third of studies.”
Algorithmic auditing has been proposed as one way to overcome the problem of computer-aided systems that underperform in clinical settings. An audit can identify, and help mitigate, issues that would prevent an algorithm from maintaining its diagnostic performance once deployed. That’s exactly what the researchers did in their preclinical evaluation of a previously validated deep learning model developed to detect proximal femur fractures on frontal x-rays in emergency settings.
The researchers conducted a reader study comparing the model’s performance with that of five radiologists on a dataset of 200 fracture cases and 200 non-fracture cases. The model was also tested on an external validation dataset before the team ran an algorithmic audit to detect any unusual model behavior.
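In a study like this, the model and the human readers are typically compared on the area under the ROC curve (AUC), which summarizes how well a score separates fracture from non-fracture cases across all operating thresholds. As a rough illustration only, here is a minimal sketch of how an AUC is computed with scikit-learn, using invented labels and scores rather than the study’s data:

# Minimal sketch of an AUC calculation with scikit-learn; the labels and
# scores below are invented for illustration, not data from the study.
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]          # 1 = fracture, 0 = no fracture
model_scores = [0.97, 0.88, 0.91, 0.10, 0.22, 0.05, 0.76, 0.30]

print(f"AUC: {roc_auc_score(y_true, model_scores):.3f}")
# An AUC of 1.0 means the model ranks every fracture above every non-fracture
# case; 0.5 is chance-level performance.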
In the reader study, the model achieved an AUC of 0.994 compared with 0.969 for the radiologists. The model also performed well on the external validation dataset, with an AUC of 0.980. However, preclinical testing revealed that the model struggled when presented with abnormal bones, such as those affected by Paget’s disease. Its error rate rose on these cases, prompting the researchers to question the model’s safety in a clinical setting.
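This kind of failure is exactly what a subgroup analysis within an algorithmic audit is meant to surface: an overall AUC can look excellent while errors cluster in a small group of atypical cases. Below is a minimal sketch of such a check, assuming a hypothetical per-case flag such as "abnormal_bone" that is not part of the published study:

# Minimal sketch of a subgroup error audit (illustrative only; field names
# such as "abnormal_bone" are hypothetical and not from the published study).
from collections import defaultdict

def subgroup_error_rates(cases):
    """cases: iterable of dicts with keys 'label', 'prediction', 'subgroup'."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for case in cases:
        totals[case["subgroup"]] += 1
        if case["prediction"] != case["label"]:
            errors[case["subgroup"]] += 1
    return {group: errors[group] / totals[group] for group in totals}

# Example: compare error rates for normal vs. atypical anatomy.
cases = [
    {"label": 1, "prediction": 1, "subgroup": "normal_bone"},
    {"label": 0, "prediction": 0, "subgroup": "normal_bone"},
    {"label": 1, "prediction": 0, "subgroup": "abnormal_bone"},  # missed fracture
    {"label": 0, "prediction": 1, "subgroup": "abnormal_bone"},  # false alarm
]
print(subgroup_error_rates(cases))
# A much higher error rate in one subgroup flags behavior that a headline AUC can hide.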
“Given the tendency of artificial intelligence models to behave in unexpected ways (i.e., unlike a human expert would), the inclusion of an algorithmic audit appears to be informative,” the experts said. “Identifying the types of cases an artificial intelligence model fails on might assist in bridging the current gap between apparent high performance in preclinical testing and challenges in the clinical implementation of an artificial intelligence model.”
The researchers concluded that such algorithmic audits are necessary for developing safe clinical testing protocols.
More on artificial intelligence in medical imaging:
AI predicts COVID prognosis at near-expert level using CT scoring system
AI software that triages x-rays for pneumothorax receives FDA clearance
AI assists radiologists in detecting fractures, improves workflow
Misuse of public imaging data is producing 'overly optimistic' results in machine learning research
AI tool achieves excellent agreement for knee OA severity classification