Researchers cite safety concerns after uncovering 'harmful behavior' of fracture-detecting AI model

Researchers are urging caution among AI enthusiasts after a recent algorithmic audit revealed potentially “harmful behavior” in a validated model intended to detect femoral fractures.

A study in The Lancet Digital Health reports that a previously validated, high-performing AI model made troubling errors when confronted with atypical anatomy while searching for subtle proximal femur fractures. Researchers noted that despite the model’s exceptional performance on external validation, its preclinical evaluation revealed barriers that would prevent the algorithm from being deployed safely in clinical practice. Experts involved in the study acknowledged that this is a common obstacle when moving artificial intelligence systems into everyday, real-world clinical practice.

“Historically, computer-aided diagnosis systems have often performed unexpectedly poorly in the clinical setting despite promising preclinical evaluations, a concept known as the implementation gap,” corresponding author Lauren Oakden-Rayner, MBBS, with the Australian Institute for Machine Learning in Adelaide, Australia, and co-authors explained. “Few preclinical artificial intelligence research studies have addressed these concerns; for example, external validation—an assessment of the ability of a model to generalize to new environments—has only been done in around a third of studies.” 

Algorithmic auditing has been proposed as a possible solution to the problem of computer-aided systems performing poorly in clinical settings. Auditing can identify and mitigate issues that would prevent an algorithm from maintaining its diagnostic performance. That’s exactly what the researchers set out to do in their preclinical evaluation of a previously validated deep learning model developed to detect proximal femur fractures on frontal X-ray films in emergency settings.

The researchers conducted a reader study comparing the model’s performance to that of five radiologists. The dataset contained 200 fracture cases and 200 non-fracture cases. The team also tested the model on an external validation dataset before conducting an algorithmic audit to detect any unusual model behavior.

In the reader study, the model’s AUC was 0.994 compared to 0.969 for the radiologists. The model also performed well on the external validation dataset, achieving an AUC of 0.980. However, in preclinical testing, the model ran into trouble when presented with cases of abnormal bones, such as those seen in Paget’s disease. These cases produced a higher error rate and led the researchers to question the model’s safety in a clinical setting.

“Given the tendency of artificial intelligence models to behave in unexpected ways (i.e., unlike a human expert would), the inclusion of an algorithmic audit appears to be informative,” the experts said. “Identifying the types of cases an artificial intelligence model fails on might assist in bridging the current gap between apparent high performance in preclinical testing and challenges in the clinical implementation of an artificial intelligence model.” 

The researchers concluded by suggesting that these algorithmic audits are necessary to develop safe clinical testing protocols. 

In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She joined Innovate Healthcare in 2021 and has since put her unique expertise to use in her editorial role with Health Imaging.
