GPT-4 confidently struggles on radiology exam

The ability of ChatGPT to provide accurate medical information is well documented by now, but if the popular large language model’s most recent radiology exam scores are to be believed, it still cannot hold a candle to human medical professionals.  

A new paper in Academic Radiology details ChatGPT’s struggle to compete with students on the American College of Radiology’s (ACR) Diagnostic Radiology In-Training Examination (DXIT). According to the analysis, the chatbot was not lacking in confidence, though it was in accuracy.  

Researchers from Stony Brook University Hospital and the Northwestern University Feinberg School of Medicine recently put ChatGPT through a series of experiments, including both text- and image-based assessments, across different time points and with additional training to examine how the large language model could adapt to the new information it was exposed to. 

On the DXIT, GPT-4 achieved an overall accuracy of 58.5%, which was lower than that of third-year post-graduate radiology students but slightly higher than that of second-year students.  

Image-based questions were significantly more challenging for GPT-4 to answer, despite the latest version of the large language model now being capable of accepting image prompts. It yielded an accuracy of 45.4% on image-based prompts—significantly lower than the 80% accuracy it achieved on text-based prompts. 

When the researchers repeated their experiment at different time intervals and refined their prompts, GPT-4 still struggled to keep up with students' performance. Even after prompt fine-tuning, its accuracy did not improve. In fact, when questions were repeated, the model changed its answer more than 25% of the time, with no gain in accuracy. 

GPT-4 did accurately diagnose numerous critical conditions, but it failed to identify several fatal ones, such as a ruptured aortic aneurysm. Despite its fluctuating accuracy, the large language model expressed high confidence in over 80% of its answers, regardless of whether they were correct. 

The study’s corresponding author David L. Payne, chief radiology resident at Stony Brook University Hospital in New York, and colleagues advised that their findings suggest similar large language models, though potentially very useful, will “require ongoing monitoring to ensure their reliability in clinical settings.” 

“Clinical implementers of general (and narrow) AI radiology systems should exercise caution given the possibility of spurious yet confident responses as well as a high degree of output variability with identical inputs across time.” 


In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She joined Innovate Healthcare in 2021 and has since put her unique expertise to use in her editorial role with Health Imaging.
