Fine-tuned LLMs show great promise for spotting radiology report errors

Large language models can be fine-tuned to significantly improve error detection in radiology reports, according to new work published in the journal Radiology. 

LLMs are believed to have enormous potential in imaging settings, especially within the realm of radiology reports. Prior research has described their promise across numerous aspects of reporting, including streamlining the creation of structured templates, rewording impressions to make them more patient-friendly and flagging actionable findings that warrant the attention of referrers. 

This latest research highlights the potential for LLMs to detect errors in radiology reports, too. Whatever their cause (issues with voice recognition software used for dictation, varying perceptual and interpretive processes, cognitive biases, etc.), report errors can be consequential for patient care. Some result in incorrect or delayed diagnoses and can significantly impact disease management. 

While there are currently no commercially available LLMs tailored to radiology, several popular models, such as ChatGPT, have proven themselves to be consistently accurate in providing medical information. However, experts involved in this latest study believe that adjustments to data used to train LLMs can improve their utility in medical settings, particularly in radiology. 

“Initially, LLMs are trained on large-scale public data to learn general language patterns and knowledge,” study senior author Yifan Peng, PhD, from the Department of Population Health Sciences at Weill Cornell Medicine in New York City, and colleagues explained. “Fine-tuning occurs as the next step, where the model undergoes additional training using smaller, targeted datasets relevant to particular tasks.” 

For their study, the team sought to train several LLMs, including Llama-3 (Meta AI), GPT-4 (OpenAI) and BiomedBERT, using a collection of radiology reports. The models were tested on two datasets: one containing 1,656 synthetic reports (828 error-free and 828 with errors), and another with 614 reports (307 error-free real reports from MIMIC-CXR and 307 synthetic reports with errors). The synthetic reports supplemented the training data, giving the models the larger dataset they need to learn from. 

The LLMs were evaluated using zero-shot prompting, few-shot prompting or fine-tuning strategies and tasked with identifying errors and categorizing them into four types: negation, left/right, interval change and transcription. The team found that fine-tuning the models significantly improved their error detection and classification performance, with the Llama-3-70B-Instruct model yielding the most accurate results.  
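To make the difference between zero-shot and few-shot prompting concrete, here is a minimal sketch in Python. It is not the study's actual prompt: the wording, the `build_prompt` helper and the example report sentences are all invented for illustration. Only the four error categories come from the article.

```python
from typing import List, Optional, Tuple

# The four error categories named in the article.
ERROR_TYPES = ["negation", "left/right", "interval change", "transcription"]


def build_prompt(report: str,
                 examples: Optional[List[Tuple[str, str]]] = None) -> str:
    """Hypothetical helper: zero-shot when `examples` is empty, few-shot otherwise."""
    lines = [
        "You are proofreading a radiology report.",
        "Answer with one label: 'no error' or one of: "
        + ", ".join(ERROR_TYPES) + ".",
        "",
    ]
    # Few-shot prompting prepends labeled examples before the report to check.
    for example_report, label in examples or []:
        lines += [f"Report: {example_report}", f"Label: {label}", ""]
    lines += [f"Report: {report}", "Label:"]
    return "\n".join(lines)


# Invented report text, used only to show the two prompt shapes.
target = "No pleural effusion. Right lung is clear."
zero_shot = build_prompt(target)
few_shot = build_prompt(
    target,
    examples=[
        ("Left-sided chest tube in the right pleural space.", "left/right"),
        ("Heart size is normal. No pneumothorax.", "no error"),
    ],
)
```

Fine-tuning, by contrast, does not change the prompt. It updates the model's weights on labeled reports, so the model needs fewer or no in-prompt examples.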

During a real-world evaluation, two radiologists reviewed 200 randomly selected reports output by the model. Of those, half were confirmed by both radiologists to contain errors, while for 163 the model agreed with at least one of the human readers. 
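As a quick worked example, the share of reviewed reports on which the model agreed with at least one radiologist can be computed from the counts above. The `agreement_rate` function name is an assumption made for this sketch; the counts (163 of 200) are from the article.

```python
def agreement_rate(agreed: int, reviewed: int) -> float:
    """Fraction of reviewed reports where the model agreed with a reader."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive")
    return agreed / reviewed


# Counts reported in the article: 163 of 200 reviewed reports.
rate = agreement_rate(163, 200)
print(f"{rate:.1%}")
```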

“The LLM that was fine-tuned on both MIMIC-CXR and synthetic reports demonstrated strong performance in the error detection tasks,” the group noted. “It meets our expectations and highlights the potential for developing lightweight, fine-tuned LLM specifically for medical proofreading applications. Our goal is to develop transparent and understandable models that radiologists can confidently trust and fully embrace.” 


Hannah Murphy

In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She began covering the medical imaging industry for Innovate Healthcare in 2021.
