Fine-tuned LLMs show great promise for spotting radiology report errors
Large language models can be fine-tuned to significantly improve error detection in radiology reports, according to new work published in the journal Radiology.
LLMs are believed to have enormous potential in imaging settings, especially within the realm of radiology reports. Research into their utility has highlighted promise across numerous aspects of reporting, including streamlining the creation of structured templates, rewording impressions to make them more patient-friendly and flagging actionable findings that warrant the attention of referrers.
This latest research highlights the potential for LLMs to correct errors in radiology reports, too. Whatever causes a report error, whether issues with the voice recognition software used for dictation, varying perceptual and interpretive processes or cognitive biases, the mistake can be consequential for patient care. Some errors result in incorrect or delayed diagnoses and can significantly impact disease management.
While there are currently no commercially available LLMs tailored to radiology, several popular models, such as ChatGPT, have proven consistently accurate at providing medical information. However, the experts involved in this latest study believe that adjusting the data used to train LLMs can improve their utility in medical settings, particularly radiology.
“Initially, LLMs are trained on large-scale public data to learn general language patterns and knowledge,” study senior author Yifan Peng, PhD, from the Department of Population Health Sciences at Weill Cornell Medicine in New York City, and colleagues explained. “Fine-tuning occurs as the next step, where the model undergoes additional training using smaller, targeted datasets relevant to particular tasks.”
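To make that two-step idea concrete, the sketch below shows what the fine-tuning stage can look like in practice for a pretrained encoder such as BiomedBERT, one of the models used in the study. This is an illustrative example only, not the authors' code: the checkpoint name, labels, example reports and hyperparameters are all assumptions, and it uses the Hugging Face Transformers library for the additional, task-specific training step the researchers describe.

```python
# Minimal sketch of fine-tuning a pretrained model on a small, targeted dataset.
# Not the study's implementation; model name, data and settings are illustrative.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"  # assumed checkpoint

# Tiny illustrative dataset: report text paired with a binary error label.
examples = {
    "text": [
        "No focal consolidation. Impression: pneumonia in the right lower lobe.",  # findings contradict impression
        "Heart size is normal. No pleural effusion. Impression: no acute disease.",
    ],
    "label": [1, 0],  # 1 = contains an error, 0 = error-free
}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = Dataset.from_dict(examples).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="proofread-ft", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
)
trainer.train()  # the additional training on task-specific data described above
```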
For their study, the team sought to train several LLMs, including Llama-3 (Meta AI), GPT-4 (OpenAI) and BiomedBERT, using a collection of radiology reports. The models were tested on two datasets—one containing 1,656 synthetic reports (828 error-free and 828 with errors), and another with 614 reports (307 error-free real reports from MIMIC-CXR and 307 synthetic reports with errors). The synthetic reports were used to expand the training data, since the models require large volumes of labeled examples to learn from.
The LLMs were applied using zero-shot prompting, few-shot prompting or fine-tuning strategies and then tasked with identifying errors and categorizing them as negation, left/right, interval change or transcription errors. The team found that fine-tuning the models significantly improved their error detection and classification performance, with the Llama-3-70B-Instruct model yielding the most accurate results.
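For readers unfamiliar with the prompting strategies mentioned, the snippet below is a minimal sketch of what zero-shot error classification might look like, using the OpenAI Python client with GPT-4, one of the models evaluated. The prompt wording and the example report are assumptions for illustration; the study's actual prompts are not reproduced here.

```python
# Sketch of zero-shot prompting for report proofreading (illustrative, not the study's prompts).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

report = (
    "FINDINGS: There is a small left pleural effusion. "
    "IMPRESSION: Small right pleural effusion, unchanged."
)

prompt = (
    "You are proofreading a radiology report. Decide whether it contains an error "
    "and, if so, classify it as one of: negation, left/right, interval change, "
    "transcription. Answer with the category or 'no error'.\n\nReport:\n" + report
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected here: a left/right error
```

Few-shot prompting would differ only in prepending a handful of labeled example reports to the prompt, while fine-tuning updates the model's weights as in the earlier sketch.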
During a real-world evaluation, two radiologists reviewed 200 randomly selected reports processed by the model. Of those, half were confirmed to contain errors by both radiologists, while the model's assessments for 163 reports agreed with at least one of the human readers.
“The LLM that was fine-tuned on both MIMIC-CXR and synthetic reports demonstrated strong performance in the error detection tasks,” the group noted. “It meets our expectations and highlights the potential for developing lightweight, fine-tuned LLM[s] specifically for medical proofreading applications. Our goal is to develop transparent and understandable models that radiologists can confidently trust and fully embrace.”
Learn more about the team’s findings here.