How GPT-4 can improve radiology resident report feedback
Large language models could be leveraged to help provide feedback on radiology residents’ preliminary reports during independent call.
With resources stretched thin at many facilities, this type of feedback can often be limited while residents are on call. The growing capabilities of large language models, like OpenAI’s ChatGPT, could help address the issue by identifying and flagging residents’ missed diagnoses, authors of a new paper in Current Problems in Diagnostic Radiology report.
“Methods for automatically identifying discrepancies between preliminary and final reports have been proposed, primarily using a combination of simple text discrepancy identifiers and standard review macros,” Wasif Bala, MD, with the Department of Radiology and Imaging Sciences at Emory University Hospital, and colleagues note. “Emerging technologies in natural language processing may build on this prior work, enabling new avenues for enhancing feedback.”
To test the feasibility of using LLMs to identify discrepancies between final reports and radiology residents’ preliminary reports, the team input 250 report pairs (residents’ preliminary and attendings’ final reports) into the GPT-4 API. The LLM was prompted to identify important findings that were present in the attendings’ final reports but absent from the residents’ preliminary reports.
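The paper does not publish the exact prompt, but the basic workflow of pairing the two reports and asking the model to list missed findings can be sketched with the OpenAI Python client. The prompt wording, function name, and model identifier below are illustrative assumptions, not the authors' actual setup:

```python
def build_discrepancy_prompt(preliminary: str, final: str) -> str:
    """Assemble a prompt asking the model to list important findings that
    appear in the attending's final report but are missing from the
    resident's preliminary report. Wording is illustrative only."""
    return (
        "You are reviewing paired radiology reports.\n"
        "List any important findings that appear in the FINAL report "
        "but are missing from the PRELIMINARY report. "
        "If there are none, reply 'No discrepancies.'\n\n"
        f"PRELIMINARY REPORT:\n{preliminary}\n\n"
        f"FINAL REPORT:\n{final}\n"
    )

# Sending the prompt to the GPT-4 API would require the `openai` package
# and an API key; shown as a comment so the sketch stays self-contained:
#
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(
#       model="gpt-4",
#       messages=[{"role": "user",
#                  "content": build_discrepancy_prompt(prelim, final)}],
#   )
#   print(response.choices[0].message.content)

prelim = "Lungs clear. No acute osseous abnormality."
final = ("Lungs clear. Subtle nondisplaced right rib fracture. "
         "No other acute osseous abnormality.")
print(build_discrepancy_prompt(prelim, final))
```

In the study, one such prompt would be issued per report pair, with the model's answer flagged back to the resident as feedback.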
GPT-4 identified 24 findings missed by residents, achieving nearly 80% accuracy. Further, the model did so across multiple report formats and variations in phrasing.
Residents who received feedback from the LLM reported mostly positive experiences, rating the model’s notes 3.5 and 3.64 out of 5 for satisfaction and perceived accuracy, respectively. Nearly two-thirds preferred integrating LLM feedback with traditional report feedback.
While numerous issues still need to be addressed before LLMs can be reliably deployed in resident training, such as the models' tendency to flag surface-level textual variations rather than true semantic differences between reports, the authors remain optimistic about the technology's future potential.
“The advent of large language models has opened new frontiers for providing effective feedback to medical trainees, enabling the identification of clinically meaningful errors in preliminary radiology reports,” the group writes. “These results highlight both the promising role of LLMs in augmenting traditional feedback mechanisms and the need for further refinement and evaluation of these models prior to widespread adoption.”