How GPT-4 can improve radiology resident report feedback
Large language models could be leveraged to help provide feedback on radiology residents’ preliminary reports during independent call.
With resources stretched thin at many facilities, this type of feedback can often be limited while residents are on call. The growing capabilities of large language models, like OpenAI’s ChatGPT, could help address the issue by identifying and flagging residents’ missed diagnoses, authors of a new paper in Current Problems in Diagnostic Radiology report.
“Methods for automatically identifying discrepancies between preliminary and final reports have been proposed, primarily using a combination of simple text discrepancy identifiers and standard review macros,” Wasif Bala, MD, with the Department of Radiology and Imaging Sciences at Emory University Hospital, and colleagues note. “Emerging technologies in natural language processing may build on this prior work, enabling new avenues for enhancing feedback.”
To test the feasibility of using LLMs to identify discrepancies between final reports and radiology residents’ preliminary reports, the team input 250 report pairs (residents’ preliminary and attendings’ final reports) into the GPT-4 API. The LLM was prompted to identify important findings that were present in the attendings’ final reports but absent from the residents’ preliminary reports.
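The paper does not publish the exact prompt, but the basic workflow of pairing the two reports and asking the model to list missed findings can be sketched with the OpenAI Python client. The prompt wording, function name, and model identifier below are illustrative assumptions, not the authors' actual setup:

```python
def build_discrepancy_prompt(preliminary: str, final: str) -> str:
    """Assemble a prompt asking the model to list important findings that
    appear in the attending's final report but are missing from the
    resident's preliminary report. Wording is illustrative only."""
    return (
        "You are reviewing paired radiology reports.\n"
        "List any important findings that appear in the FINAL report "
        "but are missing from the PRELIMINARY report. "
        "If there are none, reply 'No discrepancies.'\n\n"
        f"PRELIMINARY REPORT:\n{preliminary}\n\n"
        f"FINAL REPORT:\n{final}\n"
    )

# Sending the prompt to the GPT-4 API would require the `openai` package
# and an API key; shown as a comment so the sketch stays self-contained:
#
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(
#       model="gpt-4",
#       messages=[{"role": "user",
#                  "content": build_discrepancy_prompt(prelim, final)}],
#   )
#   print(response.choices[0].message.content)

prelim = "Lungs clear. No acute osseous abnormality."
final = ("Lungs clear. Subtle nondisplaced right rib fracture. "
         "No other acute osseous abnormality.")
print(build_discrepancy_prompt(prelim, final))
```

In the study, one such prompt would be issued per report pair, with the model's answer flagged back to the resident as feedback.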
GPT-4 identified 24 findings missed by residents, achieving nearly 80% accuracy. Further, the model did so across multiple report formats and variations in phrasing.
Residents who received feedback from the LLM reported mostly positive experiences, rating the model’s notes 3.5 and 3.64 out of 5 for satisfaction and perceived accuracy, respectively. Nearly two-thirds preferred integrating LLM feedback with traditional report feedback.
While numerous issues still need to be addressed before LLMs can be reliably deployed in resident training, such as the models' tendency to flag surface-level textual variations rather than true semantic differences between reports, the authors remain optimistic about the technology's future potential.
“The advent of large language models has opened new frontiers for providing effective feedback to medical trainees, enabling the identification of clinically meaningful errors in preliminary radiology reports,” the group writes. “These results highlight both the promising role of LLMs in augmenting traditional feedback mechanisms and the need for further refinement and evaluation of these models prior to widespread adoption.”