GPT-4 can proofread radiology reports for a penny apiece
A popular large language model recently proved itself an efficient proofreader of CT reports, flagging and revising errors that had previously been missed.
GPT-4 is one of OpenAI’s newer large language models and powers the latest version of ChatGPT. Given the large amounts of data it has been trained on, numerous radiology stakeholders have expressed interest in GPT-4's potential for tasks like generating impressions, data mining, formatting, labeling and error detection. More recently, experts tested the model’s ability to spot and correct errors within radiology reports. They detailed their findings Tuesday in Radiology.
“The increasing demand for imaging has led to a greater workload for radiologists, resulting in burnout and an increase in errors on reports. The consequences of these errors, which can potentially mislead physicians and negatively impact patient care, have prompted efforts to systematically classify and reduce report errors,” Dukyong Yoon, with the Institute for Innovation in Digital Healthcare at Severance Hospital in Seoul, Republic of Korea, and colleagues explained.
The team tasked GPT-4 with identifying errors in over 10,000 radiology reports for head CT scans. For each error it flagged, the model was prompted to classify it as interpretive or factual, explain its reasoning and suggest a revision.
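The article does not reproduce the prompts the authors used, but the workflow it describes maps naturally onto a single chat-completion call per report. The sketch below is a hypothetical illustration using the OpenAI Python client; the prompt wording, model identifier and output format are assumptions, not the study's protocol.

```python
# Hypothetical sketch of the proofreading workflow described above.
# Prompt wording, model name and output format are assumptions, not the
# prompts used in the Radiology study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROOFREAD_PROMPT = (
    "You are proofreading a head CT radiology report. "
    "1) List any errors you find. "
    "2) Classify each as 'interpretive' or 'factual'. "
    "3) Briefly explain your reasoning. "
    "4) Provide a corrected version of each affected sentence."
)

def proofread_report(report_text: str) -> str:
    """Ask the model to flag, classify, explain and revise report errors."""
    response = client.chat.completions.create(
        model="gpt-4",      # assumed model identifier
        temperature=0,      # deterministic output suits proofreading
        messages=[
            {"role": "system", "content": PROOFREAD_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content

print(proofread_report("Impression: No acute intracranial hemorrage."))
```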
GPT-4's performance was described as “commendable.” It yielded 84% sensitivity for identifying interpretive errors and 89% for spotting factual mistakes. However, its sensitivity declined as the number of impressions in each report grew, and the model struggled to prioritize clinically significant findings.
Compared to the LLM, human readers detected fewer factual errors, and they also took up to seven times longer to review the reports. GPT-4 took approximately 15.5 seconds to assess a report of roughly 300 words. The group estimated that it would cost less than 1 cent per report to use the LLM as a proofreader.
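For a rough sense of how that sub-cent figure could arise, here is a back-of-the-envelope estimate. The per-token prices and output length below are assumptions for illustration only; the article does not say which pricing tier the authors based their estimate on.

```python
# Back-of-the-envelope cost estimate; the per-token prices and output length
# below are assumptions for illustration, not figures from the study.
WORDS_PER_REPORT = 300
TOKENS_PER_WORD = 1.3          # common rule of thumb for English text
INPUT_PRICE_PER_1K = 0.01      # assumed USD per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.03     # assumed USD per 1K output tokens
OUTPUT_TOKENS = 150            # assumed short proofreading response

input_tokens = WORDS_PER_REPORT * TOKENS_PER_WORD  # ~390 tokens
cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K \
     + (OUTPUT_TOKENS / 1000) * OUTPUT_PRICE_PER_1K
print(f"Estimated cost per report: ${cost:.4f}")   # roughly $0.008
```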
“Detecting interpretive errors may be easier for humans. However, factual errors can occur without disrupting the coherence of the surrounding sentences and often result from minor character changes. As a result, even experienced physicians may have a lower detection sensitivity for these factual errors,” the team noted. “In contrast, GPT-4 is highly efficient at detecting factual errors, doing so significantly faster because of its ability to process large data volumes without fatigue or introducing psychologic biases.”
The authors suggested that experimenting with different prompting techniques could help address GPT-4's shortcomings when reports contain multiple impressions.
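As one illustration of what such a prompt variation might look like, the hypothetical sketch below splits the impression section into individual numbered statements and checks each one in a separate call, reusing the proofread_report helper from the earlier sketch. This is a speculative mitigation, not a technique reported in the study.

```python
# Hypothetical prompt variation: check one impression at a time rather than
# the whole report, trading extra API calls for a simpler per-call task.
import re

def split_impressions(impression_section: str) -> list[str]:
    """Split a numbered impression section (e.g. '1. ... 2. ...') into items."""
    items = re.split(r"\s*\d+\.\s+", impression_section)
    return [item.strip() for item in items if item.strip()]

def proofread_each_impression(findings: str, impression_section: str) -> list[str]:
    """Proofread each impression statement separately against the findings."""
    results = []
    for item in split_impressions(impression_section):
        prompt = (
            "Given these CT findings:\n"
            f"{findings}\n\n"
            "Check this single impression statement for interpretive or "
            f"factual errors and suggest a correction if needed:\n{item}"
        )
        results.append(proofread_report(prompt))  # helper from the earlier sketch
    return results
```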