AI can generate usable radiology reports, but it requires a human touch
New research out of Japan shows that generative pre-trained transformers (GPT) – the underlying AI technology behind ChatGPT – may be able to automate the creation of usable radiology reports. However, "hallucinations" mean the technology isn’t reliable without human oversight. The study was published in the Japanese Journal of Radiology. [1]
A group of researchers from Kumamoto University assessed the performance of GPT in generating radiology reports. Using patient-specific information such as age, gender, disease site and imaging findings, the researchers created radiology reports with multiple iterations of the large language model (LLM) – GPT-2, GPT-3.5 and GPT-4. The reports generated by GPT were then compared to those created by board-certified radiologists.
The reports were created using data from 28 patients who had previously undergone CT scans and had been diagnosed with diseases characterized by typical imaging findings.
Measuring radiology report accuracy
The study's evaluation focused on top-1 and top-5 differential diagnosis accuracy, as well as mean average precision (MAP). Radiologists achieved the highest score in all three categories; a brief sketch of how such metrics are typically computed follows the list below.
- Top-1: Radiologists scored 1.00 for top-1 differential diagnosis accuracy, followed by GPT-4 and GPT-3.5 (0.54 each), with GPT-2 coming in last (0.21).
- Top-5: Radiologists again scored 1.00, but this time GPT was much closer: GPT-4 scored 0.96, GPT-3.5 scored 0.89, and GPT-2 scored 0.50 for top-5 differential diagnosis accuracy.
- MAP: Radiologists scored 0.97 for MAP on the differential diagnosis, followed by GPT-4, GPT-3.5, and GPT-2, in that order (0.54, 0.45, and 0.26, respectively).
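For readers unfamiliar with these metrics, the sketch below illustrates how top-k accuracy and MAP are commonly computed for ranked lists of differential diagnoses. It is a minimal illustration, not the study's code, and the two cases in it are hypothetical examples rather than study data.

```python
# Minimal sketch (not from the study): computing top-k accuracy and mean
# average precision (MAP) for ranked differential diagnoses.
# Each case is (correct diagnosis, model's ranked top-5 differential diagnoses);
# the cases below are hypothetical.

def top_k_accuracy(cases, k):
    """Fraction of cases whose correct diagnosis appears in the top-k list."""
    hits = sum(1 for truth, ranked in cases if truth in ranked[:k])
    return hits / len(cases)

def mean_average_precision(cases):
    """MAP with one correct diagnosis per case: average of 1/rank,
    where rank is the 1-based position of the correct diagnosis (0 if absent)."""
    precisions = []
    for truth, ranked in cases:
        if truth in ranked:
            precisions.append(1.0 / (ranked.index(truth) + 1))
        else:
            precisions.append(0.0)
    return sum(precisions) / len(precisions)

cases = [
    ("acute appendicitis", ["acute appendicitis", "diverticulitis", "colitis", "ileus", "abscess"]),
    ("pulmonary embolism", ["pneumonia", "pulmonary embolism", "atelectasis", "effusion", "edema"]),
]

print(top_k_accuracy(cases, 1))        # 0.5  (only the first case is correct at rank 1)
print(top_k_accuracy(cases, 5))        # 1.0  (both correct diagnoses appear in the top 5)
print(mean_average_precision(cases))   # 0.75 (mean of 1/1 and 1/2)
```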
Measuring radiology report quality
To gauge the quality of the generated reports, two board-certified radiologists assessed various aspects of each report, including grammar and readability, image findings, impression, differential diagnosis and overall quality, all using a 4-point scale.
The results showed no significant differences between radiologists and the GPT-3.5 or GPT-4 generated reports in qualitative scores for grammar and readability, image findings, and overall quality (p > 0.05). However, the GPT models' qualitative scores for impression and differential diagnosis were significantly lower than those of radiologists (p < 0.05).
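The excerpt above does not name the statistical test behind these p-values. Purely as an illustration of how two sets of 4-point qualitative scores might be compared, the sketch below assumes a nonparametric Mann-Whitney U test and uses hypothetical scores; it is not the study's analysis.

```python
# Rough illustration only: the study excerpt does not specify its statistical
# test, so this assumes a Mann-Whitney U test for comparing two groups of
# 4-point qualitative scores. All scores below are hypothetical.
from scipy.stats import mannwhitneyu

radiologist_scores = [4, 4, 3, 4, 4, 3, 4, 4]   # hypothetical "impression" scores
gpt_scores         = [3, 2, 3, 3, 2, 3, 2, 3]   # hypothetical GPT-generated scores

stat, p_value = mannwhitneyu(radiologist_scores, gpt_scores, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")  # p < 0.05 would indicate a significant difference
```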
Additional insights and study limitations
The findings provide further insight into the potential and limitations of using GPT and other LLMs to automate report writing. Despite GPT's commendable performance in some respects, the study underscores the need for clinical experts to remain involved in any report generation.
“Our findings reveal that, although GPT-3.5 and GPT-4 are commendable in their ability to generate readable and appropriate ‘image findings’ and ‘Top-5 differential diagnoses’ from very limited information, they fall short in the accuracy of impressions and differential diagnoses even for very basic and representative findings of CT,” lead author Takeshi Nakaura, MD, PhD, and colleagues wrote. “Consequently, these results underscore the continued importance of radiologist involvement in the validation and interpretation of radiology reports generated by GPT models.”
Moreover, the study highlights the issue of "hallucinations" in the generated content, where the models may produce findings not directly linked to the input information. This phenomenon is seen as a potential limitation, and the need for strategies to mitigate these hallucinations is emphasized to ensure the accuracy of GPT-generated radiology reports.
“Additionally, the evaluators had differing opinions on whether to consider the output of the GPT as detailed or as a hallucination, which resulted in poor to moderate agreements in qualitative analysis,” the authors wrote. “One evaluator may have appreciated the additional details provided by the GPT, believing that they could potentially enhance the radiology report. On the other hand, other evaluators might have viewed the extra details as hallucinations, as they were not directly related to the input information and could lead to inaccuracies or confusion in the diagnostic process.”
It’s important to note these study findings are still preliminary, and more research is needed to verify the results. ChatGPT and other LLMs have recently been the subject of a wide variety of studies measuring their accuracy and reliability, with largely mixed results.