ChatGPT both passes and fails at translating free-text reports into structured reports
Various versions of OpenAI’s large language model ChatGPT continue to inch their way closer to clinical meaningfulness in radiology, now demonstrating their potential in generating structured reports from free-text dictations.
A new paper in the European Journal of Radiology details the utility of two of ChatGPT’s models, GPT-3.5 and GPT-4, for translating free-text thyroid ultrasound reports into structured reports based on the ACR TI-RADS guidelines. Although the results varied between the two models, with each outperforming the other in different aspects of reporting, the authors posit that similar large language models retain clinical potential, especially for generating structured reports.
“The importance of structured radiology reports has been fully recognized, as they facilitate efficient data extraction and promote collaboration among healthcare professionals,” corresponding author JianQiao Zhou, with the Department of Ultrasound at Ruijin Hospital in Shanghai, China, and co-authors explained. “Despite numerous potential benefits and technical feasibility, the adoption of structured reporting in some countries has remained lukewarm to date. This could be attributed to the presence of many challenges in the process of structuring free-text.”
For the research, the authors compiled 136 free-text thyroid ultrasound reports from 136 patients. In total, 184 nodules were listed in the original reports. The team tasked GPT-3.5 and GPT-4 with generating structured reports from the original versions based on ACR’s TI-RADS guidelines, and had two radiologists review the reports for quality, nodule categorization accuracy and management recommendations.
GPT-3.5 outperformed GPT-4 at creating satisfactory structured reports, generating 202 compared with GPT-4’s 69. However, GPT-4 achieved “superior accuracy” and significantly outperformed GPT-3.5 in categorizing thyroid nodules, and it provided more detailed management recommendations. Both are integral aspects of TI-RADS.
The authors suggested that the differing performances could be attributed to the training parameters used to develop each model. GPT-4’s parameters are more complex, while GPT-3.5 has fewer specifications, potentially making it more adaptive.
To achieve clinical utility, both versions would need to be retrained, but the authors maintain that improving the large language models could pave the way for their use as a support tool for radiologists.