GPT-4 unlocks key insights from free-text radiology reports
Large language models (LLMs) are playing an increasingly significant role in the extraction of valuable information from free-text medical records, offering clinicians a new tool for data-mining clinically relevant insights. New research focused on how well LLMs can extract relevant data related to lung cancer shows they may be pretty good at navigating free-text notes – maybe even better than AI platforms designed specifically for the task.
A new study compared ChatGPT, the popular free chatbot, with GPT-4, the paid upgrade to its underlying model, to assess how effectively each extracts and labels oncologic phenotypes from free-text CT reports related to lung cancer. The results are published in Radiology. [1]
The retrospective study drew on a dataset of 424 free-text CT reports from 424 unique patients, all of whom underwent lung cancer follow-up CT between September 2021 and March 2023. After the LLMs analyzed the CT reports, four participating radiologists, working together, evaluated the results.
The findings showed that both ChatGPT and GPT-4 are mostly successful at extracting and labeling relevant information related to cancer traits, but that GPT-4 outperformed its free-to-consumer counterpart across the board:
- Lesion parameter extraction: GPT-4 achieved an accuracy rate of 98.6% in extracting lesion parameters, surpassing ChatGPT, which achieved an accuracy rate of 84.0%.
- Metastatic disease identification: GPT-4 demonstrated superior accuracy in identifying metastatic disease, achieving a score of 98.1%, compared to ChatGPT's score of 90.3%.
- Oncologic progression labeling: GPT-4 also excelled in generating correct labels for oncologic progression, achieving an F1 score of 0.96, while ChatGPT scored 0.91.
- Oncologic reasoning: GPT-4 received higher Likert scale scores for factual correctness (4.3 vs. 3.9) and accuracy (4.4 vs. 3.3). Importantly, it also had a significantly lower rate of confabulation than ChatGPT (1.7% vs. 13.7%).
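For readers less familiar with the metrics quoted above, the following is a minimal sketch of how accuracy and an F1 score can be computed by comparing model-generated labels against a radiologist reference standard. The labels and function names here are hypothetical illustrations, not data or code from the study, and the study may have used a different averaging scheme across label classes.

```python
def accuracy(y_true, y_pred):
    """Fraction of reports where the model label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive):
    """F1 for a single label: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: five reports labeled by radiologists and by a model.
reference = ["progression", "stable", "progression", "stable", "progression"]
model_out = ["progression", "stable", "stable", "stable", "progression"]

acc = accuracy(reference, model_out)
f1 = f1_score(reference, model_out, positive="progression")
print(f"accuracy={acc:.2f}, F1(progression)={f1:.2f}")
```

In this toy case the model misses one of three progression reports, so recall falls to two-thirds while precision stays perfect. F1 penalizes that imbalance, which is why it is often reported for classification tasks where classes are unevenly distributed.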
The researchers noted that, in this instance, the LLMs outperformed conventional natural language processing (NLP) models, a type of machine learning used by healthcare organizations to extract information not typically entered into an EHR.
“The performance of both models in advanced medical reasoning was also evident in the generation of report-level labels for disease progression (F1 score, 0.86–0.96), a classification task that proved difficult for NLP models in a recent study with F1 scores ranging from 0.67 to 0.73,” lead author Matthias Fink, MD, and the study's co-authors wrote.
For this analysis, the researchers also explored the distribution of metastatic sites in lung cancer patients, revealing that metastases frequently occurred in locations such as lung parenchyma, bone, liver and adrenal glands. This information has the potential to enhance the understanding of disease progression and guide more effective treatment planning. Further, the researchers believe it points to the useful role LLMs may play in developing new care guidelines, though they express some doubt given the technology is in its infancy.
“These results stress the potential of prompt-learning large language models (LLMs) to facilitate effortless information retrieval from free-text radiology reports without requiring expert knowledge in language model development or task-specific fine tuning. However, it remains to be seen whether third-party applications, such as GPT-4, are appropriate for handling potentially sensitive data in healthcare,” the authors wrote. “With the emergence of competing open-source LLMs in the future, there is potential to integrate such models into a protected infrastructure within a hospital to enable scalable mining pipelines for radiologic data.”