GPT-4 unlocks key insights from free-text radiology reports

Large language models (LLMs) are playing an increasingly significant role in extracting valuable information from free-text medical records, offering clinicians a new tool for mining clinically relevant insights. New research examining how well LLMs extract relevant data related to lung cancer shows they may be quite adept at navigating free-text notes – perhaps even better than AI platforms designed specifically for the task.

A new study involved a comparative analysis of ChatGPT, the popular free chatbot, and GPT-4, the paid upgrade to its underlying model, to assess how effectively each extracts and labels oncologic phenotypes from free-text CT reports related to lung cancer. The results are published in Radiology. [1]

The study began with a comprehensive dataset of 424 free-text CT reports from 424 unique patients. All records in this retrospective study are from patients who underwent lung cancer follow-up CTs between September 2021 and March 2023. After the LLMs completed their tasks of analyzing the CT reports, the outputs were evaluated by four participating radiologists, who together served as the reference standard for the research.

The findings showed that both ChatGPT and GPT-4 are mostly successful at properly extracting and labeling relevant information related to cancer traits, but the latter is superior to its free-to-consumer counterpart across the board at:

  • Lesion parameter extraction: GPT-4 achieved an accuracy rate of 98.6% in extracting lesion parameters, surpassing ChatGPT, which achieved an accuracy rate of 84.0%.
  • Metastatic disease identification: GPT-4 demonstrated superior accuracy in identifying metastatic disease, achieving a score of 98.1%, compared to ChatGPT's score of 90.3%.
  • Oncologic progression labeling: GPT-4 also excelled in generating correct labels for oncologic progression, achieving an F1 score of 0.96, while ChatGPT scored 0.91.
  • Oncologic reasoning: GPT-4 exhibited higher Likert scale scores for factual correctness (4.3 vs. 3.9) and accuracy (4.4 vs. 3.3). Importantly, it had a significantly lower rate of confabulation (1.7% vs. 13.7%) compared to ChatGPT.
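For readers unfamiliar with the metrics above, accuracy and F1 are standard ways to score a model's labels against a reference standard. The following is an illustrative sketch only; the label data is hypothetical and not drawn from the study.

```python
# Illustrative computation of accuracy and F1 for binary labels,
# the metrics reported in the study. All data below is hypothetical.

def accuracy(y_true, y_pred):
    """Fraction of labels the model got right."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical progression labels: 1 = progression, 0 = no progression
truth = [1, 1, 0, 0, 1, 0, 1, 0]
model = [1, 1, 0, 1, 1, 0, 0, 0]

print(round(accuracy(truth, model), 2))  # 0.75
print(round(f1_score(truth, model), 2))  # 0.75
```

F1 is preferred over raw accuracy for tasks like progression labeling because it penalizes both missed positives and false alarms, which matters when the classes are imbalanced.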

The researchers noted that, in this instance, the LLMs actually outperformed conventional natural language processing (NLP) models, the machine-learning approach healthcare organizations have typically used to extract information that is not captured in structured EHR fields.

“The performance of both models in advanced medical reasoning was also evident in the generation of report-level labels for disease progression (F1 score, 0.86–0.96), a classification task that proved difficult for NLP models in a recent study with F1 scores ranging from 0.67 to 0.73,” lead author Matthias Fink, MD, and the study's co-authors wrote. 

For this analysis, the researchers also explored the distribution of metastatic sites in lung cancer patients, revealing that metastases frequently occurred in locations such as lung parenchyma, bone, liver and adrenal glands. This information has the potential to enhance the understanding of disease progression and guide more effective treatment planning. Further, the researchers believe it points to the useful role LLMs may play in developing new care guidelines, though they express some doubt given the technology is in its infancy.

“These results stress the potential of prompt-learning large language models (LLMs) to facilitate effortless information retrieval from free-text radiology reports without requiring expert knowledge in language model development or task-specific fine tuning. However, it remains to be seen whether third-party applications, such as GPT-4, are appropriate for handling potentially sensitive data in healthcare,” the authors wrote. “With the emergence of competing open-source LLMs in the future, there is potential to integrate such models into a protected infrastructure within a hospital to enable scalable mining pipelines for radiologic data.”
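The prompt-based, no-fine-tuning workflow the authors describe could look something like the sketch below. The prompt wording, report text, and helper function are all hypothetical illustrations, and the call to an actual LLM API is deliberately left out.

```python
# Hypothetical sketch of a prompt-based extraction workflow of the kind
# the study describes. The prompt text and CT report are invented
# examples; no specific LLM API is assumed or called here.

EXTRACTION_PROMPT = """You are assisting with oncologic data curation.
From the CT report below, extract:
1. Lesion parameters (location, size in mm).
2. Presence of metastatic disease (yes/no) and sites.
3. Overall oncologic status (progression / stable / response).
Answer in JSON.

Report:
{report}
"""

def build_prompt(report_text: str) -> str:
    """Fill the template with a single free-text CT report."""
    return EXTRACTION_PROMPT.format(report=report_text)

sample_report = (
    "Follow-up chest CT. Spiculated mass in the right upper lobe "
    "measuring 23 mm, previously 18 mm. New 8 mm hypodense hepatic "
    "lesion suspicious for metastasis."
)

print(build_prompt(sample_report))
```

The appeal of this approach, as the authors note, is that it requires no model development or task-specific fine-tuning: the extraction task is specified entirely in the prompt, so the same pipeline could in principle be pointed at any report archive.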

Chad Van Alstin, Health Imaging | Health Exec

Chad is an award-winning writer and editor with over 15 years of experience working in media. He has a decade-long professional background in healthcare, working as a writer and in public relations.
