Large language models not quite ready for cancer staging responsibilities

Although large language models have seen rapid advancement in recent years, they still cannot match human radiologists when it comes to staging cancer from free-text reports.

Cancer staging is usually performed by clinicians who extract relevant information from imaging reports. One challenge is that this information often comes from free-text reports generated by radiologists of varying experience levels, which can introduce variability and inconsistency. Staging can also be time-consuming, depending on the expertise of the provider performing it.

The rapid advancement of LLMs has been suggested as a potential solution to this problem.

“Medical professionals themselves have varying levels of domain expertise and experience,” Yeon Joo Jeong, MD, PhD, from the Department of Radiology at the Research Institute for Convergence of Biomedical Science and Technology in South Korea, and colleagues noted. “LLMs could have utility in helping to extract lung cancer staging information from radiology reports. Such a role for LLMs could have varying benefit depending on the skill level of the medical professional who would otherwise perform the staging assessment.” 

A recent study in the American Journal of Roentgenology evaluated the ability of different versions of OpenAI’s LLM, ChatGPT (GPT-4o, GPT-4 and GPT-3.5), to stage lung cancer using chest CT and FDG PET/CT free-text reports. The resultant stage groups were compared with those of six human readers with varying experience levels: two fellowship-trained radiologists (less experienced than the radiologists who determined the reference standard), two fellows and two residents.
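
To make the task concrete, below is a minimal sketch of how a free-text report could be submitted to an OpenAI model for stage extraction. The prompt wording, the stage_from_report helper and the model settings are illustrative assumptions, not the study’s actual protocol.

```python
# Hypothetical sketch: asking an OpenAI model to assign a stage group
# from a free-text radiology report. The prompt and parameters are
# assumptions for illustration, not the AJR study's methodology.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def stage_from_report(report_text: str, model: str = "gpt-4o") -> str:
    """Return the model's overall stage group for a given report."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output suits a classification-style task
        messages=[
            {
                "role": "system",
                "content": (
                    "You are assisting with lung cancer staging. Given a chest CT "
                    "or FDG PET/CT report, return only the overall stage group "
                    "(e.g., 'IIIA') per the TNM 8th edition."
                ),
            },
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content.strip()


# Example usage with a fabricated one-line report:
# print(stage_from_report("3.2 cm RUL mass; ipsilateral mediastinal nodes; no mets."))
```

Pinning temperature to zero is one common choice when the desired output is a single categorical label rather than open-ended text; the study itself does not prescribe these settings.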

Of the LLMs, GPT-4o had the highest staging accuracy, at 74.1%. It was followed closely by GPT-4, at 70.1%, while GPT-3.5’s performance was notably inferior, at 57.4% accuracy. 

One resident achieved 65.7% accuracy, while the other reached 72.3%. Both the fellowship-trained radiologists and radiology fellows outperformed the LLMs. Accuracy of the fellowship-trained radiologists was recorded at 82.3% and 85.4%, while the fellows reached 75.6% and 77.7%. 

GPT-4o had a tendency to overstage, while the human readers more often understaged cancers. The LLMs did appear to have one advantage over the human readers: their performance did not waver with report variability or exam type, while the humans’ accuracy did. The models also completed staging in less time.

The authors suggested that their findings “do not support the use of LLMs at this time for lung cancer staging in place of trained expert healthcare professionals.” Rather than comment on the future potential of LLMs in cancer staging settings, the group emphasized the importance of domain expertise for performing such complex, specialized tasks.

