Large language models not quite ready for cancer staging responsibilities

Although large language models have advanced rapidly in recent years, they still cannot match human radiologists when it comes to staging cancer from free-text reports.

Cancer staging is usually performed by clinicians who extract relevant information from imaging reports. Because those reports are free text written by radiologists with differing levels of experience, they can vary in content and consistency. Staging can also be a time-consuming process, depending on the expertise of the provider performing it.

Advances in LLMs have been suggested as a potential solution to this problem.

“Medical professionals themselves have varying levels of domain expertise and experience,” Yeon Joo Jeong, MD, PhD, from the Department of Radiology at the Research Institute for Convergence of Biomedical Science and Technology in South Korea, and colleagues noted. “LLMs could have utility in helping to extract lung cancer staging information from radiology reports. Such a role for LLMs could have varying benefit depending on the skill level of the medical professional who would otherwise perform the staging assessment.” 

A recent study in the American Journal of Roentgenology evaluated the ability of three versions of OpenAI’s ChatGPT (GPT-4o, GPT-4 and GPT-3.5) to stage lung cancer using free-text chest CT and FDG PET/CT reports. The resulting stage groups were compared with those of six human readers with varying experience levels: two fellowship-trained radiologists (with less experience than the radiologists who determined the reference standard), two fellows and two residents.

Of the LLMs, GPT-4o had the highest staging accuracy, at 74.1%. It was followed closely by GPT-4, at 70.1%, while GPT-3.5’s performance was notably inferior, at 57.4%.

Both fellowship-trained radiologists and both fellows outperformed the LLMs. The fellowship-trained radiologists recorded accuracies of 82.3% and 85.4%, while the fellows reached 75.6% and 77.7%. One resident achieved 65.7% accuracy; the other reached 72.3%.

GPT-4o tended to overstage cancers, while the human readers more often understaged them. The LLMs did appear to have one advantage over the human readers: their performance did not waver with report variability or exam type, while the humans’ accuracy did. The models also completed staging in less time.

The authors suggested that their findings “do not support the use of LLMs at this time for lung cancer staging in place of trained expert healthcare professionals.” Rather than comment on the future potential of LLMs in cancer staging settings, the group instead emphasized the importance of domain expertise for performing such complex, specialized tasks.

