Large language models not quite ready for cancer staging responsibilities
Although large language models have advanced rapidly in recent years, they still cannot match human radiologists when it comes to staging cancer from free-text reports.
Cancer staging is typically performed by clinicians who extract the relevant information from imaging reports. Because that information often comes from free-text reports written by radiologists of varying experience levels, the reports can be variable and inconsistent. Staging can also be a time-consuming process, depending on the expertise of the provider performing it.
The advancement of LLMs has been suggested as a solution to this problem.
“Medical professionals themselves have varying levels of domain expertise and experience,” Yeon Joo Jeong, MD, PhD, from the Department of Radiology at the Research Institute for Convergence of Biomedical Science and Technology in South Korea, and colleagues noted. “LLMs could have utility in helping to extract lung cancer staging information from radiology reports. Such a role for LLMs could have varying benefit depending on the skill level of the medical professional who would otherwise perform the staging assessment.”
A recent study in the American Journal of Roentgenology evaluated the ability of different versions of OpenAI’s LLM, ChatGPT (GPT-4o, GPT-4 and GPT-3.5), to stage lung cancer using chest CT and FDG PET/CT free-text reports. The resulting stage groups were compared with those of six human readers with varying experience levels: two fellowship-trained radiologists with less experience than the radiologists who established the reference standard, two fellows and two residents.
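For readers curious about what such a workflow could look like in practice, the short Python sketch below shows one way a free-text report might be submitted to an OpenAI chat model to elicit a stage group. It is purely illustrative: the prompt wording, sample report text and model settings are assumptions made for demonstration and are not taken from the study’s actual protocol.

    # Illustrative sketch only: not the study's pipeline or prompt.
    # Assumes the openai Python package (v1.x) and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    # Hypothetical free-text report used for demonstration.
    report_text = (
        "CT chest: 3.2 cm spiculated mass in the right upper lobe. "
        "Enlarged right hilar and subcarinal lymph nodes. "
        "No distant metastases identified."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are assisting with lung cancer staging. Based on the "
                    "free-text imaging report provided, return only the overall "
                    "clinical stage group (e.g., IA, IB, IIA, IIB, IIIA, IIIB, IIIC, IVA, IVB)."
                ),
            },
            {"role": "user", "content": report_text},
        ],
        temperature=0,  # deterministic output is preferable for extraction-style tasks
    )

    # Print the model's proposed stage group.
    print(response.choices[0].message.content)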
Of the LLMs, GPT-4o had the highest staging accuracy, at 74.1%. It was followed closely by GPT-4, at 70.1%, while GPT-3.5’s performance was notably inferior, at 57.4% accuracy.
Both the fellowship-trained radiologists and the radiology fellows outperformed the LLMs, with the former recording accuracies of 82.3% and 85.4% and the latter reaching 75.6% and 77.7%. The residents achieved 65.7% and 72.3% accuracy.
GPT-4o had a tendency to overstage, while the human readers more often understaged cancers. The LLMs did appear to have one advantage over the human readers: their performance did not waver with report variability or exam type, whereas the humans’ accuracy did. The models also completed staging in less time.
The authors suggested that their findings “do not support the use of LLMs at this time for lung cancer staging in place of trained expert healthcare professionals.” Rather than comment on the future potential of LLMs in cancer staging settings, the group emphasized the importance of domain expertise for performing such complex, specialized tasks.
The study abstract is available here.