Experts highlight 'significant concerns' with fluctuating accuracy of popular large language models

The capabilities of large language models in different medical applications have come a long way since the emergence of popular LLM ChatGPT, but has their potential reached the point of clinical utility? 

Not quite yet, according to a new analysis. 

Numerous studies of LLMs have indicated that some are capable of providing expert-level medical advice, simplifying radiology reports and accurately answering patient queries. As these models continue to advance, it is important to understand how their performance holds up over time, since their integration into clinical settings hinges on their ability to be consistent, Mitul Gupta, with the University of Texas at Austin’s Dell Medical School, and co-authors write in the European Journal of Radiology.

“While LLMs demonstrate potential across various medical applications, limited information exists regarding the accuracy of publicly available models like ChatGPT-4, ChatGPT-3.5, Bard, and Claude, specifically within the domain of radiology," the group explains. "These tools, while powerful search and reference aids, require a deeper understanding of their output accuracy, relevance and reliability to inform radiologists and trainees on their potential utility.” 

To get a better idea of how LLMs’ performance holds up over time, the team input questions from the ACR Diagnostic In-Training Exam (DXIT) into several big-name LLMs—GPT-4, GPT-3.5, Claude and Google Bard—monthly from November 2023 to January 2024. The same questions were used for each query to determine whether the models’ accuracy wavered over time.
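For readers who want a concrete sense of that design, the short Python sketch below outlines the repeated-query protocol in general terms. It is not the authors’ code: the model list, question set and query_model() helper are hypothetical placeholders, and real runs would call each vendor’s API.

```python
# Minimal sketch of the repeated-query design described in the study (not the
# authors' actual code). The model list, question set and query_model() are
# hypothetical placeholders; real runs would call each vendor's API.
from collections import defaultdict

MODELS = ["GPT-4", "GPT-3.5", "Claude", "Bard"]   # models evaluated in the study
MONTHS = ["2023-11", "2023-12", "2024-01"]        # monthly query rounds

def query_model(model: str, question: str) -> str:
    """Placeholder: pose the same DXIT-style question to a given LLM and return its answer."""
    raise NotImplementedError("Swap in a real API call for each vendor.")

def run_rounds(questions: dict[str, str]) -> dict:
    """Ask every model the identical question set each month and score accuracy."""
    accuracy = defaultdict(dict)                  # accuracy[model][month] = fraction correct
    for month in MONTHS:
        for model in MODELS:
            correct = sum(query_model(model, q) == key for q, key in questions.items())
            accuracy[model][month] = correct / len(questions)
    return accuracy
```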

Overall, GPT-4 achieved the greatest accuracy, at 78% ± 4.1%, followed by Google Bard, Claude and GPT-3.5, at 73% ± 2.9%, 71% ± 1.5% and 63% ± 6.9%, respectively. However, GPT-4’s accuracy decreased by around 8% over the study period, while Claude’s increased by 3% and GPT-3.5’s and Bard’s both fluctuated.

Intra-model discordance decreased throughout the study, indicating that their response consistency was improving. The group notes that the LLMs performed best on questions that called for broad interpretations, but did not do as well on prompts requiring more detailed, factual information.  
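As an illustration of how response consistency might be quantified, the brief sketch below computes a simple discordance rate between two monthly rounds of answers from the same model. The metric definition here is an assumption for illustration only and may not match the one used in the study.

```python
# Hedged illustration of one way to quantify intra-model discordance: the share
# of questions for which the same model changes its answer between two monthly
# rounds. The paper's exact definition may differ; this is an assumption.
def discordance(round_a: list[str], round_b: list[str]) -> float:
    """Fraction of questions answered differently by the same model across two rounds."""
    assert len(round_a) == len(round_b)
    changed = sum(a != b for a, b in zip(round_a, round_b))
    return changed / len(round_a)

# Example: a model that flips 12 of 150 answers between rounds has 8% discordance.
print(discordance(["A"] * 138 + ["B"] * 12, ["A"] * 150))  # 0.08
```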

“A critical limitation of LLMs is their unknown ‘ground-truth' medical knowledge. In subspecialties like radiology, where rare diseases are encountered, poorly understood deep learning network approaches can lead to incorrect responses,” the authors explain. “Underlying phenomenon such as ‘hallucinations’ or ‘drift’ may contribute to false responses or incorrect extracted generalizations based on factual underlying scientific training data.” 

Though the models achieved passing scores overall, their fluctuations in accuracy over time could be problematic in a field sustained by reliability. Their potential is promising, but the authors suggest LLMs cannot yet be safely integrated into clinical practice without significant oversight, as their inconsistency points to a need for further training and refinement.

In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She began covering the medical imaging industry for Innovate Healthcare in 2021.
