Experts highlight 'significant concerns' with fluctuating accuracy of popular large language models
The capabilities of large language models in different medical applications have come a long way since the emergence of popular LLM ChatGPT, but has their potential reached the point of clinical utility?
Not quite yet, according to a new analysis.
Numerous studies of LLMs have indicated that some are capable of providing expert-level medical advice, simplifying radiology reports and accurately answering patient queries. As these models continue to advance, it is important to understand how their performance holds up over time, since their integration into clinical settings hinges on their ability to be consistent, Mitul Gupta, with the University of Texas at Austin’s Dell Medical School, and co-authors write in the European Journal of Radiology.
“While LLMs demonstrate potential across various medical applications, limited information exists regarding the accuracy of publicly available models like ChatGPT-4, ChatGPT-3.5, Bard, and Claude, specifically within the domain of radiology,” the group explains. “These tools, while powerful search and reference aids, require a deeper understanding of their output accuracy, relevance and reliability to inform radiologists and trainees on their potential utility.”
To get a better idea of how LLMs’ performance holds up over time, the team input questions from the ACR Diagnostic Radiology In-Training (DXIT) exam into several big-name LLMs (GPT-4, GPT-3.5, Claude, and Google Bard) monthly from November 2023 to January 2024. The same questions were used for each query to determine whether the models’ accuracy wavered over time.
Overall, GPT-4 achieved the greatest accuracy, at 78 ± 4.1%, followed by Google Bard (73 ± 2.9%), Claude (71 ± 1.5%) and GPT-3.5 (63 ± 6.9%). However, GPT-4's accuracy decreased by around 8% over the study period. Conversely, Claude’s increased by 3%, while GPT-3.5’s and Bard's both fluctuated.
Intra-model discordance decreased throughout the study, indicating that their response consistency was improving. The group notes that the LLMs performed best on questions that called for broad interpretations, but did not do as well on prompts requiring more detailed, factual information.
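The article does not spell out how intra-model discordance was scored, but the underlying idea can be illustrated with a minimal sketch: pose the same question set to a model in different months and count how often its answers change between runs. The function names and data layout below are illustrative assumptions, not the authors' code or scoring method.

```python
# Illustrative sketch only; the study's exact scoring method is not described here.
# `runs` maps a month label to that month's answers, keyed by question ID.
from itertools import combinations

def accuracy(answers: dict[str, str], key: dict[str, str]) -> float:
    """Fraction of questions answered correctly against an answer key."""
    return sum(answers[q] == key[q] for q in key) / len(key)

def intra_model_discordance(runs: dict[str, dict[str, str]]) -> float:
    """Fraction of answers that differ between any two monthly runs, averaged over run pairs."""
    changed = total = 0
    for month_a, month_b in combinations(runs, 2):
        for q in runs[month_a]:
            total += 1
            changed += runs[month_a][q] != runs[month_b][q]
    return changed / total if total else 0.0

# Hypothetical example: three questions queried in two different months.
runs = {
    "2023-11": {"q1": "A", "q2": "C", "q3": "B"},
    "2024-01": {"q1": "A", "q2": "D", "q3": "B"},
}
key = {"q1": "A", "q2": "C", "q3": "B"}

print(accuracy(runs["2023-11"], key))   # 1.0 (all three answers match the key)
print(intra_model_discordance(runs))    # 0.33... (one of three answers changed between months)
```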
“A critical limitation of LLMs is their unknown ‘ground-truth’ medical knowledge. In subspecialties like radiology, where rare diseases are encountered, poorly understood deep learning network approaches can lead to incorrect responses,” the authors explain. “Underlying phenomena such as ‘hallucinations’ or ‘drift’ may contribute to false responses or incorrect extracted generalizations based on factual underlying scientific training data.”
Though the models achieved passing scores overall, their fluctuating accuracy over time could be problematic in a field that depends on reliability. Their potential is promising, but LLMs cannot yet be safely integrated into clinical practice without significant oversight, as their inconsistency points to a need for further training and refinement, the authors suggest.