Experts highlight 'significant concerns' with fluctuating accuracy of popular large language models

The capabilities of large language models in different medical applications have come a long way since the emergence of popular LLM ChatGPT, but has their potential reached the point of clinical utility? 

Not quite yet, according to a new analysis. 

Numerous studies of LLMs have indicated that some are capable of providing expert-level medical advice, simplifying radiology reports and accurately answering patient queries. As these models continue to advance, it is important to understand how their performance holds up over time, since their integration into clinical settings hinges on their ability to be consistent, Mitul Gupta, with the University of Texas at Austin’s Dell Medical School, and co-authors write in the European Journal of Radiology.

“While LLMs demonstrate potential across various medical applications, limited information exists regarding the accuracy of publicly available models like ChatGPT-4, ChatGPT-3.5, Bard, and Claude, specifically within the domain of radiology," the group explains. "These tools, while powerful search and reference aids, require a deeper understanding of their output accuracy, relevance and reliability to inform radiologists and trainees on their potential utility.” 

To get a better idea of how LLMs’ performance holds up over time, the team input questions from the ACR Diagnostic In-Training Exam (DXIT) into several big-name LLMs—GPT-4, GPT-3.5, Claude and Google Bard—monthly from November 2023 to January 2024. The same questions were used for each query to determine whether the models’ accuracy wavered over time.
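For readers who want a concrete sense of that design, the short Python sketch below outlines the repeated-query protocol in general terms. It is not the authors’ code: the model list, question set and query_model() helper are hypothetical placeholders, and real runs would call each vendor’s API.

```python
# Minimal sketch of the repeated-query design described in the study (not the
# authors' actual code). The model list, question set and query_model() are
# hypothetical placeholders; real runs would call each vendor's API.
from collections import defaultdict

MODELS = ["GPT-4", "GPT-3.5", "Claude", "Bard"]   # models evaluated in the study
MONTHS = ["2023-11", "2023-12", "2024-01"]        # monthly query rounds

def query_model(model: str, question: str) -> str:
    """Placeholder: pose the same DXIT-style question to a given LLM and return its answer."""
    raise NotImplementedError("Swap in a real API call for each vendor.")

def run_rounds(questions: dict[str, str]) -> dict:
    """Ask every model the identical question set each month and score accuracy."""
    accuracy = defaultdict(dict)                  # accuracy[model][month] = fraction correct
    for month in MONTHS:
        for model in MODELS:
            correct = sum(query_model(model, q) == key for q, key in questions.items())
            accuracy[model][month] = correct / len(questions)
    return accuracy
```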

Overall, GPT-4 achieved the greatest accuracy, at 78% ± 4.1%, followed by Google Bard, Claude and GPT-3.5, at 73% ± 2.9%, 71% ± 1.5% and 63% ± 6.9%, respectively. However, GPT-4’s accuracy decreased by around 8% over the study period, while Claude’s increased by 3% and GPT-3.5’s and Bard’s both fluctuated.

Intra-model discordance decreased throughout the study, indicating that their response consistency was improving. The group notes that the LLMs performed best on questions that called for broad interpretations, but did not do as well on prompts requiring more detailed, factual information.  
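As an illustration of how response consistency might be quantified, the brief sketch below computes a simple discordance rate between two monthly rounds of answers from the same model. The metric definition here is an assumption for illustration only and may not match the one used in the study.

```python
# Hedged illustration of one way to quantify intra-model discordance: the share
# of questions for which the same model changes its answer between two monthly
# rounds. The paper's exact definition may differ; this is an assumption.
def discordance(round_a: list[str], round_b: list[str]) -> float:
    """Fraction of questions answered differently by the same model across two rounds."""
    assert len(round_a) == len(round_b)
    changed = sum(a != b for a, b in zip(round_a, round_b))
    return changed / len(round_a)

# Example: a model that flips 12 of 150 answers between rounds has 8% discordance.
print(discordance(["A"] * 138 + ["B"] * 12, ["A"] * 150))  # 0.08
```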

“A critical limitation of LLMs is their unknown ‘ground-truth' medical knowledge. In subspecialties like radiology, where rare diseases are encountered, poorly understood deep learning network approaches can lead to incorrect responses,” the authors explain. “Underlying phenomenon such as ‘hallucinations’ or ‘drift’ may contribute to false responses or incorrect extracted generalizations based on factual underlying scientific training data.” 

Though the models achieved passing scores overall, their fluctuations in accuracy over time could be problematic in a field sustained by reliability. Their potential is promising, but the authors suggest LLMs cannot yet be safely integrated into clinical practice without significant oversight, as their inconsistency points to a need for further training and refinement.

In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She began covering the medical imaging industry for Innovate Healthcare in 2021.
