New research offers reminder of why ChatGPT should not be used for second opinions

A paper published April 30 in Radiology details the missteps of a trio of large language models (LLMs) tasked with assigning BI-RADS categories based on information provided in breast imaging reports.

In many cases, the LLMs’ categorizations would have negatively affected patients, further highlighting the need for increased vigilance in monitoring the use of these tools in medical settings, authors of the study cautioned.

“Evaluating the abilities of generic LLMs remains important as these tools are the most readily available and may unjustifiably be used by both patients and nonradiologist physicians seeking a second opinion,” study co-lead author Andrea Cozzi, MD, PhD, post-doctoral research fellow at the Imaging Institute of Southern Switzerland, said in a release. “The results of this study add to the growing body of evidence that reminds us of the need to carefully understand and highlight the pros and cons of LLM use in healthcare.”

BI-RADS categories are used to rate imaging findings and guide any subsequent medical testing that might be needed. Though the system is certainly helpful, interreader agreement on BI-RADS category assignments varies. Natural language processing tools have shown promise in addressing these inconsistencies.

For this study, experts presented three popular LLMs—GPT-3.5 and GPT-4 (the most recent versions of OpenAI’s ChatGPT) and Google Gemini (formerly Bard)—with 2,400 breast imaging radiology reports written in three languages (Italian, English and Dutch). The models were tasked with assigning BI-RADS categories using information derived from those reports, which were based on findings from mammography, MRI and ultrasound exams. The models’ performance was then compared with that of seasoned breast radiologists.
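To make the setup concrete, a request of this kind might look like the hypothetical sketch below, which assumes OpenAI’s Python client with an API key set in the environment. The prompt wording and report text are illustrative only; the study’s actual prompts are not reproduced in this article.

```python
# Hypothetical sketch: asking a general-purpose LLM to assign a BI-RADS
# category from report text. Assumes the openai Python package (v1+) and
# an OPENAI_API_KEY environment variable; the report below is made up.
from openai import OpenAI

client = OpenAI()

report_text = (
    "Bilateral screening mammogram. Scattered fibroglandular densities. "
    "No suspicious mass, calcification or architectural distortion."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "You will be given a breast imaging report. Reply with the "
                "single most appropriate BI-RADS category (0-6) and nothing else."
            ),
        },
        {"role": "user", "content": report_text},
    ],
)
print(response.choices[0].message.content)
```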

Both the original reporting radiologists and the reviewing readers recorded nearly perfect interreader agreement, while the LLMs achieved moderate agreement with the original reports. The reviewing radiologists’ interpretations resulted in BI-RADS categories being either upgraded or downgraded for around 5% of the reports, but the three LLMs changed categories for close to 25% of the reports on average. 
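For context, “nearly perfect” and “moderate” agreement typically refer to ranges of a chance-corrected agreement statistic; the article does not name the exact measure the study used, but Cohen’s kappa is a common choice. A minimal sketch with made-up category assignments:

```python
# Minimal sketch of quantifying interreader agreement on BI-RADS categories
# with Cohen's kappa (via scikit-learn). All assignments below are invented
# for illustration and are not data from the study.
from sklearn.metrics import cohen_kappa_score

# Hypothetical BI-RADS categories assigned to the same ten reports.
original_reader  = [2, 1, 4, 3, 2, 5, 1, 2, 3, 4]
reviewing_reader = [2, 1, 4, 3, 2, 5, 1, 2, 3, 4]  # identical: kappa = 1.0
llm_reader       = [2, 3, 4, 2, 2, 4, 1, 3, 3, 5]  # several category changes

print(cohen_kappa_score(original_reader, reviewing_reader))  # near-perfect
print(cohen_kappa_score(original_reader, llm_reader))        # noticeably lower
```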

Up to 18% of the LLMs’ category reassignments would have negatively affected clinical management of findings, with Google Gemini recording the most mistakes, the authors noted.

Although these tools have proven themselves valuable in numerous settings, including some involving medical information, they must be used with caution, the authors wrote, especially by patients and nonradiologist providers who may be seeking a second opinion. 

“These programs can be a wonderful tool for many tasks but should be used wisely. Patients need to be aware of the intrinsic shortcomings of these tools, and that they may receive incomplete or even utterly wrong replies to complex questions.” 

The study abstract is available in Radiology.

In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She began covering the medical imaging industry for Innovate Healthcare in 2021.
