New research offers reminder of why ChatGPT should not be used for second opinions

A paper published April 30 in Radiology details the missteps of a trio of large language models (LLMs) tasked with assigning BI-RADS categories based on information provided in breast imaging reports.

In many cases, the LLMs’ categorizations would have negatively affected patients, further highlighting the need for increased vigilance in monitoring the use of these tools in medical settings, authors of the study cautioned.

“Evaluating the abilities of generic LLMs remains important as these tools are the most readily available and may unjustifiably be used by both patients and nonradiologist physicians seeking a second opinion,” study co-lead author Andrea Cozzi, MD, PhD, post-doctoral research fellow at the Imaging Institute of Southern Switzerland, said in a release. “The results of this study add to the growing body of evidence that reminds us of the need to carefully understand and highlight the pros and cons of LLM use in healthcare.”

BI-RADS categories are used to rate imaging findings and guide any subsequent medical testing that might be needed. Though the system is certainly helpful, interreader agreement on BI-RADS category assignments varies. Natural language processing tools have shown promise in addressing these inconsistencies.

For this study, researchers presented three popular LLMs—GPT-3.5 and GPT-4 (the most recent versions of OpenAI’s ChatGPT) and Google Gemini (formerly Bard)—with 2,400 breast imaging radiology reports written in three different languages (Italian, English and Dutch). The models were tasked with assigning BI-RADS categories using information derived from those reports, which were based on findings from mammography, MRI and ultrasound exams. The models’ performance was then compared with that of seasoned breast radiologists.
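The article doesn’t reproduce the study’s prompts, but the task setup can be illustrated with a minimal sketch, assuming OpenAI’s Python client. The prompt wording, model name and sample report below are hypothetical illustrations, not taken from the paper:

```python
# Minimal sketch of the study's task: ask an LLM to assign a BI-RADS
# category from a breast imaging report. The prompt text, model choice
# and sample report are hypothetical, not the study's own materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

report = (
    "Bilateral mammogram: scattered fibroglandular densities. "
    "No suspicious mass, calcification, or architectural distortion."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are given a breast imaging report. "
                    "Reply with a single BI-RADS category (0-6) only."},
        {"role": "user", "content": report},
    ],
)
print(response.choices[0].message.content)  # e.g. "1" or "BI-RADS 1"
```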

Both the original reporting radiologists and the reviewing readers recorded nearly perfect interreader agreement, while the LLMs achieved moderate agreement with the original reports. The reviewing radiologists’ interpretations resulted in BI-RADS categories being either upgraded or downgraded for around 5% of the reports, but the three LLMs changed categories for close to 25% of the reports on average. 
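The article doesn’t name the agreement statistic the authors used, but interreader agreement on categorical ratings such as BI-RADS is commonly quantified with measures like Cohen’s kappa. A minimal sketch with made-up category assignments:

```python
# Illustration of quantifying interreader agreement on BI-RADS
# categories with Cohen's kappa (the article does not specify which
# agreement statistic the study used). All ratings below are made up.
from sklearn.metrics import cohen_kappa_score

original = [1, 2, 4, 3, 2, 1, 5, 4, 2, 3]  # original reporting radiologist
reviewer = [1, 2, 4, 3, 2, 1, 5, 4, 2, 4]  # reviewing radiologist
llm      = [1, 3, 4, 2, 2, 0, 5, 3, 2, 3]  # hypothetical LLM output

print(cohen_kappa_score(original, reviewer))  # high: near-perfect agreement
print(cohen_kappa_score(original, llm))       # lower: moderate agreement
```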

Up to 18% of the LLMs’ category reassignments would have negatively affected clinical management of findings, with Bard recording the most mistakes, the authors noted. 

Although these tools have proven themselves valuable in numerous settings, including some involving medical information, they must be used with caution, the authors wrote, especially by patients and nonradiologist providers who may be seeking a second opinion. 

“These programs can be a wonderful tool for many tasks but should be used wisely. Patients need to be aware of the intrinsic shortcomings of these tools, and that they may receive incomplete or even utterly wrong replies to complex questions.” 

The study abstract is available in Radiology.

Hannah Murphy

In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She joined Innovate Healthcare in 2021 and has since put her unique expertise to use in her editorial role with Health Imaging.

