New research offers reminder of why ChatGPT should not be used for second opinions

A paper published April 30 in Radiology details the missteps of a trio of large language models (LLMs) tasked with assigning BI-RADS categories based on information provided in breast imaging reports.

In many cases, the LLMs’ categorizations would have negatively affected patients, further highlighting the need for increased vigilance in monitoring the use of these tools in medical settings, authors of the study cautioned.

“Evaluating the abilities of generic LLMs remains important as these tools are the most readily available and may unjustifiably be used by both patients and nonradiologist physicians seeking a second opinion,” study co-lead author Andrea Cozzi, MD, PhD, post-doctoral research fellow at the Imaging Institute of Southern Switzerland, said in a release. “The results of this study add to the growing body of evidence that reminds us of the need to carefully understand and highlight the pros and cons of LLM use in healthcare.”

BI-RADS categories are used to rate imaging findings and guide any subsequent medical testing that might be needed. Though the system is certainly helpful, interreader agreement on BI-RADS category assignments varies. Natural language processing tools have shown promise in addressing these inconsistencies.

For this study, experts presented three popular LLMs—GPT-3.5 and GPT-4 (the most recent versions of OpenAI’s ChatGPT) and Google Gemini (formerly Bard)—with 2,400 breast imaging radiology reports written in three languages (Italian, English and Dutch). The models were tasked with assigning BI-RADS categories using information derived from those reports, which were based on findings from mammography, MRI and ultrasound exams. The models’ performance was then compared with that of seasoned breast radiologists.
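To make the setup concrete, a request of this kind might look like the hypothetical sketch below, which assumes OpenAI’s Python client with an API key set in the environment. The prompt wording and report text are illustrative only; the study’s actual prompts are not reproduced in this article.

```python
# Hypothetical sketch: asking a general-purpose LLM to assign a BI-RADS
# category from report text. Assumes the openai Python package (v1+) and
# an OPENAI_API_KEY environment variable; the report below is made up.
from openai import OpenAI

client = OpenAI()

report_text = (
    "Bilateral screening mammogram. Scattered fibroglandular densities. "
    "No suspicious mass, calcification or architectural distortion."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "You will be given a breast imaging report. Reply with the "
                "single most appropriate BI-RADS category (0-6) and nothing else."
            ),
        },
        {"role": "user", "content": report_text},
    ],
)
print(response.choices[0].message.content)
```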

Both the original reporting radiologists and the reviewing readers recorded nearly perfect interreader agreement, while the LLMs achieved moderate agreement with the original reports. The reviewing radiologists’ interpretations resulted in BI-RADS categories being either upgraded or downgraded for around 5% of the reports, but the three LLMs changed categories for close to 25% of the reports on average. 
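For context, “nearly perfect” and “moderate” agreement typically refer to ranges of a chance-corrected agreement statistic; the article does not name the exact measure the study used, but Cohen’s kappa is a common choice. A minimal sketch with made-up category assignments:

```python
# Minimal sketch of quantifying interreader agreement on BI-RADS categories
# with Cohen's kappa (via scikit-learn). All assignments below are invented
# for illustration and are not data from the study.
from sklearn.metrics import cohen_kappa_score

# Hypothetical BI-RADS categories assigned to the same ten reports.
original_reader  = [2, 1, 4, 3, 2, 5, 1, 2, 3, 4]
reviewing_reader = [2, 1, 4, 3, 2, 5, 1, 2, 3, 4]  # identical: kappa = 1.0
llm_reader       = [2, 3, 4, 2, 2, 4, 1, 3, 3, 5]  # several category changes

print(cohen_kappa_score(original_reader, reviewing_reader))  # near-perfect
print(cohen_kappa_score(original_reader, llm_reader))        # noticeably lower
```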

Up to 18% of the LLMs’ category reassignments would have negatively affected clinical management of findings, with Google Gemini recording the most mistakes, the authors noted.

Although these tools have proven themselves valuable in numerous settings, including some involving medical information, they must be used with caution, the authors wrote, especially by patients and nonradiologist providers who may be seeking a second opinion. 

“These programs can be a wonderful tool for many tasks but should be used wisely. Patients need to be aware of the intrinsic shortcomings of these tools, and that they may receive incomplete or even utterly wrong replies to complex questions.” 

The study abstract is available in Radiology.

In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She began covering the medical imaging industry for Innovate Healthcare in 2021.
