ChatGPT's medical advice may be deterring women from necessary imaging
ChatGPT’s ability to answer patients’ medical questions has improved substantially since the large language model’s inception, but experts caution that its advice still requires oversight.
A new analysis in Clinical Imaging details the capabilities of GPT-3.5, a version of OpenAI’s popular LLM, revealing that it can flexibly respond to prompts written in both lay language and standard medical terminology. However, the model’s responses may be misleading in certain situations, which could pose problems for patients who trust the medical advice it provides.
This latest research into the accuracy of ChatGPT’s medical expertise focuses on inquiries related to common breast symptoms, which often prompt individuals to seek information on the web.
“While disease screening and prevention are important components of healthcare, acute symptoms may more urgently prompt patients to seek answers from online resources,” Dana Ataya, MD, with the H. Lee Moffitt Cancer Center and Research Institute in Tampa, Florida, and colleagues explained. “For that reason, we evaluated the accuracy of ChatGPT's responses to common questions about management of various acute breast symptoms. We also explored whether the phrasing of the questions impacted the accuracy of the responses by using both standard medical terminology and colloquial language likely used by the average patient.”
For the study, experts formulated 20 questions about common acute breast conditions based on American College of Radiology (ACR) Appropriateness Criteria (AC) and the team’s clinical expertise. Of these, seven addressed the most common acute symptoms, nine covered pregnancy-related symptoms and another four inquired about management and imaging recommendations for palpable lumps. The questions were submitted to GPT-3.5 three separate times, and its responses were evaluated by five fellowship-trained breast radiologists.
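To give a sense of how this kind of repeated prompting works in practice, below is a minimal sketch using OpenAI’s Python SDK. It is not the study’s actual code; the sample question, model name and settings are illustrative assumptions.

```python
# Minimal sketch of repeated prompting, assuming the OpenAI Python SDK
# (pip install openai). The question below is illustrative and is not
# one of the study's actual prompts; the study's exact settings are unknown.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "I felt a lump in my breast. Do I need a mammogram or an ultrasound?"

# Submit the same question three separate times, mirroring the study design,
# since an LLM's answers can vary from run to run.
for attempt in range(3):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": QUESTION}],
    )
    print(f"--- Response {attempt + 1} ---")
    print(response.choices[0].message.content)
```

Submitting each question multiple times, as the researchers did, helps reveal how consistent, or inconsistent, the model’s recommendations are across runs.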
The LLM performed well with questions related to common acute conditions, yielding appropriate responses for all seven inquiries. This performance remained consistent regardless of whether the prompt was presented in lay language or using medical terminology.
GPT-3.5’s performance slid significantly when it was prompted with questions about pregnancy-related symptoms, however, providing appropriate responses in just three out of nine scenarios.
For recommendations on the management of palpable abnormalities, the LLM returned appropriate responses for three out of four prompts, or 75% accuracy. While that may seem like an acceptable performance, the authors cautioned that this particular subset of responses, when incorrect, could deter patients from seeking necessary medical evaluations.
“Overall, the responses provided information about mammography that minimized its role in the diagnostic workup of a palpable breast abnormality. Such information from a novel, technologically advanced resource has the potential to negatively impact a patient's decision making,” the authors cautioned. “This reinforces the view that physician input remains essential in contextualizing medical information provided by ChatGPT, a recurring conclusion of many studies evaluating the use of ChatGPT in medical decision-making.”
Learn more about the team’s findings in Clinical Imaging.