ChatGPT is overly worried about ED patients

If ChatGPT were in charge of an emergency department (ED), patients would likely end up with a bill for unnecessary tests and treatment, a study published in Nature Communications found. [1]

Researchers from the University of California, San Francisco (UCSF) tested the ability of GPT-3.5 and GPT-4, the models underlying ChatGPT, to make rapid decisions about care in an emergency setting, only to find the models tend to overprescribe antibiotics, order too many X-rays and admit patients to the hospital unnecessarily. 

The popular AI chatbot, which has shown an aptitude for clinical decision-making in other studies, was outperformed by a resident ED physician even when prompted in ways designed to improve its accuracy. 

“[Our research sends] a valuable message to clinicians not to blindly trust these models,” the study's lead author Chris Williams, MD, a postdoctoral researcher at UCSF, said in a statement. “ChatGPT can answer medical exam questions and help draft clinical notes, but it’s not currently designed for situations that call for multiple considerations, like the situations in an emergency department.”

Previous research from Williams showed ChatGPT was slightly better than a human clinician at determining which emergency patient should be triaged first based on the acuity of their illness or injury. For this retrospective study, however, the AI was given more complicated, less binary decision-making tasks.

Using a set of 251,401 archival records of emergency department visits, Williams and his team selected a sample of 1,000, making sure to precisely match the ratios of X-rays ordered, antibiotics administered and hospital admissions seen in the emergency department at UCSF Health.
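For a concrete picture of that sampling step, here is a minimal Python sketch. The column names, the use of pandas and the helper name sample_matching_ratios are illustrative assumptions; the paper does not publish its sampling code.

```python
import pandas as pd

def sample_matching_ratios(visits: pd.DataFrame, n: int = 1000,
                           seed: int = 0) -> pd.DataFrame:
    """Draw roughly n visits so each combination of the three interventions
    keeps its share of the full archive. Column names are assumptions."""
    key = ["antibiotics_given", "imaging_ordered", "admitted"]
    # Target size for each stratum, proportional to its share of all visits.
    # Rounding can leave the total a few rows off n; the study used exactly 1,000.
    sizes = (visits.groupby(key).size() / len(visits) * n).round().astype(int)
    parts = [
        group.sample(int(sizes[name]), random_state=seed)
        for name, group in visits.groupby(key)
        if sizes[name] > 0
    ]
    return pd.concat(parts)
```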

ChatGPT’s underlying models were each given access to the records, with the researchers manually entering free-text or handwritten notes from attending physicians. This was done to ensure the AI had access to all the same examinations and clinical findings as a human physician. The models were then asked to make one of three decisions: give the patient antibiotics; order an X-ray or other medical imaging; or admit the patient for a hospital stay. 
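As a rough illustration of how such a decision might be posed to the models through the OpenAI chat API, consider the sketch below. The prompt wording, the yes/no framing and the helper name ask_model are assumptions, not the paper's actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical phrasings for the three ED decisions described above.
DECISIONS = {
    "antibiotics": "Should this patient be prescribed antibiotics?",
    "imaging": "Should an X-ray or other radiological imaging be ordered?",
    "admission": "Should this patient be admitted to the hospital?",
}

def ask_model(clinical_note: str, decision: str, model: str = "gpt-4") -> str:
    """Pose one yes/no ED decision to the model and return its answer."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic answers make evaluation repeatable
        messages=[
            {"role": "system",
             "content": "You assist with emergency department decisions. "
                        "Answer only 'yes' or 'no'."},
            {"role": "user",
             "content": f"{clinical_note}\n\n{DECISIONS[decision]}"},
        ],
    )
    return response.choices[0].message.content.strip().lower()
```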

From there, the researchers measured how accurately the models determined the correct clinical course of action, using a series of four increasingly detailed prompts to guide them. 
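Scoring such a setup is straightforward: each yes/no recommendation is compared against what actually happened in the record. A minimal sketch of that accuracy calculation, with made-up sample inputs:

```python
def accuracy(predictions: list[str], labels: list[bool]) -> float:
    """Fraction of yes/no model answers that match the recorded decision."""
    hits = sum((p == "yes") == label for p, label in zip(predictions, labels))
    return hits / len(labels)

# Illustrative only: each of the four prompt variants would be scored this
# way and compared against the resident physician's accuracy.
print(accuracy(["yes", "no", "yes", "yes"], [True, False, False, True]))  # 0.75
```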

When compared to a physician, ChatGPT was typically overzealous in recommending all three courses of action. GPT-4 and GPT-3.5 were both less reliable than a resident physician, with measured reductions in accuracy of 8% and 24%, respectively. So while GPT-4 outperformed its predecessor, it was ultimately inferior to a human. 

Blame the Internet

Williams said ChatGPT's weakness is that it is trained on the internet, which tends to be packed with unreliable medical information. Further, clinical websites tend to err on the side of caution, pushing patients to see a doctor and seek treatment. 

“These models are almost fine-tuned to say, ‘seek medical advice,’ which is quite right from a general public safety perspective,” he said. “But erring on the side of caution isn’t always appropriate in the ED setting, where unnecessary interventions could cause patients harm, strain resources and lead to higher costs for patients.” 

In their paper detailing the study, the authors said ChatGPT is far too prone to false-positive suggestions to be recommended for use in an emergency care setting. The AI will require a better framework before it can be deployed, one that strikes a balance between missing a serious clinical sign and flagging ones that are not there. 

“It is unclear, however, what is the best balance of sensitivity/specificity to strive for amongst clinical LLMs—it is likely that this balance will differ based on the particular task,” Williams, et al., wrote. “The increase in LLM specificity, at the expense of sensitivity, across our iterations of prompt engineering suggests that improvements could be made bespoke to the task, though the extent to which prompt engineering alone may improve performance is unclear.”
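To make the sensitivity/specificity tradeoff the authors describe concrete, here is a small sketch of how the two metrics are computed from yes/no recommendations. The data are invented for illustration; an over-cautious model maximizes one metric at the expense of the other.

```python
def sensitivity_specificity(preds: list[bool],
                            labels: list[bool]) -> tuple[float, float]:
    """Sensitivity: share of true cases caught. Specificity: share of
    non-cases correctly left alone."""
    tp = sum(p and y for p, y in zip(preds, labels))
    tn = sum(not p and not y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum(not p and y for p, y in zip(preds, labels))
    return tp / (tp + fn), tn / (tn + fp)

# A model that recommends intervention for every patient (the over-cautious
# failure mode described above) scores perfect sensitivity, zero specificity.
labels = [True, False, False, True, False]
print(sensitivity_specificity([True] * 5, labels))  # (1.0, 0.0)
```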

The study was funded by the Eunice Kennedy Shriver National Institute of Child Health and Human Development and the National Institutes of Health.

It can be read in full at the reference link below. 

Chad Van Alstin, Health Imaging | Health Exec

Chad is an award-winning writer and editor with over 15 years of experience working in media. He has a decade-long professional background in healthcare, working as a writer and in public relations.

