Why practices might want to think twice before using ChatGPT to create patient education materials
Practices hoping to cut corners by having ChatGPT draft patient education pamphlets might want to think twice before relying on the large language model.
According to a new analysis in Academic Radiology, ChatGPT-generated educational materials intended to inform patients about various interventional radiology (IR) procedures are inaccurate at best and misleading at worst.
“Despite the remarkable capabilities of ChatGPT and similar LLMs, they are not without their challenges, particularly in the realm of healthcare. The potential for disseminating inaccurate information and the occurrence of 'hallucinations'—responses that are generated without grounding in factual data—are significant concerns,” Arash Bedayat, MD, with the department of radiological sciences at the David Geffen School of Medicine, UCLA, and co-authors caution. “These hallucinations, as they are termed, refer to instances where the model produces information that is entirely fabricated or not verifiable against the training data.”
For the study, experts tested the large language model’s knowledge of interventional radiology procedures by having five users (three radiologists and two radiologists-in-training) prompt it to create educational pamphlets on 20 common IR procedures using identical commands. Two independent radiologists then assessed the materials for accuracy, quality and consistency.
A vast but misapplied vocabulary
ChatGPT consistently referenced appropriate medical terminology, but the materials it produced with that terminology were riddled with inaccuracies about the procedures themselves. The reviewers observed issues in 30% of the pamphlets the large language model generated.
The most common inaccuracies involved potential procedural complications and whether sedation was required.
“The omission of sedation information can result in uninformed consent, where patients are not fully aware of the procedure's experience or the risks involved,” the group cautions. “The absence of pre-procedural preparation details could lead to procedural delays or increased risks during the procedure.”
A line-by-line comparison also revealed inconsistencies between the materials, with significant variations in structure and formatting despite different users prompting ChatGPT with the exact same commands.
“One of the major obstacles in adopting ChatGPT in healthcare is the need for up-to-date and current medical data,” the authors note. “This underscores the importance of ongoing human supervision and expert validation in utilizing large language models for medical educational purposes.”
Future studies should address how to fine-tune the data that large language models like ChatGPT draw on, the authors suggest. Some studies have already begun testing the utility of plug-ins containing data specific to a certain topic (radiology appropriateness criteria, for example) for training large language models to provide accurate health-related information.
The study abstract can be found here.