ChatGPT shows 'significant promise' in guiding contrast-related decisions
With proper guidance, the popular large language model ChatGPT could help guide radiologists on the use of contrast media in patients who might be at risk of an adverse reaction.
Researchers recently tested OpenAI's latest ChatGPT model, GPT-4, to assess how its ability to provide information related to contrast media might change when it is given access to guideline-backed information. The group observed a significant improvement in the model's accuracy after supplying it with real administration guidelines.
Experts involved in the study suggested that the large language model's ability to adapt when given updated information points to a promising future as a support tool in clinical settings, particularly when timely decisions about contrast use need to be made.
“Wrong administration can lead to complications like allergy or post contrast acute kidney injury, highlighting the importance of structured guidelines,” corresponding author Michael Scheschenja, with the department of diagnostic and interventional radiology at University Hospital Marburg in Germany, and co-authors noted. “These evidence-based guidelines provide clarity for radiologists in ambiguous situations. However, the detailed nature of these guidelines can be time-consuming in urgent clinical scenarios, leading to possible nonadherence with potential serious repercussions.”
For their study, the team first tested the accuracy of GPT-4's recommendations without any added reference material by asking it 64 questions related to contrast administration. They then used a plug-in containing official contrast guidelines to give the model direct access to guideline content for specific situations.
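The article does not describe exactly how the plug-in exposed the guideline text to GPT-4, but a common way to achieve something similar is to pass relevant guideline excerpts to the model alongside each question. The Python sketch below illustrates that general idea using the OpenAI chat API; the file name, example question, and prompt wording are illustrative assumptions, not the study's actual protocol.

```python
# Minimal sketch (not the study's setup): supplying guideline text to GPT-4 as
# context before asking a contrast-related question with the OpenAI Python client.
# The file name, question, and prompt wording below are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Load the guideline text that a plug-in might otherwise retrieve on demand.
with open("contrast_guidelines.txt", "r", encoding="utf-8") as f:
    guidelines = f.read()

question = (
    "A patient with reduced kidney function requires a contrast-enhanced CT. "
    "What precautions do the guidelines recommend?"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "You are assisting a radiologist. Base your answer strictly on "
                "the following contrast media guidelines:\n\n" + guidelines
            ),
        },
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
```

In practice, a plug-in or retrieval step would select only the guideline passages relevant to each question rather than passing the full document, but the effect is the same: the model answers from supplied reference material instead of its general training data.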
Exposure to the guidelines significantly improved GPT-4's recommendation accuracy. The average quality rating of its responses rose from 3.98 to 4.33, while its utility scores climbed from 4.1 to 4.4.
Overall, 82.3% of GPT-4's recommendations were rated "highly" adherent to the guidelines and 14% were considered "moderately" accurate.
“This implies that GPT-4 is not just capable of generating humanlike textual responses but can also extract relevant information from extensive documents,” the authors wrote.
While GPT-4 performed well overall, its responses were not without fault. In several instances, its answers were graded as "insufficient" or "very bad" by experienced radiologists. This highlights one of the major challenges facing large language models: their reliability depends heavily on the quality and specificity of the information they are given.
Providing guideline content directly via plug-ins could help address the issue, but additional research is still needed to determine the reasons behind large language models' fluctuating accuracy, the authors suggested.
The study abstract can be viewed in Current Problems in Diagnostic Radiology.