ChatGPT knows a lot about PET scans, but its advice is inconsistent
ChatGPT remains inconsistent when it comes to dishing out medical advice, but that does not mean it is necessarily useless. In another study analyzing the reliability of the popular large language model application from OpenAI, researchers from Germany posed 25 questions focused entirely on PET/CT scans to ChatGPT, rating each answer for accuracy, helpfulness and consistency. The results of the study were published in the Journal of Nuclear Medicine. [1]
Questions posed to ChatGPT ranged from basic (“Is a PET scan harmful?”) to more complicated (“I’m a caregiver to a toddler. Are there any precautionary measures after a PET scan?”), and overall the AI did fairly well with its responses, with 23 of the 25 answers ranked as appropriate by the researchers.
ChatGPT scored particularly poorly on only two questions – “I’m a caregiver to a toddler. Are there any precautionary measures after a PET scan?” and “What’s my lymphoma stage?” – with researchers rating the responses as “quite inappropriate.” No answers from ChatGPT earned the lowest rating of “highly inappropriate.”
Complicating the results are the ratings given for consistency. Some responses were rated favorably for appropriateness, but the chatbot was inconsistent in how it responded when asked the same question multiple times, calling its reliability into question.
“Specifically, ChatGPT responses to more than 90% of questions were adequate and useful even by the standards expected of general advice given by nuclear medicine staff. In the three responses rated ‘quite unhelpful’ or ‘quite inappropriate,’ answers in at least one of the repeated trials were precise and correct,” lead author Julian Rogasch, MD, of Charité-Universitätsmedizin Berlin, and colleagues wrote. “Although this observation shows that ChatGPT is per se capable of providing appropriate answers to all 25 tasks, this variation in responses led to a rating of ‘considerable inconsistency.’”
The authors noted that none of the answers generated by ChatGPT would have caused harm, and in general its responses to patient questions related to PET/CT scans were consistent with those of a medical professional. The final rating for each query was determined by a vote among the researchers.
“In a medical context, ChatGPT may be best regarded as an information tool rather than an advisory or decision tool. Every response from ChatGPT included a statement that the findings and their consequences should always be discussed with the treating physician,” they wrote.
Some of the questions posed were rated for empathy, with ChatGPT’s response viewed favorably in five out of six cases. In other words, the program also appears to have a good bedside manner.
“Questions targeting crucial information, such as staging or treatment, were answered with the necessary empathy and an optimistic outlook,” the authors wrote.