Prompts matter: How input elements affect large language model performance
Just like radiologists, large language models perform better when given proper context.
For LLMs, that context typically comes in the form of prompts containing patient data, though newer multimodal models can also analyze image data. These models have garnered great interest among the radiology community for their potential to assist with generating reports, imaging protocols and differential diagnoses based on case presentations. However, the jury is still out with regard to their clinical utility, as research on multimodal LLM performance has produced mixed results.
The authors of a new analysis in the journal Radiology suggest this could be due to the limited input combinations studied thus far.
“These studies evaluated queries with only images or clinical information and images as input. Importantly, the provided types of model input were shown to have a relevant impact on diagnostic performance,” Su Hwan Kim, from the Institute of Diagnostic and Interventional Neuroradiology at the Technical University of Munich, and colleagues explain. “One study showed higher diagnostic performance with multimodal input (image and medical history) compared with text-only or image-only prompts. However, more granular variations in input elements, such as textual descriptions of image findings and image annotations, have not been investigated systematically.”
The group analyzed GPT-4V, a popular multimodal LLM that processes both text and images, to better understand how different input elements affect performance when these models generate differential diagnoses. Four input elements were studied: unannotated images, image annotations, medical histories and textual image descriptions. Using multiple combinations of these elements, GPT-4V was tasked with generating differential diagnoses for brain MRI scans.
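To make the setup concrete, below is a minimal sketch of how a prompt combining these elements might be assembled with the OpenAI Python SDK. The model name, file paths, case details and prompt wording are illustrative assumptions, not the study's actual protocol.

```python
# Sketch of a multimodal prompt combining the four input elements
# studied (image, annotation, medical history, image description).
# Model name, paths and wording are assumptions, not the authors' code.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Base64-encode an image file so it can be sent inline."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Hypothetical case inputs.
history = "55-year-old man with progressive headache and left-sided weakness."
description = "Ring-enhancing, T2-hyperintense lesion in the right frontal lobe."

response = client.chat.completions.create(
    model="gpt-4o",  # a current multimodal model; the study used GPT-4V
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Medical history: {history}\n"
                     f"Image description: {description}\n"
                     "List the three most likely differential diagnoses."},
            # Unannotated and annotated versions of the same MRI slice.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image('mri.png')}"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image('mri_annotated.png')}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Dropping individual elements from the `content` list (the image description, the history, the annotated image) mirrors the input combinations the researchers compared.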
Prompts that contained all four elements achieved the highest accuracy, at 69%. Accuracy declined significantly when image descriptions were removed, while prompts that lacked annotated images or relied on annotations alone yielded the weakest performance. Image descriptions and medical histories proved the most influential on model accuracy.
“One possible explanation for GPT-4V’s low performance with image input alone is that its training data likely contains only a limited quantity of radiologic images with high-quality labels, unlike abundant textual content on radiologic imaging features of various diagnoses,” the authors explain.
The group says the findings highlight the importance of radiologist-written image descriptions, even for multimodal LLMs that can analyze image data directly.
“More specialized models tailored to medicine and radiology are needed, although advancements in this field are contingent on the availability of large, high-quality multimodal datasets that can be used for model training and validation,” they suggest, adding that it is also worth studying whether specialized prompt engineering training for imaging specialists could improve the diagnostic accuracy of multimodal LLMs in the future.
Learn more about the model’s performance in the study abstract.