GPT-4 now has vision—can it actually read chest X-rays?
Fine-tuned, pre-trained large language models are beginning to reliably translate image content into text, but are they ready to take on medical images?
Not quite yet. At least, not according to a new study published Tuesday in Radiology that put GPT-4 with vision (GPT-4V) to the test on a series of chest radiographs. The visual recognition skills of the large language model fell far short of clinical standards, achieving a positive predictive value (PPV) of no more than 25%, even in its best attempt at spotting image findings across a set of 100 chest X-rays.
“The emergence of multimodal large language models (LLMs) that can understand both text and images, such as OpenAI’s GPT-4V, shows potential for automated image-text pair generation,” corresponding author Yifan Peng, PhD, an associate professor with the department of population health sciences at Weill Cornell Medicine, and colleagues noted. “Although GPT-4V has shown promise in understanding real-world images, its effectiveness in interpreting real-world chest radiographs was limited.”
To test GPT-4V's image interpretation skills, the researchers presented it with 100 radiographs drawn from two public datasets, one from the National Institutes of Health (NIH) and one from the Medical Imaging and Data Resource Center (MIDRC). Two separate assessments took place: one in a zero-shot setting, which gave the large language model no prior examples, and one in a few-shot setting, which provided two examples. Model outputs were scored against ICD-10 codes for the images' corresponding clinical conditions.
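As an illustration only, and not the study's actual prompting code, a zero-shot request to a vision-capable model through the OpenAI Python SDK might look like the sketch below. In the few-shot setting, two example image-and-findings pairs would be prepended as earlier conversation turns. The file name, prompt wording and model identifier are assumptions for the sake of the example.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Base64-encode a local image so it can be sent inline with the request."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Zero-shot: the model sees only the task instruction and the radiograph,
# with no worked examples. (Hypothetical file name and prompt.)
zero_shot_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "List the radiographic findings visible in this chest X-ray."},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{encode_image('cxr_001.png')}"},
            },
        ],
    }
]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model name; the study evaluated GPT-4V
    messages=zero_shot_messages,
    max_tokens=300,
)
print(response.choices[0].message.content)
```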
In the zero-shot setting, GPT-4V's PPV for detecting ICD-10-coded conditions was just over 12% on the NIH dataset, with an average true positive rate (TPR) of 5.8%. On the MIDRC dataset, the model recorded its highest PPV, 25%, alongside an average TPR of 17%.
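For context, the two metrics capture different kinds of error: PPV is the share of findings the model reported that were actually present, while TPR (sensitivity) is the share of findings actually present that the model caught. A minimal sketch of both calculations from raw counts follows; the counts shown are hypothetical and are not figures from the study.

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value: of the findings the model flagged, how many were real."""
    return true_positives / (true_positives + false_positives)


def tpr(true_positives: int, false_negatives: int) -> float:
    """True positive rate (sensitivity): of the findings present, how many were caught."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts for a single ICD-10-coded condition, not data from the study:
print(ppv(true_positives=3, false_positives=21))  # 0.125  -> 12.5% PPV
print(tpr(true_positives=3, false_negatives=49))  # ~0.058 -> ~5.8% TPR
```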
Performance improved slightly in the few-shot setting, when GPT-4V was given example images to learn from, but the gains were not substantial.
Overall, GPT-4V was most accurate in identifying chest drains, air-space disease and lung opacity, and least able to detect endotracheal tubes, central venous catheters and degenerative changes of osseous structures.
“Our results highlight the need for additional comprehensive development and assessment prior to incorporating the GPT-4V model into clinical practice routines,” the authors wrote, adding that larger, more diverse datasets containing real-world images from multiple modalities are needed to more thoroughly train and validate these models.