GPT-4 as accurate as radiologists in predicting final diagnoses from MRI reports
As large language models continue to advance, so too does their potential to enhance diagnostic processes in radiology. Now, new research suggests that LLMs can predict diagnoses from imaging reports with accuracy on par with that of human physicians.
Recently published in European Radiology, the study details how OpenAI’s star LLM, GPT-4, matched radiologists’ accuracy in predicting final diagnoses from preoperative MRI reports describing brain tumors, and outperformed them on differential diagnoses. Researchers involved in the work signaled that their results indicate a role for LLMs in providing second opinions.
“Within the realm of LLMs, the GPT series, in particular, has gained significant attention,” corresponding author Daiju Ueda, an associate professor at Osaka Metropolitan University’s Graduate School of Medicine in Japan, and co-authors noted. “Many applications have been explored within the field of radiology. Among these, the potential of GPT to assist in diagnosis from image findings is noteworthy because such capabilities could complement the essential aspects of daily clinical practice and education.”
For the study, researchers first translated 150 preoperative brain MRI reports—compiled by either a radiologist or a neurologist—from Japanese to English. GPT-4 and a group of five radiologists were given the textual findings from the reports and asked to provide a differential diagnosis and a final diagnosis. Those predictions were then compared against postoperative pathological analyses to determine accuracy.
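The paper’s exact prompts and pipeline are not reproduced here; purely as an illustration, a minimal sketch of how report findings might be submitted to GPT-4 through OpenAI’s chat completions API could look like the following. The prompt wording, the sample findings, and the function name are assumptions for demonstration, not the study’s actual protocol.

```python
# Illustrative sketch only -- the study's actual prompts and code are not published.
# Assumes the `openai` Python SDK (v1+) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def predict_diagnoses(report_findings: str) -> str:
    """Ask GPT-4 for differential and final diagnoses from MRI report findings."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[
            {
                "role": "system",
                "content": (
                    "You are assisting with neuroradiology. Given the findings "
                    "from a preoperative brain MRI report, list a differential "
                    "diagnosis and state the single most likely final diagnosis."
                ),
            },
            {"role": "user", "content": report_findings},
        ],
    )
    return response.choices[0].message.content

# Hypothetical example findings, not drawn from the study's dataset:
print(predict_diagnoses(
    "Heterogeneously enhancing intra-axial mass in the left frontal lobe "
    "with surrounding vasogenic edema and central necrosis."
))
```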
For final diagnoses, GPT-4 achieved an accuracy of 74%, while the radiologists’ predictions were accurate between 65% and 79% of the time. However, GPT-4's accuracy climbed to 80% when reports were written by neurologists, compared to 60% with reports from radiologists.
Notably, GPT-4's accuracy was markedly better than the radiologists’ on differential diagnoses, at 94%; the radiologists’ highest accuracy was 89%. Unlike with final diagnoses, GPT-4’s differential diagnosis accuracy was consistent regardless of which type of provider wrote the report.
The authors pointed out that their study marks the first time GPT-4 has been challenged with actual clinical radiology reports.
“The majority of previous research suggested the utility of GPT-4 in diagnostics, but these relied heavily on hypothetical environments such as quizzes from academic journals or examination questions,” the group noted. “This approach can lead to a cognitive bias since the individuals formulating the imaging findings or exam questions also possess the answers.”
In contrast, using real clinical reports to test LLMs provides a more robust understanding of their accuracy and how they might perform in clinical settings. The group suggested that their findings indicate legitimate clinical potential for GPT-4.
“The encouraging results of this study invite further evaluations of the LLM’s accuracy across a myriad of medical fields and imaging modalities,” the authors wrote. “The end goal of such exploration is to pave the way for the development of more versatile, reliable, and powerful tools for healthcare.”