GPT-4 as accurate as radiologists in predicting final diagnosis based on MRI reports

As large language models continue to advance, so too does their potential to enhance diagnostic processes within the radiology space. Now, new research suggests that LLMs can predict diagnoses from imaging reports with accuracy comparable to that of human physicians.

Recently published in European Radiology, the study details how OpenAI’s star LLM, GPT-4, matched radiologists’ accuracy in predicting final diagnoses using preoperative MRI reports describing brain tumors. The researchers said their results point to a role for LLMs in providing second opinions.

“Within the realm of LLMs, the GPT series, in particular, has gained significant attention,” corresponding author Daiju Ueda, an associate professor at Osaka Metropolitan University’s Graduate School of Medicine in Japan, and co-authors noted. “Many applications have been explored within the field of radiology. Among these, the potential of GPT to assist in diagnosis from image findings is noteworthy because such capabilities could complement the essential aspects of daily clinical practice and education.” 

For the study, researchers first translated 150 preoperative brain MRI reports, each compiled by either a radiologist or a neurologist, from Japanese to English. GPT-4 and a group of five radiologists were given the textual findings from the reports and asked to provide differential and final diagnoses. Those diagnoses were then compared against postoperative pathological analyses to determine accuracy.

For final diagnoses, GPT-4 achieved an accuracy of 74%, while the radiologists’ predictions were accurate between 65% and 79% of the time. GPT-4’s accuracy depended on who wrote the report: it climbed to 80% when reports were written by neurologists, compared to 60% with reports from radiologists.

Notably, GPT-4’s differential diagnosis accuracy, at 94%, was markedly better than that of any of the radiologists, whose highest recorded accuracy was 89%. Differential diagnosis accuracy was consistent regardless of which provider wrote the report.

The authors pointed out that their study marks the first time GPT-4 has been tested on actual clinical radiology reports.

“The majority of previous research suggested the utility of GPT-4 in diagnostics, but these relied heavily on hypothetical environments such as quizzes from academic journals or examination questions,” the group noted. “This approach can lead to a cognitive bias since the individuals formulating the imaging findings or exam questions also possess the answers.” 

In contrast, using real clinical reports to test LLMs provides a more robust understanding of their accuracy and how they might perform in clinical settings. The group suggested that their findings indicate legitimate clinical potential for GPT-4. 

“The encouraging results of this study invite further evaluations of the LLM’s accuracy across a myriad of medical fields and imaging modalities,” the authors wrote. “The end goal of such exploration is to pave the way for the development of more versatile, reliable, and powerful tools for healthcare.”

Hannah Murphy

In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She joined Innovate Healthcare in 2021 and has since put her unique expertise to use in her editorial role with Health Imaging.
