New scoring systems help grade the accuracy of AI-generated radiology reports

Artificial intelligence (AI) algorithms are increasingly being used to generate radiology reports, but how can a radiologist trust that the results are accurate and complete?

An international team of researchers—including specialists from Harvard Medical School (HMS) and Stanford University—explored that very question, presenting its findings in Patterns.[1]

“Accurately evaluating AI systems is the critical first step toward generating radiology reports that are clinically useful and trustworthy,” senior author Pranav Rajpurkar, PhD, assistant professor of biomedical informatics with the Blavatnik Institute at HMS, said in a prepared statement.

Rajpurkar et al. started by evaluating different scoring metrics designed to grade AI-generated radiology reports. Automated scoring systems struggled at times, the group found, even missing key errors along the way.

As one might expect, a team of six human radiologists did much better when asked to grade the AI-generated reports. Rajpurkar and his team therefore aimed to design a scoring system that could deliver performance comparable to that of radiologists. The group developed RadGraph F1, a new method for evaluating the algorithms that build radiology reports, and RadCliQ, a new scoring system that combines multiple quality metrics into a single grade. Both newly designed metrics, the authors wrote, “demonstrate stronger correlation with radiologists' evaluations” than previous scoring systems. They also described RadGraph F1 and RadCliQ as “meaningful metrics for guiding future research in radiology report generation.”
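To illustrate the general idea behind an overlap-based metric like RadGraph F1, here is a minimal, hypothetical sketch of computing an F1 score between the clinical entities found in a reference report and those in a generated report. The entity-extraction step and the example finding names are assumptions for illustration; the published metric also accounts for relations between entities, which this sketch omits.

```python
# Hypothetical sketch of an entity-overlap F1 score, in the spirit of
# RadGraph F1. Entity extraction is assumed to happen elsewhere; here
# each report is represented simply as a set of finding strings.

def overlap_f1(reference_entities, generated_entities):
    """Compute the F1 score between two sets of clinical entities."""
    ref = set(reference_entities)
    gen = set(generated_entities)
    if not ref and not gen:
        return 1.0  # both reports empty: trivially in agreement
    true_pos = len(ref & gen)  # entities present in both reports
    precision = true_pos / len(gen) if gen else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the generated report captures two of three reference findings
ref = {"cardiomegaly", "pleural effusion", "atelectasis"}
gen = {"cardiomegaly", "pleural effusion"}
score = overlap_f1(ref, gen)  # precision 1.0, recall 2/3, F1 = 0.8
```

A set-overlap F1 rewards reports that mention the same findings as the reference while penalizing both omissions (low recall) and fabricated findings (low precision), which is why metrics of this family track radiologist judgments more closely than surface text-similarity scores.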

“Measuring progress is imperative for advancing AI in medicine to the next level,” co-first author Feiyang ‘Kathy’ Yu, a research associate in Rajpurkar’s lab, said in the same statement. “Our quantitative analysis moves us closer to AI that augments radiologists to provide better patient care.”

“By aligning better with radiologists, our new metrics will accelerate development of AI that integrates seamlessly into the clinical workflow to improve patient care,” Rajpurkar added.


Michael Walter, Managing Editor

Michael has more than 18 years of experience as a professional writer and editor. He has written at length about cardiology, radiology, artificial intelligence and other key healthcare topics.

