Natural language processing can limit report discrepancies between AI and radiologists
Neither radiologists nor artificial intelligence algorithms have a perfect track record when it comes to diagnostic accuracy, but quality assurance measures can help resolve discordance between the two.
A new paper in the Journal of the American College of Radiology details how one radiology department implemented natural language processing software into its workflow to resolve inadvertent discordance between physicians and an AI decision support system (AI DSS). The software was used to flag certain CT exams when radiologists’ findings differed from those of the AI DSS or when radiologists did not engage with decision support at all. While the NLP software did not often detect discordance between radiologists and the AI DSS, it did uncover missed diagnoses on some high-acuity CT scans that would have been consequential for patients had the findings gone unidentified.
The team highlighted the importance of understanding radiologists’ uptake of AI DSS software and how it affects clinical workflows.
“Once implemented in clinical practice, quality assurance and monitoring processes need to be embedded into the AI augmented radiology clinical workflow,” corresponding author M. Chekmeyan, with the Department of Radiology at UMass Chan Medical School, UMass Memorial Medical Center, and colleagues explained. “There is a limited body of literature comprised of studies that detail individual AI QA workflows applied in a retrospective fashion but a paucity of literature for real world guidance on what this looks like when prospectively initiated, on an institutional level.”
For their study, the group included all high-acuity CT scans performed at their institution over a 2.5-year timeframe. Both radiologists and the AI DSS interpreted the images for intracranial hemorrhage, cervical spine fracture and pulmonary embolus. A scan was flagged if a radiologist reported it as negative, the AI DSS assigned it a high probability of being positive, and the reader did not engage with decision support.
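As a rough illustration of that flagging rule, the sketch below checks those three conditions on a single exam. The field names, probability threshold, and simple phrase-matching stand-in for the NLP step are hypothetical and are not drawn from the study's actual software.

```python
# Hypothetical sketch of the QA flagging rule described above.
# Field names, threshold, and the keyword check are illustrative only.
from dataclasses import dataclass

NEGATIVE_PHRASES = (
    "no acute intracranial hemorrhage",
    "no cervical spine fracture",
    "no evidence of pulmonary embolism",
)

@dataclass
class CtExam:
    report_text: str        # finalized radiology report
    ai_probability: float   # AI DSS probability of a positive finding (0-1)
    reader_engaged: bool    # whether the radiologist viewed the AI DSS result

def report_is_negative(report_text: str) -> bool:
    """Crude stand-in for the NLP step: treat the report as negative
    if it contains any of the expected negative phrases."""
    text = report_text.lower()
    return any(phrase in text for phrase in NEGATIVE_PHRASES)

def flag_for_qa(exam: CtExam, threshold: float = 0.9) -> bool:
    """Flag a scan for QA review when the report reads negative, the AI DSS
    assigns a high probability of a positive finding, and the reader never
    engaged with decision support."""
    return (
        report_is_negative(exam.report_text)
        and exam.ai_probability >= threshold
        and not exam.reader_engaged
    )

# Example: negative report, high AI probability, no engagement -> flagged
exam = CtExam("No acute intracranial hemorrhage.", ai_probability=0.97, reader_engaged=False)
print(flag_for_qa(exam))  # True
```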
Of 111,674 scans, the workflow uncovered missed diagnoses just 0.02% of the time. A total of 12,412 CTs were prioritized as positive by the AI DSS, 0.04% of which were discordant, unengaged, and flagged for QA; of those, 57% were true positives.
In 85% of discordant cases, addendums were made and communicated within 24 hours of being flagged.
While discrepancies between radiologists and the AI DSS were rare in this study, the authors maintained that the NLP-based software’s ability to rapidly resolve discordance could limit the potential for consequential missed diagnoses on high-acuity exams.
The study abstract is available here.