Hybrid method may improve dataset quality, quantity for deep learning
A hybrid technique combining natural language processing (NLP) and IBM Watson can accurately label free-text pathology reports, according to a new Journal of Digital Imaging study. The method may improve the quality and quantity of large-scale datasets for deep learning.
Most medical imaging datasets don’t contain enough cases to adequately train algorithms, wrote lead author Hari M. Trivedi, of the University of California, San Francisco, and colleagues. Developing large datasets with proper structuring and annotation typically requires lengthy manual effort, and sometimes even clinical trials.
“The sheer number of cases required for effective deep learning makes these types of manual methods unfeasible, if not impossible,” authors wrote.
Trivedi and colleagues used 9,898 breast pathology reports taken from more than 7,000 women from 1997 to 2014. Reports were shorter than 1,024 characters and covered breast specimens, fine-needle aspirations, core biopsies, lumpectomies and mastectomies. Only the “final diagnosis” section of each report was used in the analysis, they noted.
An expert manually annotated more than 3,500 reports into one of four classes: left positive, right positive, bilateral positive or negative.
Researchers used what they called an “annotation framework” that combined traditional NLP and IBM Watson’s Natural Language Classifier (NLC).
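The study does not publish its code, but the traditional-NLP half of such a framework is often a rule-based pass over the report text. Below is a minimal, hypothetical sketch of that idea: it isolates the “final diagnosis” section and assigns one of the study’s four laterality classes from keyword matches. The term lists and section pattern are illustrative assumptions, not the authors’ actual rules.

```python
import re

# Illustrative malignancy keywords -- an assumption, not the study's lexicon.
MALIGNANT = re.compile(r"\b(carcinoma|malignan\w*|dcis)\b", re.IGNORECASE)

def extract_final_diagnosis(report: str) -> str:
    """Keep only the 'final diagnosis' section, as the study did."""
    m = re.search(r"final diagnosis:?(.*)", report, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else report

def classify(report: str) -> str:
    """Rule-based laterality sketch: left/right/bilateral positive, or negative."""
    sides = set()
    for sentence in re.split(r"[.;\n]+", extract_final_diagnosis(report)):
        if MALIGNANT.search(sentence):
            low = sentence.lower()
            if "left" in low:
                sides.add("left")
            if "right" in low:
                sides.add("right")
            if "bilateral" in low:
                sides.update({"left", "right"})
    if sides == {"left", "right"}:
        return "bilateral positive"
    if "left" in sides:
        return "left positive"
    if "right" in sides:
        return "right positive"
    return "negative"
```

In the hybrid framework, a learned classifier such as Watson’s NLC would be trained on the expert-annotated reports and could catch phrasings that simple rules like these miss.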
Results showed traditional NLP and Watson’s NLC each performed well across all classes, with overall F-measures above 0.96.
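For context, the F-measure is the harmonic mean of precision and recall, so a value above 0.96 implies both are high. A short computation makes the metric concrete (the counts below are invented for illustration, not taken from the study):

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # fraction of predicted positives that are correct
    recall = tp / (tp + fn)      # fraction of true positives that are found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 96 true positives, 2 false positives, 2 false negatives.
score = f_measure(96, 2, 2)
```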
“This technique significantly accelerates the rate of extraction of meaningful data from clinical free-text reports and has important implications for improving the quantity and quality of large-scale datasets available for deep learning,” authors wrote.
Both methods performed poorly when classifying bilateral positives, which the authors attributed to the “paucity” of training examples for that class.
Trivedi and colleagues argued “black-box” solutions, such as IBM Watson, are more user-friendly than traditional NLP. While powerful, the latter technique requires technical knowledge and programming experience that clinicians might not have.
Researchers maintained their results were promising but said more work is needed before such methods can be applied more broadly.
“Future work will focus on expanding this process to other medical records such as radiology reports and clinical notes as well as testing other automated solutions from Facebook, Google, Amazon and Microsoft,” authors wrote. “We hope to design an automated pipeline for large-scale clinical data annotation so that existing clinical records can be efficiently utilized for development of deep learning algorithms.”