Misuse of public imaging data is producing 'overly optimistic' results in machine learning research

Hannah Murphy | March 23, 2022 | Health Imaging | Artificial Intelligence

Misuse of available public data could lead to biased results in research that analyzes the utility of machine learning in medical imaging.

Such “off label” use happens when public data published for one task are used to train algorithms for a different function. New research published in Proceedings of the National Academy of Sciences analyzed how this happens, as well as the accompanying consequences.

“This work reveals that such off-label usage could lead to biased, overly optimistic results of machine-learning algorithms,” lead author Efrat Shimron, from the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, and co-authors wrote. “The underlying cause is that public data are processed with hidden processing pipelines that alter the data features.”

Some datasets freely available to the public include pre-processed, rather than raw, images. Consequently, they contain less information than the original acquisitions, which becomes problematic when researchers use these images to develop reconstruction algorithms.

The researchers used two processing pipelines typical of open-access databases to study their impact on three well-known MRI reconstruction algorithms (compressed sensing, dictionary learning and deep learning) when applied to both raw and processed images.

The experts explained that images the algorithms reconstructed from processed data were clearer and sharper, in some cases up to 48% better than images reconstructed from raw data. This can bias results when researchers train algorithms on processed data without realizing it.

“Our main observation is that bias stems from the unintentional coupling of hidden data-processing pipelines with later retrospective subsampling experiments,” the authors wrote. “The data processing implicitly improves the inverse problem conditioning, and the retrospective subsampling enables the algorithms to benefit from that.”
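The coupling the authors describe can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes a 1D toy signal, assumes zero-padding of k-space as the hidden processing step (the article does not name the pipelines), and uses a plain zero-filled inverse FFT in place of the three reconstruction algorithms. All function names are hypothetical. The same retrospective subsampling mask is applied to the raw and the processed data, and each reconstruction is scored against its own fully sampled reference.

```python
import numpy as np


def make_kspace(n=256, seed=0):
    """Simulate centered 'raw' k-space of a noisy 1D piecewise-constant object."""
    rng = np.random.default_rng(seed)
    image = np.zeros(n)
    image[60:120] = 1.0
    image[150:170] = 0.6
    image = image + 0.05 * rng.standard_normal(n)  # noise at all frequencies
    return np.fft.fftshift(np.fft.fft(image, norm="ortho"))


def zero_pad_pipeline(kspace):
    """Assumed hidden processing: keep the central half of k-space, zero the rest."""
    n = kspace.size
    processed = np.zeros_like(kspace)
    lo, hi = n // 4, 3 * n // 4
    processed[lo:hi] = kspace[lo:hi]
    return processed


def variable_density_mask(n=256, seed=1):
    """Retrospective subsampling mask: dense near the center, sparse at the edges."""
    rng = np.random.default_rng(seed)
    dist = np.abs(np.arange(n) - n // 2) / (n / 2)  # 0 at center, about 1 at edges
    mask = rng.random(n) < (1.0 - dist) ** 2
    mask[n // 2 - 16 : n // 2 + 16] = True  # fully sampled low-frequency core
    return mask


def reconstruct(kspace):
    """Zero-filled reconstruction (a stand-in for a real algorithm)."""
    return np.fft.ifft(np.fft.ifftshift(kspace), norm="ortho")


def nrmse(kspace, mask):
    """Error of the subsampled reconstruction relative to the fully sampled one."""
    reference = reconstruct(kspace)
    estimate = reconstruct(kspace * mask)
    return np.linalg.norm(estimate - reference) / np.linalg.norm(reference)


k_raw = make_kspace()
k_processed = zero_pad_pipeline(k_raw)
mask = variable_density_mask(k_raw.size)

raw_error = nrmse(k_raw, mask)
processed_error = nrmse(k_processed, mask)
print(f"raw data NRMSE:       {raw_error:.3f}")
print(f"processed data NRMSE: {processed_error:.3f}")
```

In this toy setup the processed data score a lower error under the identical mask, because the sparsely sampled outer k-space was already zero after processing, so skipping those samples costs nothing. The size of the gap here says nothing about the figures reported in the study.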

The authors suggested guidelines to help avoid such inflated AI study results, chief among them that data curators provide detailed descriptions of all their processing steps.

“We call for attention of researchers and reviewers: Data usage and pipeline adequacy should be considered carefully, reproducible research should be encouraged, and research transparency should be required,” the experts said.


Hannah Murphy

In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She began covering the medical imaging industry for Innovate Healthcare in 2021.
