Can large language models break language barriers in radiology reports?

Large language models could be the key to breaking the language barrier between patients and providers, new findings suggest. 

Even without language barriers, radiology reports contain complex medical information that can be difficult for the average patient to understand. Translating these reports from an interpreting radiologist’s language to the one spoken by a patient is feasible, though the quality of these translations varies widely. With the growing demand for virtual care and an increasingly mobile population post-pandemic, the need for solutions to language barriers in medical settings is substantial. 

The authors of a new paper in the journal Radiology suggest that large language models could provide a workaround for situations when human interpreters are unavailable. 

“As human translators may not always be readily available, especially with medical imaging expertise, artificial intelligence–based models, particularly large language models, offer promising alternatives. Initially designed for general language processing tasks, transformer-based models such as T5 (Google AI) have demonstrated promising results in various applications, including translation,” co-senior author Keno Bressem, with the Institute for Cardiovascular Radiology and Nuclear Medicine at the German Heart Center of Munich, and colleagues note. “However, with the recent emergence of large decoder-only language models, such as generative pretrained transformer (i.e., GPT) models, the translation of domain-specific text, especially in the field of radiology, has remained unexplored.” 

The team put 10 popular LLMs to the test, including GPT-4 (OpenAI), Llama 3 (Meta), and Mixtral models (Mistral AI), by having them translate 100 fictitious free-text radiology reports from CT and MRI scans into 9 target languages. The reports had previously been translated by 18 radiologists. The models’ accuracy was determined using BiLingual Evaluation Understudy (BLEU) scores, translation error rate (TER) and character-level F-score (chrF++) metrics. 
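To illustrate how a metric like BLEU rewards overlap with a reference translation, here is a minimal sketch of sentence-level BLEU written from its standard definition (modified n-gram precision with a brevity penalty). The example sentences are hypothetical, and the study itself would have used a standard evaluation toolkit rather than hand-rolled code.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) multiplied by a brevity penalty that
    punishes translations shorter than the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean

# Hypothetical radiology-style example (not from the study):
ref = "no evidence of acute intracranial hemorrhage"
print(bleu("no evidence of acute intracranial hemorrhage", ref))  # 1.0, exact match
print(bleu("acute hemorrhage evidence", ref))                     # 0.0, no 2-gram overlap
```

Real toolkits add smoothing and corpus-level aggregation, which is why scores from this sketch should not be compared against published BLEU numbers.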

Overall, GPT-4 yielded the best translations, performing particularly well when translating English into German, Greek, Thai and Turkish. GPT-4's predecessor, GPT-3.5, was most accurate at translating English into French, while Qwen1.5 was best for English-to-Chinese translations and Mixtral 8x22B performed best for Italian-to-English translations. 

On the whole, the larger LLMs’ translations were deemed clear, readable and consistent with the original reports, though there were some discrepancies in specific medical terminology, which could risk providing conflicting information to patients, the authors note. 

The models’ performance largely hinged on their training data, and some were also hampered by context length limitations and inefficient tokenization, the group adds. 

“Large language models exhibit high potential for translating medical reports, but there is no one-size-fits-all model," the team writes, adding that their analysis is strictly experimental. "Model medical translation performance depends on the chosen model and target language, with larger models generally outperforming smaller ones." 



In addition to her background in journalism, Hannah also has patient-facing experience in clinical settings, having spent more than 12 years working as a registered rad tech. She joined Innovate Healthcare in 2021 and has since put her unique expertise to use in her editorial role with Health Imaging.
