Developing board-style radiology questions is resource-intensive. Large language models could help
Radiology educators could soon enjoy the assistive benefits of large language models like ChatGPT, as new research highlights the promise of LLMs to create educational materials and board-style questions.
Published in Academic Radiology, the findings detail the models' competence in drafting multiple-choice questions, answers and rationales. Crafting these materials is typically left to radiologists, who draw from their own educational and clinical experience, but the process is time-consuming and can incur significant costs, authors of the new paper noted.
“A robust item bank for a 40-item computerized exam, administered twice a year, with 25 different forms for each administration and a maximum reuse rate of five for any test question over five years, requires at least 2,000 items,” corresponding author Scott J. Adams, MD, PhD, with the department of medical imaging at Royal University Hospital in Canada, and co-authors explained. “The costs associated with developing such exam banks by physicians are substantial, ranging from $1,500 to $2,500 for a single test item. Consequently, the projection stands between $3,000,000 to $5,000,000 to develop an item bank for a single computer adaptive test.”
The substantial resources needed to compile these materials highlight a need to explore alternatives. One such alternative could be LLMs.
To determine the potential for LLMs to create radiology education materials, the team tasked two models—GPT-4 and Llama 2—with developing 104 multiple-choice questions based on the American Board of Radiology exam blueprint. The questions produced were assessed by two board-certified radiologists, who evaluated each item for clarity, relevance, suitability for a board exam based on difficulty, quality of distractors, and adequacy of rationale. The questions were then compared with American College of Radiology (ACR) Diagnostic Radiology In-Training (DXIT) exam questions.
Both models performed well, but GPT-4 achieved higher scores than Llama 2 across all criteria. In fact, the blinded readers rated GPT-4's questions on par with the ACR DXIT questions. GPT-4 also achieved 100% accuracy with its questions, compared to 69% for Llama 2.
“These findings suggest that GPT-4 holds promise as a valuable tool for enhancing exam preparation materials for radiology residents and expanding question banks for radiology board examinations,” the group wrote. “Further, results from this study suggest that the accessibility and scalability of LLM-generated questions hold the potential to address the perennial issue of limited resources for radiology education.”
Although the findings indicate promise for LLMs to enhance radiology education, the authors caution that the results also highlight variability in model performance, which they said should be accounted for and consistently reevaluated to deploy LLMs effectively in education.
The study abstract is available in Academic Radiology.