Perplexity scores improve identification of fraudulent AI-generated writing
Perplexity scores could be used to identify medical abstracts that have been fraudulently put together using artificial intelligence.
Advanced large language models (LLMs), such as OpenAI’s GPT-4, have displayed significant potential for assisting with the drafting and editing of medical articles. Since 2023, there has been a significant increase in LLM-assisted articles, but some have been produced through “paper mills” or “content mills” that hold lower standards for accuracy and quality. There have also been reports of LLMs producing fraudulent or deceptive papers and citing fake sources.
With this in mind, experts have been working to develop guidelines for the ethical use of LLMs in medical writing. While numerous web-based AI tools have been created to flag published works that appear suspicious for AI authorship, the performance of these tools has been inconsistent thus far.
The authors of a new paper in Academic Radiology propose the use of perplexity scores as a possible solution.
“Perplexity scores, a metric originally introduced to assess the predictability and complexity of text, have been employed in various studies to evaluate the likelihood of text being generated by an LLM as opposed to a human,” Alperen Elek, from the Ege University Faculty of Medicine, and colleagues explained. “However, the application of perplexity scores to academic abstracts, particularly within the field of radiology, has not been thoroughly investigated.”
For their assessment, the team used the full text of 50 previously published radiology research articles to prompt LLMs to generate new abstracts. They then calculated perplexity scores for each abstract, with lower scores indicating more predictable text and, therefore, a greater likelihood of AI authorship. Three tools developed to flag AI-generated writing were also tested for their accuracy in distinguishing between human- and AI-written work.
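The paper does not specify which model or software the researchers used to compute perplexity, but the general idea can be illustrated with a short Python sketch that scores a passage with a small open-source language model (GPT-2 is used here purely as an assumption, not the study's actual tool). Lower perplexity means the model finds the text more predictable, which is the signal used to flag likely AI authorship.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load a small public language model to score how predictable a text is.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the perplexity of `text` under the model.

    Lower values mean the text is more predictable to the model,
    which the study treats as suggestive of AI authorship.
    """
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids
    with torch.no_grad():
        # Passing the same tokens as labels gives the average
        # cross-entropy loss; exponentiating yields perplexity.
        outputs = model(input_ids, labels=input_ids)
    return torch.exp(outputs.loss).item()

# Hypothetical usage with an abstract-like snippet.
abstract = "Purpose: To evaluate the diagnostic accuracy of chest CT in ..."
print(f"Perplexity: {perplexity(abstract):.1f}")
```

In practice, scores from a set of known human-written abstracts would be compared against scores from AI-generated ones to choose a threshold, which is the kind of separation the AUC reported below summarizes.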
Perplexity scores for human-written abstracts were higher than those for AI-generated abstracts, and the scores yielded an AUC of 0.78 for differentiating between the two. In line with prior studies on the topic, the team also observed significant variability in the performance of the dedicated detection tools, with one correctly flagging AI-generated papers just 36% of the time and another achieving 95% accuracy.
“This highlights the need to develop effective and efficient tools for detecting AI-generated content. Determining human-written from AI-generated text may be difficult for human reviewers,” the authors noted, adding that, while there is potential for these scores to help spot fraudulent work, more research is needed to determine how the scores could be further refined.