As AI-generated research papers improve, scientists face a growing challenge
Last summer, Peter Degen’s postdoctoral mentor approached him with a peculiar problem: one of his papers was racking up an excessive number of citations. Citations are vital currency in academia, but these were unusual.
According to The Verge, the paper, published in 2017, evaluated a statistical method’s accuracy on epidemiological data and had accumulated a respectable couple of dozen citations over the years. Recently, though, it was being cited relentlessly, hundreds of times, making it one of the most referenced works of Degen’s career. Another professor might have been thrilled; Degen’s adviser urged him to dig into what was going on.
Degen, a postdoctoral researcher at the University of Zurich’s Centre for Reproducible Science and Research Synthesis, noticed a pattern among the citing papers. Like his original work, they studied the Global Burden of Disease data set, a public resource compiled by the University of Washington’s Institute for Health Metrics and Evaluation.
But the papers were using the data to churn out a seemingly endless stream of forecasts: the future risk of stroke in adults over 20, of testicular cancer in young men, of falls among elderly Chinese citizens, of colorectal cancer in people with low-grain diets, and many more.
When Degen searched GitHub for the code behind these analyses, the trail led him to the Chinese video platform Bilibili, where a Guangzhou-based company was promoting tutorials on producing a publishable paper in under two hours using its software and AI writing tools.
The studies were low quality. When researchers examined a batch of headache-related papers, they found numerous errors and misrepresentations. But the papers weren’t as blatantly flawed as earlier AI-generated work, which made them harder to weed out.
“The peer-review system is extremely burdened and already stretched,” Degen acknowledged. “There’s an avalanche of publications and insufficient peer reviewers. If LLMs lower the barrier for mass production of papers, we will reach a tipping point.”
Proponents of generative AI have high hopes for its potential to drive future scientific advances: accelerating discoveries, curing cancers. For now, though, the technology is undermining a core element of scientific research, drowning editors and reviewers in a flood of submissions. Ironically, the better the technology gets at crafting competent papers, the more severe the crisis becomes.
For a decade, academic publishing has grappled with notorious “paper mills”: clandestine companies that mass-produce papers and sell authorship to academics, physicians, or anyone else seeking the competitive edge that published papers confer. Publishers, often prodded by “science detectives” who specialise in uncovering fraudulent studies, have closed loopholes only for the mills to find new weaknesses to exploit.
Generative AI empowered the mills, letting them bypass plagiarism checks by generating entirely new text. In theory, its distinctive hallucinations allowed publishers to filter most of this output; in practice, papers slipped through, leading to retractions when investigators encountered a diagram of a rat with absurdly large genitals labelled “testtomcels,” or text with “as an AI assistant” accidentally left in.
Now, though, AI has advanced to the point where it can produce convincing papers on its own, tempting academics desperate to meet publication quotas to generate papers themselves. The result is a massive influx of shoddy scientific work that threatens to overwhelm publishing, peer review, grant awarding, and the rest of the research infrastructure.
Matt Spick, a faculty member in health and biomedical data analysis at the University of Surrey and an associate editor at Scientific Reports, first encountered the phenomenon when he was asked to review three strikingly similar papers, each analysing the US National Health and Nutrition Examination Survey (NHANES), another public data set.
A quick check on Google Scholar showed it was no coincidence: there had been a sudden uptick in papers citing NHANES, all following a similar formula, each claiming to uncover a link between, say, walnut consumption and cognitive ability, or skim-milk intake and depression.
“Sufficient computing power allows you to assess every conceivable pairwise association. Eventually, you find relationships not yet explored, and you publish: There's a correlation between this and that,” Spick explained.
Such correlations often oversimplify complex phenomena or result from random statistical anomalies. “One suggested that education years affect postoperative hernia complications. That’s just a random correlation. What should one do with that? Quit school early to avoid potential postoperative hernia complications?”
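To see why exhaustive pairwise testing manufactures publishable-looking “findings,” here is a minimal sketch in Python using purely synthetic data (the variable counts and sample size are hypothetical): test every pair among a few dozen pure-noise variables and count how many clear the conventional p < 0.05 bar by chance alone.

```python
import numpy as np
from scipy import stats

# Simulate a survey of 500 respondents with 40 mutually independent
# variables, standing in for dietary intakes, biomarkers, outcomes, etc.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 40))

# Test every pairwise correlation: 40 * 39 / 2 = 780 hypotheses.
spurious = []
for i in range(40):
    for j in range(i + 1, 40):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        if p < 0.05:
            spurious.append((i, j, r))

# At a 5% false-positive rate, roughly 0.05 * 780 = 39 pairs are
# expected to look "significant" even though everything is pure noise.
print(f"{len(spurious)} 'significant' correlations found in random data")
```

Each of those chance hits is, in effect, a candidate paper of the walnuts-and-cognition variety.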
Over the years, the detectives have developed several techniques for identifying fraudulent papers. One is searching for “tortured phrases”: when fraudsters evade plagiarism checks by running existing text through a synonym generator, technical terms like “reinforcement learning” get mangled into gibberish like “familiarising reinforcement.”
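Screening for tortured phrases can be approximated by matching known mangles against a manuscript. The sketch below is illustrative only; the phrase list mixes the example above with a couple of mangles documented by the detectives, not any real detector’s full vocabulary.

```python
# Known synonym-generator mangles mapped to the likely original term.
# Illustrative list only; real screens use far larger vocabularies.
TORTURED_PHRASES = {
    "familiarising reinforcement": "reinforcement learning",
    "counterfeit consciousness": "artificial intelligence",
    "profound learning": "deep learning",
}

def flag_tortured_phrases(text: str) -> list[tuple[str, str]]:
    """Return (mangled phrase, likely original term) pairs found in text."""
    lowered = text.lower()
    return [(bad, good) for bad, good in TORTURED_PHRASES.items()
            if bad in lowered]

hits = flag_tortured_phrases(
    "We apply familiarising reinforcement to the survey data."
)
print(hits)  # [('familiarising reinforcement', 'reinforcement learning')]
```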
Other detectors search for copied images, analyse co-authorship networks, or scrutinise reference lists for fabricated citations, a classic indicator of LLM use. Spick, for his part, looks for clusters of papers that follow the same template in mining public data sets.
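Reference checking can likewise be partly automated. Most legitimate papers carry a DOI registered with Crossref, whose public REST API returns metadata for known DOIs and a 404 for unknown ones. A minimal sketch, assuming the suspect manuscript’s references have already been parsed into DOIs (the DOIs below are placeholders):

```python
import requests

def doi_exists(doi: str) -> bool:
    """Check whether a cited DOI resolves in the Crossref registry."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Placeholder reference list extracted from a suspect manuscript.
cited_dois = ["10.1038/nature12373", "10.9999/fake.2024.001"]
for doi in cited_dois:
    status = "found" if doi_exists(doi) else "NOT FOUND: possible fabrication"
    print(f"{doi}: {status}")
```

A real pipeline would also rate-limit its requests and handle citations that lack DOIs, which are common in older literature.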