Science Has a Slop Problem. AI Didn't Start It.
Journals are being overrun with AI-generated slop. This is a real concern — poorly prompted LLMs produce generic, padded, citation-hallucinating prose that wastes reviewers’ time and degrades the literature. But framing this as a new problem misses the deeper one.
Scientific publishing has had a slop problem for decades. It predates large language models by a generation. Scientists are incentivised to publish papers — for tenure, for grants, for career advancement. The incentive is to publish, not to report faithfully. The result is a literature full of p-hacked results and buried negative findings. The replication crisis didn’t emerge because AI started writing papers. It emerged because career incentives reward publication volume over reporting accuracy.
AI writing can make this worse, if used the way it’s currently being used: covertly, with no audit trail. A researcher prompts ChatGPT to polish a manuscript, submits it as their own work, and the reader has no way to know what was human-written and what was machine-generated. This is the worst of both worlds — you keep the misaligned human incentives and add the capacity for AI to help dress them up.
But the problem was never that humans write papers. The problem is that the people who design and run studies are the same people whose careers depend on those results looking good. Authorship and incentive are entangled, and the entanglement produces systematic bias. This is a principal-agent problem: the scientific community wants faithful reporting, but the individual researcher’s career incentives diverge from that goal.
Reallocating labour
Humans have a comparative advantage over AI at scientific review. Evaluating whether a methodology is sound and whether the claims follow from the evidence requires domain expertise and the kind of critical reading that current AI models do poorly. Reviewers are the quality layer of science, and right now they’re unpaid and undervalued.
AI, conversely, has a useful property for scientific writing: it has no career. It won’t bury a negative result because it doesn’t care about impact factors. This doesn’t mean AI writing is unproblematic — LLMs have their own failure modes, from hallucinated citations to generic hedging. But these are engineering problems, addressable through prompt design. And crucially, the prompts themselves become part of the publication’s audit trail. A reader can inspect exactly what instructions the model was given and judge whether they enforce faithful reporting. You can’t do that with a human author’s internal motivations.
So let AI do the writing, under controlled conditions, and redirect human effort towards review. Not because AI writes better — it doesn’t — but because it writes without the incentive distortions that cause publication bias.
Measuring what matters: an r-index for reviewers
If we want to reallocate scientific labour towards review, we need to measure and reward it. Currently we don’t — reviewing is invisible work that earns no credit.
A provisional idea: a researcher has r-index r if r is the largest number such that they’ve reviewed r papers that each went on to receive at least r citations. The intuition: good reviewers improve papers, and improved papers get cited more. It’s the h-index, but for reviewing instead of authoring.
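The computation is mechanical, and identical in shape to the h-index. A minimal sketch, where `citation_counts` holds the citation totals of the papers a researcher has reviewed:

```python
def r_index(citation_counts):
    # Sort the reviewed papers' citation counts in descending order,
    # then find the largest r with at least r papers cited >= r times.
    counts = sorted(citation_counts, reverse=True)
    r = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            r = i
        else:
            break
    return r

# A reviewer whose reviewed papers earned [10, 8, 5, 4, 3, 0]
# citations has r-index 4.
```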
It’s not perfect. An “accept” on an already-great paper earns undeserved credit, and correctly rejecting a flawed study scores zero. But it’s a starting point — and with open peer review, where reviews are published and attributable, reviews themselves become citable documents. At that point they enter the same citation economy as papers. Ideas and critical thinking go into reviews too; there’s no reason they shouldn’t accumulate scholarly credit.
Platforms that publish reviews openly and maintain immutable audit trails can actually measure review impact. We incentivise what we measure. Time to start measuring reviewing.
What controlled conditions look like
Saying “let AI write the paper” without constraints would be irresponsible. The control comes from the pipeline surrounding the generation:
Pre-registration as contract. The researcher submits a study design before collecting data — hypotheses, analysis plan, variables, inference criteria. This is the specification the AI writes against. Deviations from the plan are flagged in the generated paper, not hidden.
Section-by-section generation with cumulative context. The paper is generated one section at a time, each building on the previous. The system prompt enforces faithful reporting — precise statistical language, contradictions stated clearly, deviations from the pre-registration flagged prominently.
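The generation loop might look like the following sketch. Here `model_call` is a hypothetical stand-in for whatever LLM API the pipeline uses, and the system prompt is illustrative, not the platform's actual one:

```python
def generate_paper(sections, prereg, model_call):
    """Generate a paper one section at a time, feeding each completed
    section back into the context for the next. `model_call` is a
    hypothetical stand-in for an LLM API call."""
    system = (
        "Report results faithfully. Use precise statistical language. "
        "State contradictions clearly. Flag any deviation from the "
        "pre-registration prominently."
    )
    # The pre-registration is the contract every section is written against.
    context = f"Pre-registration:\n{prereg}\n"
    paper = {}
    for name in sections:
        text = model_call(system=system,
                          prompt=f"{context}\nWrite the {name} section.")
        paper[name] = text
        context += f"\n{name}\n{text}"  # cumulative context
    return paper
```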
Citation verification. One of the known failure modes of LLM-generated academic text is hallucinated references. After generation, every citation is checked against Semantic Scholar and CrossRef. Each reference gets a confidence score and a visible badge — green for verified, yellow for partial match, red for unverified. The failure mode becomes visible rather than hidden.
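The verification step reduces to matching each reference against the closest bibliographic record and scoring the match. A sketch using plain string similarity; the thresholds and function names are illustrative, not the platform's actual values:

```python
from difflib import SequenceMatcher

def match_confidence(claimed_title, found_title):
    # Similarity between the citation as written and the closest
    # record returned by a bibliographic lookup.
    a = claimed_title.lower().strip()
    b = found_title.lower().strip()
    return SequenceMatcher(None, a, b).ratio()

def badge(confidence):
    # Illustrative thresholds for the visible badge.
    if confidence >= 0.95:
        return "green"   # verified
    if confidence >= 0.6:
        return "yellow"  # partial match
    return "red"         # unverified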
Open, attributed peer review. Reviews are published alongside the paper with the reviewer’s name attached. This inverts the traditional model: reviewing becomes a public intellectual contribution with a byline, not invisible unpaid labour. When review is visible and credited, the incentive shifts towards doing it well.
A complete audit trail. Every editorial decision — reviewer matching, review compilation, consistency checks, status transitions — is logged and displayed on the published article. The reader can inspect not just what the AI produced but how it got there.
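One way to make such a trail tamper-evident is a hash chain, where each logged event commits to the one before it, so altering any step invalidates everything after it. A sketch, not the platform's actual log format:

```python
import hashlib
import json

class AuditLog:
    """Append-only log; each entry's hash covers the previous entry's hash."""

    def __init__(self):
        self.entries = []

    def append(self, event, detail):
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {"event": event, "detail": detail, "prev": prev}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append({**body, "hash": digest})

    def verify(self):
        # Recompute every hash; any edit to an earlier entry breaks the chain.
        prev = "0" * 64
        for e in self.entries:
            body = {k: e[k] for k in ("event", "detail", "prev")}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != digest:
                return False
            prev = e["hash"]
        return True
```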
The audit trail changes the trust model
The common objection to AI-authored science is trust: how do I know this paper is reliable if a machine wrote it?
But “a human wrote it” was never actually a guarantee of reliability — it was a proxy, and a leaky one. The audit trail offers something more concrete. Instead of trusting that the author reported faithfully (which the replication crisis shows we often can’t), the reader can verify the chain: pre-registration against results data, generation logs against the pre-registered plan, peer reviews against the generated paper. Every link is inspectable.
This is more transparent than the current system, not less. Traditional publishing hides most of these steps — pre-registrations are optional, raw data is often unavailable, and peer review is anonymous and unpublished. The AI pipeline makes all of it visible by design.
What this doesn’t solve
This approach addresses the incentive distortion in reporting — the step between having results and communicating them. It doesn’t fix problems upstream — poorly designed studies or fabricated data. If a researcher submits fabricated data, the AI will faithfully report fabricated findings.
It also doesn’t replace human judgement in interpretation. The Discussion section of an AI-generated paper is the weakest part — it can summarise findings and compare them to prior work, but it lacks the domain intuition to identify the most consequential implications. The platform addresses this in two ways: the researcher’s own motivation — why this study matters, what question it addresses — is carried verbatim into the paper’s Introduction, and after acceptance, reviewers can publish perspective commentaries alongside the article. The AI reports the findings; the humans frame what they mean.
A working prototype
I’ve built this as a working platform: AI Open Access Journal (source on GitHub). It supports four study types (empirical, simulation, replication, negative results), generates papers using Claude, verifies citations against Semantic Scholar and CrossRef, and publishes the full chain — pre-registration through to peer reviews and audit trail — alongside the article.
It’s a prototype, not a production journal. But it’s a proof of concept for reallocating scientific labour — humans focus on study design and review, where their judgement is irreplaceable, while AI handles the reporting step where human incentives cause the most damage.
AI is already involved in scientific writing. The remaining question is whether we design that involvement to be transparent and accountable, or let it happen in the shadows, amplifying the incentive problems we already have.

