Jagged AI in Scientific Peer Review: Evidence from POMP Data Analysis
Pith reviewed 2026-05-11 02:08 UTC · model grok-4.3
The pith
AI reviewers catch technical errors in POMP analyses that humans miss but fall short on interpretive and narrative checks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AI reviewing agents exhibited a jagged capability profile: they proficiently caught human-overlooked technical errors and invalid inference methodology but did not match human standards in checking interpretive errors, narrative coherence, and domain-informed model critique. The jaggedness was similar for all agents, consistent with it being primarily a property of the underlying AI model rather than the specific instructions.
What carries the argument
Comparison of four AI agents (Claude Code with differing skill-file instructions) against human peer reviews on 72 POMP student projects, measuring performance separately on technical error detection versus interpretive and narrative assessment.
If this is right
- AI can supplement human peer review by identifying technical implementation errors and flawed inference steps in dynamic model fitting that human reviewers sometimes miss.
- Changing the instructions given to AI agents can shift which specific weaknesses they display but cannot remove the overall jagged profile.
- Quality assessment of mechanistic models gains from AI assistance focused on algorithm correctness and statistical validity.
- The uniformity of jaggedness across instruction variants indicates that prompt engineering alone is insufficient to achieve balanced AI review performance.
Where Pith is reading between the lines
- Review workflows that route technical checks to AI and interpretive or narrative checks to humans could reduce reviewer workload without lowering standards.
- Efforts to improve AI narrative understanding would be required before AI could serve as a full substitute for human peer review in this domain.
- Repeating the analysis on published research papers instead of course projects would test whether the observed jaggedness holds outside student work.
Load-bearing premise
The 72 anonymized student POMP projects and their human peer reviews form a representative testbed for AI performance in scientific peer review of mechanistic dynamic models.
What would settle it
If the same AI agents, applied to a larger collection of professional rather than student POMP analyses, achieve human-comparable performance on interpretive, narrative, and domain-critique items, the claim of model-inherent jaggedness would be contradicted.
Original abstract
Despite their growing use in academic writing and statistical analysis, the performance of artificial intelligence (AI) tools in scientific peer review remains a largely unexplored area. A key challenge is jagged AI, a phenomenon where AI exhibits strong ability spikes in some domains while remaining deficient in others. To study this jaggedness in a practical data science context, we considered the task of reviewing partially observed Markov process (POMP) data analyses. POMP models, also known as state-space models or hidden Markov models, are used to fit mechanistic dynamic models to time series data in diverse applications including disease transmission, ecological dynamics, and financial risk assessment. Quality peer review in this area entails assessment of scientific context, identification of errors in implementing complex algorithms, and decisions concerning methodological best practices. We studied 72 POMP projects from four semesters of a University of Michigan graduate time series course for which the project reports, the source code, and student peer reviews are anonymized and open-access. We compared the human reviews with four AI reviewing agents, using Claude Code with differing instructions implemented as skill files. We found that AI reviewers exhibited a jagged capability profile, proficiently catching human-overlooked technical errors and invalid inference methodology, while failing to match human standards in checking interpretive errors, narrative coherence, and domain-informed model critique. The jaggedness was found to be similar for all agents, consistent with it being primarily a property of the underlying AI model rather than the specific instructions. Skill file configuration shifted which weaknesses agents emphasized, without removing the jaggedness.
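The abstract defines POMP models as state-space (hidden Markov) models fit to time series by simulation-based methods. As a minimal illustration of the kind of computation the reviewed projects perform, here is a bootstrap particle filter estimating the log-likelihood of a toy Gaussian random-walk model; this is a generic sketch with illustrative parameters, not code from the paper or the course.

```python
# Minimal bootstrap particle filter for a toy POMP model:
#   X_t = X_{t-1} + N(0, sigma_proc^2)   (latent process)
#   Y_t = X_t + N(0, sigma_obs^2)        (measurement)
# All names and parameter values are illustrative.
import math
import random

def particle_loglik(ys, n_particles=1000, sigma_proc=1.0, sigma_obs=1.0):
    """Estimate the log-likelihood of observations ys by particle filtering."""
    particles = [0.0] * n_particles
    loglik = 0.0
    for y in ys:
        # Propagate each particle through the process model.
        particles = [x + random.gauss(0.0, sigma_proc) for x in particles]
        # Weight particles by the measurement density at y.
        weights = [math.exp(-0.5 * ((y - x) / sigma_obs) ** 2) /
                   (sigma_obs * math.sqrt(2 * math.pi)) for x in particles]
        loglik += math.log(sum(weights) / n_particles)
        # Multinomial resampling proportional to the weights.
        particles = random.choices(particles, weights=weights, k=n_particles)
    return loglik
```

In practice such analyses use the R package pomp (reference [4]); the sketch only shows the propagate-weight-resample loop whose misuse (e.g. invalid likelihood comparisons) is the kind of technical error the study says AI agents catch well.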
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes AI performance in peer review of 72 anonymized student POMP (partially observed Markov process) data analysis projects from a University of Michigan graduate course. It compares reviews from four Claude-based AI agents (configured via differing skill files) against human student peer reviews, claiming that AI exhibits a jagged profile: strong at detecting technical errors and invalid inference methods overlooked by humans, but weaker at identifying interpretive errors, assessing narrative coherence, and offering domain-informed model critiques. The jaggedness pattern is reported as consistent across agents, indicating it stems primarily from the underlying model rather than specific instructions.
Significance. If the central empirical comparison holds after methodological clarification, the work offers concrete evidence on current LLM limitations in data-science peer review tasks, distinguishing technical/methodological strengths from interpretive weaknesses. The open-access anonymized dataset and multi-agent design with varying instructions are strengths that support reproducibility and allow testing of prompt effects. This could inform hybrid review workflows or targeted AI improvements in mechanistic modeling contexts, though the educational testbed limits broader claims about professional scientific peer review.
major comments (3)
- [Data and Methods] The manuscript provides no explicit description of the error classification taxonomy (technical vs. interpretive vs. narrative/domain), the procedure for identifying 'human-overlooked' errors, or any inter-rater reliability assessment for how reviews were scored or categorized. Without these details or examples, the quantitative basis for the jaggedness claim cannot be evaluated.
- [Results] The conclusion that jaggedness is 'primarily a property of the underlying AI model rather than the specific instructions' is based on observing similar patterns across four agents. No quantitative similarity metric, correlation across performance categories, or statistical comparison between configurations is reported to substantiate this over a qualitative impression.
- [Introduction and Discussion] The title and abstract frame the study as evidence for jagged AI in 'scientific peer review,' yet the testbed uses student course projects and student reviewers. This context likely features simpler implementation errors and lower baseline expertise than professional reviews of published work; the paper does not quantify or discuss how this affects the generalizability of the observed profile.
minor comments (2)
- [Abstract] The phrase 'Claude Code with differing instructions implemented as skill files' is unclear without a brief definition or reference explaining what skill files entail in this implementation.
- [Methods] The manuscript would benefit from a table summarizing the four agent configurations and their key instruction differences to aid reader understanding of the 'similar jaggedness' finding.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights areas where the manuscript can be clarified and strengthened. We address each major comment below, indicating where revisions will be made to improve transparency and rigor while preserving the core empirical findings.
Point-by-point responses
- Referee: [Data and Methods] The manuscript provides no explicit description of the error classification taxonomy (technical vs. interpretive vs. narrative/domain), the procedure for identifying 'human-overlooked' errors, or any inter-rater reliability assessment for how reviews were scored or categorized. Without these details or examples, the quantitative basis for the jaggedness claim cannot be evaluated.
Authors: We agree that the Methods section would benefit from greater explicitness on these points. In the revised manuscript we will add a dedicated subsection describing: (1) the full error classification taxonomy with definitions and examples drawn from the POMP reviews; (2) the exact procedure used to flag human-overlooked errors (systematic side-by-side comparison of AI and human review texts, with discrepancies coded only when the AI identified a verifiable technical or methodological issue absent from all human reviews); and (3) the consensus-based scoring process employed by the author team, including any informal reliability checks performed. This addition will allow readers to evaluate the quantitative comparisons directly. revision: yes
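The coding rule the authors describe (an AI-flagged issue counts as "human-overlooked" only when it is verifiable and absent from every human review of the same project) amounts to a set intersection and difference. A hypothetical sketch, with made-up issue labels and a helper name of our own choosing:

```python
# Hypothetical sketch of the coding step described above. An AI-flagged
# issue is "human-overlooked" only if it was verified as a real error AND
# no human review of the project mentioned it. Labels are illustrative.

def human_overlooked(ai_issues, human_reviews, verified):
    """Return AI-flagged, verified issues absent from all human reviews.

    ai_issues     -- set of issue labels flagged by the AI agent
    human_reviews -- list of sets, one per human review of the project
    verified      -- set of issue labels confirmed as genuine errors
    """
    mentioned_by_humans = set().union(*human_reviews) if human_reviews else set()
    return (ai_issues & verified) - mentioned_by_humans

ai = {"filtering-failure", "log-scale-misread", "weak-conclusion"}
humans = [{"weak-conclusion"}, {"plot-labels"}]
confirmed = {"filtering-failure", "log-scale-misread"}
print(sorted(human_overlooked(ai, humans, confirmed)))
# -> ['filtering-failure', 'log-scale-misread']
```

The verification filter is the load-bearing step: without it, AI hallucinated issues that humans reasonably ignored would inflate the technical-detection advantage.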
- Referee: [Results] The conclusion that jaggedness is 'primarily a property of the underlying AI model rather than the specific instructions' is based on observing similar patterns across four agents. No quantitative similarity metric, correlation across performance categories, or statistical comparison between configurations is reported to substantiate this over a qualitative impression.
Authors: The referee correctly notes that the similarity claim rests on qualitative pattern consistency rather than formal metrics. While the four agents were configured with deliberately divergent skill files, we did not compute cross-agent correlations or similarity indices in the original analysis. In revision we will add a supplementary table reporting pairwise correlations (or other appropriate similarity measures) of the per-category performance scores across the four agents, together with a brief statistical note on the consistency of the jagged profile. This will provide quantitative support for the claim that the underlying model, rather than prompt configuration, drives the observed pattern. revision: yes
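The proposed supplementary analysis, pairwise correlations of per-category performance scores across the four agents, can be sketched as follows; the scores, category ordering, and agent names are placeholders invented for illustration, not the paper's data.

```python
# Hypothetical sketch of the cross-agent similarity check proposed above:
# pairwise Pearson correlations of per-category performance scores.
# All numbers and agent names below are illustrative placeholders.
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Per-category detection rates, ordered as: technical, inference,
# interpretive, narrative, domain critique (made-up values).
scores = {
    "agent-1": [0.85, 0.80, 0.35, 0.30, 0.25],
    "agent-2": [0.90, 0.75, 0.30, 0.35, 0.20],
    "agent-3": [0.80, 0.85, 0.40, 0.25, 0.30],
    "agent-4": [0.88, 0.78, 0.33, 0.28, 0.27],
}

for a, b in combinations(scores, 2):
    print(f"{a} vs {b}: r = {pearson(scores[a], scores[b]):.2f}")
```

Uniformly high pairwise correlations would quantify the "similar jaggedness" claim; a rank-based measure such as Spearman's rho would be the natural robustness check given only five categories.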
- Referee: [Introduction and Discussion] The title and abstract frame the study as evidence for jagged AI in 'scientific peer review,' yet the testbed uses student course projects and student reviewers. This context likely features simpler implementation errors and lower baseline expertise than professional reviews of published work; the paper does not quantify or discuss how this affects the generalizability of the observed profile.
Authors: We acknowledge that the educational setting introduces limits on direct extrapolation to professional peer review. The revised Discussion will explicitly address this by (a) describing the nature of the POMP projects (graduate-level but still course-based), (b) noting that the technical-versus-interpretive jaggedness may be more pronounced or attenuated in expert reviews of published work, and (c) framing the results as evidence for a capability profile that can inform hybrid workflows rather than a definitive characterization of all scientific peer review. We cannot, however, provide a quantitative adjustment for generalizability without new data from professional reviews, which lies outside the current study scope. revision: partial
Circularity Check
No circularity: direct empirical comparison without derivations or self-referential constructions
Full rationale
This is an observational empirical study that directly compares AI-generated peer reviews against human student reviews for 72 anonymized POMP course projects. No mathematical derivations, equations, fitted parameters, or predictions appear in the reported analysis. The central claim of a jagged AI capability profile is presented as an observed pattern in the collected data rather than derived from any self-definition, ansatz, or self-citation chain. The study is self-contained against its own testbed; generalizability concerns (e.g., student vs. professional context) are limitations of scope, not circular reductions of the reported findings to their inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 72 anonymized POMP student projects and peer reviews constitute a valid and representative sample for evaluating AI peer-review capabilities in mechanistic dynamic modeling.
Reference graph
Works this paper leans on
- [1] Anthropic. 2025. "Sub-agents." https://code.claude.com/docs/en/sub-agents. Accessed February 2026.
- [2] Dell'Acqua, Fabrizio, Edward McFowland III, Ethan Mollick, et al. 2026. "Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality." Organization Science 37 (2): 403--23. https://doi.org/10.1287/orsc.2025.21838
- [3] Ionides, Edward L., Dao Nguyen, Yves Atchadé, Stilian Stoev, and Aaron A. King. 2015. "Inference for Dynamic and Latent Variable Models via Iterated, Perturbed Bayes Maps." Proceedings of the National Academy of Sciences of the USA 112 (3): 719--724. https://doi.org/10.1073/pnas.1410597112
- [4] King, Aaron A., Dao Nguyen, and Edward L. Ionides. 2016. "Statistical Inference for Partially Observed Markov Processes via the R Package pomp." Journal of Statistical Software 69 (12): 1--43. https://doi.org/10.18637/jss.v069.i12
- [5]
- [6]
- [7] Morris, Meredith Ringel, Dan Altman, Haydn Belfield, et al. 2026. "Characterizing Model Jaggedness Supports Safety and Usability." Preprint. https://www-cs.stanford.edu/~merrie/papers/jaggedness_preprint.pdf
- [8] Vaccaro, Michelle, Abdullah Almaatouq, and Thomas Malone. 2024. "When Combinations of Humans and AI Are Useful: A Systematic Review and Meta-Analysis." Nature Human Behaviour 8 (12): 2293--303
- [9] Wheeler, Jesse, Anna Rosengart, Zhuoxun Jiang, Kevin Tan, Noah Treutle, and Edward L. Ionides. 2024. "Informing Policy via Dynamic Models: Cholera in Haiti." PLOS Computational Biology 20: e1012032. https://doi.org/10.1371/journal.pcbi.1012032