Jagged AI in Scientific Peer Review: Evidence from POMP Data Analysis
Pith reviewed 2026-05-11 02:08 UTC · model grok-4.3
The pith
AI reviewers catch technical errors in POMP analyses that humans miss but fall short on interpretive and narrative checks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AI reviewing agents exhibited a jagged capability profile: they proficiently caught human-overlooked technical errors and invalid inference methodology but did not match human standards in checking interpretive errors, narrative coherence, and domain-informed model critique. The jaggedness was similar for all agents, consistent with it being primarily a property of the underlying AI model rather than the specific instructions.
What carries the argument
Comparison of four AI agents (Claude Code with differing skill-file instructions) against human peer reviews on 72 POMP student projects, measuring performance separately on technical error detection versus interpretive and narrative assessment.
If this is right
- AI can supplement human peer review by identifying technical implementation errors and flawed inference steps in dynamic model fitting that human reviewers sometimes miss.
- Changing the instructions given to AI agents can shift which specific weaknesses they display but cannot remove the overall jagged profile.
- Quality assessment of mechanistic models gains from AI assistance focused on algorithm correctness and statistical validity.
- The uniformity of jaggedness across instruction variants indicates that prompt engineering alone is insufficient to achieve balanced AI review performance.
Where Pith is reading between the lines
- Review workflows that route technical checks to AI and interpretive or narrative checks to humans could reduce reviewer workload without lowering standards.
- Efforts to improve AI narrative understanding would be required before AI could serve as a full substitute for human peer review in this domain.
- Repeating the analysis on published research papers instead of course projects would test whether the observed jaggedness holds outside student work.
Load-bearing premise
The 72 anonymized student POMP projects and their human peer reviews form a representative testbed for AI performance in scientific peer review of mechanistic dynamic models.
What would settle it
If the same AI agents, applied to a larger collection of professional rather than student POMP analyses, achieve human-comparable performance on interpretive, narrative, and domain-critique items, the claim of model-inherent jaggedness would be contradicted.
Original abstract
Despite their growing use in academic writing and statistical analysis, the performance of artificial intelligence (AI) tools in scientific peer review remains a largely unexplored area. A key challenge is jagged AI, a phenomenon where AI exhibits strong ability spikes in some domains while remaining deficient in others. To study this jaggedness in a practical data science context, we considered the task of reviewing partially observed Markov process (POMP) data analyses. POMP models, also known as state-space models or hidden Markov models, are used to fit mechanistic dynamic models to time series data in diverse applications including disease transmission, ecological dynamics, and financial risk assessment. Quality peer review in this area entails assessment of scientific context, identification of errors in implementing complex algorithms, and decisions concerning methodological best practices. We studied 72 POMP projects from four semesters of a University of Michigan graduate time series course for which the project reports, the source code, and student peer reviews are anonymized and open-access. We compared the human reviews with four AI reviewing agents, using Claude Code with differing instructions implemented as skill files. We found that AI reviewers exhibited a jagged capability profile, proficiently catching human-overlooked technical errors and invalid inference methodology, while failing to match human standards in checking interpretive errors, narrative coherence, and domain-informed model critique. The jaggedness was found to be similar for all agents, consistent with it being primarily a property of the underlying AI model rather than the specific instructions. Skill file configuration shifted which weaknesses agents emphasized, without removing the jaggedness.
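The abstract defines POMP models as state-space (hidden Markov) models fit to time series by simulation-based methods. As a minimal illustration of the kind of computation the reviewed projects perform, here is a bootstrap particle filter estimating the log-likelihood of a toy Gaussian random-walk model; this is a generic sketch with illustrative parameters, not code from the paper or the course.

```python
# Minimal bootstrap particle filter for a toy POMP model:
#   X_t = X_{t-1} + N(0, sigma_proc^2)   (latent process)
#   Y_t = X_t + N(0, sigma_obs^2)        (measurement)
# All names and parameter values are illustrative.
import math
import random

def particle_loglik(ys, n_particles=1000, sigma_proc=1.0, sigma_obs=1.0):
    """Estimate the log-likelihood of observations ys by particle filtering."""
    particles = [0.0] * n_particles
    loglik = 0.0
    for y in ys:
        # Propagate each particle through the process model.
        particles = [x + random.gauss(0.0, sigma_proc) for x in particles]
        # Weight particles by the measurement density at y.
        weights = [math.exp(-0.5 * ((y - x) / sigma_obs) ** 2) /
                   (sigma_obs * math.sqrt(2 * math.pi)) for x in particles]
        loglik += math.log(sum(weights) / n_particles)
        # Multinomial resampling proportional to the weights.
        particles = random.choices(particles, weights=weights, k=n_particles)
    return loglik
```

In practice such analyses use the R package pomp (reference [4]); the sketch only shows the propagate-weight-resample loop whose misuse (e.g. invalid likelihood comparisons) is the kind of technical error the study says AI agents catch well.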
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes AI performance in peer review of 72 anonymized student POMP (partially observed Markov process) data analysis projects from a University of Michigan graduate course. It compares reviews from four Claude-based AI agents (configured via differing skill files) against human student peer reviews, claiming that AI exhibits a jagged profile: strong at detecting technical errors and invalid inference methods overlooked by humans, but weaker at identifying interpretive errors, assessing narrative coherence, and offering domain-informed model critiques. The jaggedness pattern is reported as consistent across agents, indicating it stems primarily from the underlying model rather than specific instructions.
Significance. If the central empirical comparison holds after methodological clarification, the work offers concrete evidence on current LLM limitations in data-science peer review tasks, distinguishing technical/methodological strengths from interpretive weaknesses. The open-access anonymized dataset and multi-agent design with varying instructions are strengths that support reproducibility and allow testing of prompt effects. This could inform hybrid review workflows or targeted AI improvements in mechanistic modeling contexts, though the educational testbed limits broader claims about professional scientific peer review.
major comments (3)
- [Data and Methods] The manuscript provides no explicit description of the error classification taxonomy (technical vs. interpretive vs. narrative/domain), the procedure for identifying 'human-overlooked' errors, or any inter-rater reliability assessment for how reviews were scored or categorized. Without these details or examples, the quantitative basis for the jaggedness claim cannot be evaluated.
- [Results] The conclusion that jaggedness is 'primarily a property of the underlying AI model rather than the specific instructions' is based on observing similar patterns across four agents. No quantitative similarity metric, correlation across performance categories, or statistical comparison between configurations is reported to substantiate this over a qualitative impression.
- [Introduction and Discussion] The title and abstract frame the study as evidence for jagged AI in 'scientific peer review,' yet the testbed uses student course projects and student reviewers. This context likely features simpler implementation errors and lower baseline expertise than professional reviews of published work; the paper does not quantify or discuss how this affects the generalizability of the observed profile.
minor comments (2)
- [Abstract] The phrase 'Claude Code with differing instructions implemented as skill files' is unclear without a brief definition or reference explaining what skill files entail in this implementation.
- [Methods] The manuscript would benefit from a table summarizing the four agent configurations and their key instruction differences to aid reader understanding of the 'similar jaggedness' finding.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights areas where the manuscript can be clarified and strengthened. We address each major comment below, indicating where revisions will be made to improve transparency and rigor while preserving the core empirical findings.
Point-by-point responses
- Referee: [Data and Methods] The manuscript provides no explicit description of the error classification taxonomy (technical vs. interpretive vs. narrative/domain), the procedure for identifying 'human-overlooked' errors, or any inter-rater reliability assessment for how reviews were scored or categorized. Without these details or examples, the quantitative basis for the jaggedness claim cannot be evaluated.
Authors: We agree that the Methods section would benefit from greater explicitness on these points. In the revised manuscript we will add a dedicated subsection describing: (1) the full error classification taxonomy with definitions and examples drawn from the POMP reviews; (2) the exact procedure used to flag human-overlooked errors (systematic side-by-side comparison of AI and human review texts, with discrepancies coded only when the AI identified a verifiable technical or methodological issue absent from all human reviews); and (3) the consensus-based scoring process employed by the author team, including any informal reliability checks performed. This addition will allow readers to evaluate the quantitative comparisons directly. revision: yes
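The coding rule the authors describe (an AI-flagged issue counts as "human-overlooked" only when it is verifiable and absent from every human review of the same project) amounts to a set intersection and difference. A hypothetical sketch, with made-up issue labels and a helper name of our own choosing:

```python
# Hypothetical sketch of the coding step described above. An AI-flagged
# issue is "human-overlooked" only if it was verified as a real error AND
# no human review of the project mentioned it. Labels are illustrative.

def human_overlooked(ai_issues, human_reviews, verified):
    """Return AI-flagged, verified issues absent from all human reviews.

    ai_issues     -- set of issue labels flagged by the AI agent
    human_reviews -- list of sets, one per human review of the project
    verified      -- set of issue labels confirmed as genuine errors
    """
    mentioned_by_humans = set().union(*human_reviews) if human_reviews else set()
    return (ai_issues & verified) - mentioned_by_humans

ai = {"filtering-failure", "log-scale-misread", "weak-conclusion"}
humans = [{"weak-conclusion"}, {"plot-labels"}]
confirmed = {"filtering-failure", "log-scale-misread"}
print(sorted(human_overlooked(ai, humans, confirmed)))
# -> ['filtering-failure', 'log-scale-misread']
```

The verification filter is the load-bearing step: without it, AI hallucinated issues that humans reasonably ignored would inflate the technical-detection advantage.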
- Referee: [Results] The conclusion that jaggedness is 'primarily a property of the underlying AI model rather than the specific instructions' is based on observing similar patterns across four agents. No quantitative similarity metric, correlation across performance categories, or statistical comparison between configurations is reported to substantiate this over a qualitative impression.
Authors: The referee correctly notes that the similarity claim rests on qualitative pattern consistency rather than formal metrics. While the four agents were configured with deliberately divergent skill files, we did not compute cross-agent correlations or similarity indices in the original analysis. In revision we will add a supplementary table reporting pairwise correlations (or other appropriate similarity measures) of the per-category performance scores across the four agents, together with a brief statistical note on the consistency of the jagged profile. This will provide quantitative support for the claim that the underlying model, rather than prompt configuration, drives the observed pattern. revision: yes
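The proposed supplementary analysis, pairwise correlations of per-category performance scores across the four agents, can be sketched as follows; the scores, category ordering, and agent names are placeholders invented for illustration, not the paper's data.

```python
# Hypothetical sketch of the cross-agent similarity check proposed above:
# pairwise Pearson correlations of per-category performance scores.
# All numbers and agent names below are illustrative placeholders.
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Per-category detection rates, ordered as: technical, inference,
# interpretive, narrative, domain critique (made-up values).
scores = {
    "agent-1": [0.85, 0.80, 0.35, 0.30, 0.25],
    "agent-2": [0.90, 0.75, 0.30, 0.35, 0.20],
    "agent-3": [0.80, 0.85, 0.40, 0.25, 0.30],
    "agent-4": [0.88, 0.78, 0.33, 0.28, 0.27],
}

for a, b in combinations(scores, 2):
    print(f"{a} vs {b}: r = {pearson(scores[a], scores[b]):.2f}")
```

Uniformly high pairwise correlations would quantify the "similar jaggedness" claim; a rank-based measure such as Spearman's rho would be the natural robustness check given only five categories.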
- Referee: [Introduction and Discussion] The title and abstract frame the study as evidence for jagged AI in 'scientific peer review,' yet the testbed uses student course projects and student reviewers. This context likely features simpler implementation errors and lower baseline expertise than professional reviews of published work; the paper does not quantify or discuss how this affects the generalizability of the observed profile.
Authors: We acknowledge that the educational setting introduces limits on direct extrapolation to professional peer review. The revised Discussion will explicitly address this by (a) describing the nature of the POMP projects (graduate-level but still course-based), (b) noting that the technical-versus-interpretive jaggedness may be more pronounced or attenuated in expert reviews of published work, and (c) framing the results as evidence for a capability profile that can inform hybrid workflows rather than a definitive characterization of all scientific peer review. We cannot, however, provide a quantitative adjustment for generalizability without new data from professional reviews, which lies outside the current study scope. revision: partial
Circularity Check
No circularity: direct empirical comparison without derivations or self-referential constructions
Full rationale
This is an observational empirical study that directly compares AI-generated peer reviews against human student reviews for 72 anonymized POMP course projects. No mathematical derivations, equations, fitted parameters, or predictions appear in the reported analysis. The central claim of a jagged AI capability profile is presented as an observed pattern in the collected data rather than derived from any self-definition, ansatz, or self-citation chain. The study is self-contained against its own testbed; generalizability concerns (e.g., student vs. professional context) are limitations of scope, not circular reductions of the reported findings to their inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 72 anonymized POMP student projects and peer reviews constitute a valid and representative sample for evaluating AI peer-review capabilities in mechanistic dynamic modeling.
Reference graph
Works this paper leans on
- [1] Anthropic. 2025. "Sub-agents." https://code.claude.com/docs/en/sub-agents. Accessed February 2026.
- [2] Dell'Acqua, Fabrizio, Edward McFowland III, Ethan Mollick, et al. 2026. "Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality." Organization Science 37 (2): 403--23. https://doi.org/10.1287/orsc.2025.21838
- [3] Ionides, Edward L., Dao Nguyen, Yves Atchadé, Stilian Stoev, and Aaron A. King. 2015. "Inference for Dynamic and Latent Variable Models via Iterated, Perturbed Bayes Maps." Proceedings of the National Academy of Sciences of the USA 112 (3): 719--724. https://doi.org/10.1073/pnas.1410597112
- [4] King, Aaron A., Dao Nguyen, and Edward L. Ionides. 2016. "Statistical Inference for Partially Observed Markov Processes via the R Package pomp." Journal of Statistical Software 69 (12): 1--43. https://doi.org/10.18637/jss.v069.i12
- [5]
- [6]
- [7] Morris, Meredith Ringel, Dan Altman, Haydn Belfield, et al. 2026. "Characterizing Model Jaggedness Supports Safety and Usability." Preprint. https://www-cs.stanford.edu/~merrie/papers/jaggedness_preprint.pdf
- [8] Vaccaro, Michelle, Abdullah Almaatouq, and Thomas Malone. 2024. "When Combinations of Humans and AI Are Useful: A Systematic Review and Meta-Analysis." Nature Human Behaviour 8 (12): 2293--303
- [9] Wheeler, Jesse, Anna Rosengart, Zhuoxun Jiang, Kevin Tan, Noah Treutle, and Edward L. Ionides. 2024. "Informing Policy via Dynamic Models: Cholera in Haiti." PLOS Computational Biology 20: e1012032. https://doi.org/10.1371/journal.pcbi.1012032