{"paper":{"title":"Jagged AI in Scientific Peer Review: Evidence from POMP Data Analysis","license":"http://creativecommons.org/licenses/by/4.0/","headline":"AI reviewers catch technical errors in POMP analyses that humans miss but fall short on interpretive and narrative checks.","cross_cats":[],"primary_cat":"stat.AP","authors_text":"Edward L. Ionides, Jin Wook Lee, William Szegda, Zhisheng Song","submitted_at":"2026-05-08T15:17:29Z","abstract_excerpt":"Despite their growing use in academic writing and statistical analysis, the performance of artificial intelligence (AI) tools in scientific peer review remains a largely unexplored area. A key challenge is jagged AI, a phenomenon where AI exhibits strong ability spikes in some domains while remaining deficient in others. To study this jaggedness in a practical data science context, we considered the task of reviewing partially observed Markov process (POMP) data analyses. POMP models, also known as state-space models or hidden Markov models, are used to fit mechanistic dynamic models to time s"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"AI reviewers exhibited a jagged capability profile, proficiently catching human-overlooked technical errors and invalid inference methodology, while failing to match human standards in checking interpretive errors, narrative coherence, and domain-informed model critique. The jaggedness was found to be similar for all agents, consistent with it being primarily a property of the underlying AI model rather than the specific instructions.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 72 anonymized student POMP projects and their human peer reviews form a representative and unbiased testbed for general AI performance in scientific peer review of mechanistic dynamic models.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"AI reviewers of POMP data analyses detect technical and methodological errors effectively but underperform humans on interpretive, narrative, and domain-informed critique, showing consistent jaggedness.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"AI reviewers catch technical errors in POMP analyses that humans miss but fall short on interpretive and narrative checks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1addfa079faabeba7a4fffca6686e72f1102f03704b0007eb694b27f360aa986"},"source":{"id":"2605.07855","kind":"arxiv","version":2},"verdict":{"id":"93f579c8-28e5-4938-95e7-cdb7b13c8ad8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-11T02:08:11.885846Z","strongest_claim":"AI reviewers exhibited a jagged capability profile, proficiently catching human-overlooked technical errors and invalid inference methodology, while failing to match human standards in checking interpretive errors, narrative coherence, and domain-informed model critique. The jaggedness was found to be similar for all agents, consistent with it being primarily a property of the underlying AI model rather than the specific instructions.","one_line_summary":"AI reviewers of POMP data analyses detect technical and methodological errors effectively but underperform humans on interpretive, narrative, and domain-informed critique, showing consistent jaggedness.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 72 anonymized student POMP projects and their human peer reviews form a representative and unbiased testbed for general AI performance in scientific peer review of mechanistic dynamic models.","pith_extraction_headline":"AI reviewers catch technical errors in POMP analyses that humans miss but fall short on interpretive and narrative checks."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.07855/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T15:31:18.602322Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T11:28:32.297449Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"d35f8b8a3de6c0dfffdff882e474240b41c0c319f50e0d1523662ae48d00dd93"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}