
arxiv: 2607.02134 · v1 · pith:JMTCB44Pnew · submitted 2026-07-02 · 💻 cs.AI

Coding-agents can replicate scientific machine learning papers

Pith reviewed 2026-07-03 13:03 UTC · model grok-4.3

classification 💻 cs.AI
keywords coding agents · paper replication · scientific machine learning · reproducibility · computational claims · workflow · targets · evidence matching

The pith

A workflow turns paper claims into tracked targets so coding agents can replicate scientific machine learning results with verifiable evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that coding agents can be guided to reproduce the computational claims made in scientific machine learning papers when the task is structured as a sequence of recorded targets. It defines Paper-replication as the mechanism that records each claim, reconstructs the method from the paper text, executes the required experiments, links outputs to provenance, and checks that evidence appears in the final report. Across twelve independent runs on four papers the workflow produced completed workspaces in every case and matched evidence for all 158 targets. The approach shifts the measure of success from the agent's concluding message to the presence of linked, validated evidence inside the workspace.

Core claim

Paper-replication records each selected paper claim as a target with evidence, reconstructs the method from the paper materials alone, runs the computational experiments, links generated outputs to provenance and direct comparisons with the original claims, records the location of matched evidence inside the replication report, and requires validation checks before the workspace is marked complete. In twelve runs across four papers every workspace passed the completion gate and every one of the 158 recorded targets was matched with report coverage, even though the runs differed in target division, numerical fidelity, elapsed time, number of intermediate executions, and acceptance rules.
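The completion gate described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Target` record, its field names, and the two helper functions are hypothetical, chosen only to mirror the steps named in the paragraph (recorded evidence, provenance, comparison with the claim, report location, validation before completion).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Target:
    """One recorded paper claim. All field names are hypothetical."""
    claim: str                             # claim text taken from the paper
    evidence_path: Optional[str] = None    # generated output that supports the claim
    provenance: Optional[str] = None       # e.g. the script or command that produced it
    comparison: Optional[str] = None       # direct comparison with the original claim
    report_location: Optional[str] = None  # where the matched evidence appears in the report


def validation_failures(targets: List[Target]) -> List[str]:
    """Return one message per missing record; an empty list means the gate passes."""
    failures = []
    if not targets:
        failures.append("no targets recorded")
    for i, t in enumerate(targets):
        for name in ("evidence_path", "provenance", "comparison", "report_location"):
            if not getattr(t, name):
                failures.append(f"target {i}: missing {name}")
    return failures


def workspace_complete(targets: List[Target]) -> bool:
    """Completion depends on the workspace state, not on the agent's final message."""
    return not validation_failures(targets)
```

In this sketch a workspace with a single target that lacks a report location would fail the gate, regardless of what the agent says in its closing message. The sketch checks only that records are present, which is also why the referee's concern about fidelity applies: presence of evidence is not the same as correctness of evidence.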

What carries the argument

The Paper-replication workflow, which converts each paper claim into a target that must be matched with provenance-linked evidence and pass explicit validation checks.

If this is right

  • Replication success becomes a property of the workspace state rather than the agent's final statement.
  • Each claim receives an explicit record of where matching evidence appears in the report.
  • Variations in target division and acceptance rules can occur even after all targets are covered.
  • The process produces a report that directly supports or refutes the paper's original computational claims.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same target-tracking structure could be applied to computational claims in papers outside machine learning.
  • Rules for splitting claims into targets and accepting evidence could be refined to reduce variation across runs.
  • Measuring exact numerical agreement versus approximate agreement would give a finer test of replication quality.
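As a rough illustration of the last point, the sketch below separates exact from approximate numerical agreement. The function name and the default relative tolerance of 5% are assumptions made for this example; the paper does not specify such a rule here.

```python
import math


def agreement(reported: float, reproduced: float, rel_tol: float = 0.05) -> str:
    """Classify a reproduced value against the value reported in the paper.

    rel_tol is a hypothetical tolerance, not a threshold taken from the paper.
    """
    if reproduced == reported:
        return "exact"
    # math.isclose compares the difference against rel_tol * max(|a|, |b|).
    if math.isclose(reproduced, reported, rel_tol=rel_tol):
        return "approximate"
    return "mismatch"
```

Recording the label per target, rather than a single pass/fail flag, would make the variation in numerical fidelity across runs directly comparable.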

Load-bearing premise

The coding agent can reconstruct the paper's method from the given materials and generate correct computational evidence without external domain knowledge or implementation mistakes.

What would settle it

A completed workspace in which the generated numerical results or method steps differ from the paper in a way that makes the evidence invalid for the recorded targets.

Figures

Figures reproduced from arXiv: 2607.02134 by Atharva Hans, Ilias Bilionis.

Figure 1: The Paper-replication workflow centers on two mechanisms: a persistent workspace that records the agent’s state, and validation checks that decide whether …
Figure 2: Per-run workspace evidence for the twelve case-study runs. Panel (a) shows elapsed replication time. Panel (b) shows the final target count in the …
Figure 3: Target coverage across runs. Points show the final target count recorded …
Figure 4: Paper-anchored scalar fidelity for the thirteen standardized numeric anchors. Panel (a) shows the base-ten logarithm of the reproduced discrepancy divided …
Figure 5: Elapsed replication time and correction work across papers. Panel (a) shows the posterior distribution of per-paper elapsed replication time. The dot …
Figure 6: Judgment variation across runs. Bars show the fraction of aligned …
Original abstract

Scientific machine learning papers typically make computational claims, e.g., that the relative mean square error is less than 5% or that the 95% predictive credible interval covers the test data. A coding agent can be prompted to replicate those claims from paper materials alone, but the prompt does not by itself reliably preserve progress or check whether generated evidence supports the paper's claims. We introduce Paper-replication, a workflow that makes each selected paper claim a target with recorded evidence, and implement it as a coding-agent skill. The workflow makes the agent record those targets, reconstruct the paper's method, run computational experiments, link generated outputs to provenance and comparisons with the paper's claims, record where matched evidence appears in the replication report, and pass validation checks before completion. We evaluate Paper-replication on twelve independent runs across four scientific machine learning papers. All twelve workspaces pass the completion gate, and all 158 recorded targets are matched with report coverage. Even in this completed workspace state, repeated runs differ in how papers are divided into targets, in numerical fidelity to the source papers, in elapsed replication time, in the number of intermediate executions replaced before final evidence is accepted, and in the rules used to accept evidence. Paper-replication makes completion depend on workspace evidence and validation checks rather than on the agent's final message.
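The two example claims in the abstract can be checked with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the definition of relative mean square error used here (sum of squared errors divided by the sum of squared true values) is one common convention, which may differ from the definition in any given source paper.

```python
from typing import Sequence


def relative_mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Sum of squared errors divided by the sum of squared true values (assumed definition)."""
    num = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    den = sum(t ** 2 for t in y_true)
    return num / den


def interval_coverage(
    y_true: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> float:
    """Fraction of test points that fall inside their predictive interval."""
    inside = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi)
    return inside / len(y_true)


# A claim such as "relative mean square error is less than 5%" then becomes a check:
claim_holds = relative_mse([3.0, 4.0], [3.0, 3.0]) < 0.05
```

A target built from such a claim would store the computed value alongside the threshold, so that the comparison with the paper is recorded rather than asserted.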

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Paper-replication, a structured workflow implemented as a coding-agent skill that converts computational claims from scientific ML papers into recorded targets, requires the agent to reconstruct methods, generate evidence with provenance links and claim comparisons, and pass validation checks before declaring completion. On twelve independent runs across four papers, all workspaces reach the completion gate and all 158 targets are matched with report coverage, although the abstract notes variability across runs in numerical fidelity, target decomposition, elapsed time, intermediate executions, and acceptance rules.

Significance. If the workflow produces replications whose evidence faithfully matches the source claims in both method and numerical outcome, the approach would offer a concrete, auditable mechanism for automated scientific reproducibility that goes beyond unstructured prompting. The explicit recording of targets, provenance, and validation steps is a methodological strength that could be adopted more broadly; however, the dependence on internal agent-driven checks and the acknowledged variability in fidelity limit the immediate impact until external validation is demonstrated.

major comments (2)
  1. [Abstract] The headline claim that 'all twelve workspaces pass the completion gate, and all 158 recorded targets are matched with report coverage' is load-bearing for the central thesis, yet the same paragraph records variability in numerical fidelity and acceptance rules; without quantitative bounds on acceptable deviation or an independent fidelity metric, it remains unclear whether matched targets constitute accurate reconstruction of the original methods and results.
  2. [Abstract] Abstract and evaluation description: success is defined entirely by the agent's selection of targets, its own comparisons to the source claims, and satisfaction of the workflow's internal validation checks; this creates a risk that a workspace can complete while the generated implementation deviates from the paper's method or produces only loosely corresponding numerical outcomes, exactly the concern raised by the weakest assumption in the evaluation design.
minor comments (2)
  1. The manuscript would benefit from an explicit methods subsection detailing the precise acceptance rules used for evidence and how they were held constant (or allowed to vary) across the twelve runs.
  2. Clarify whether the four source papers were chosen according to pre-specified criteria (e.g., computational claims only, open code, etc.) so that the 12-run evaluation can be assessed for selection bias.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on the evaluation design in our manuscript. Below we respond point-by-point to the major comments.

  1. Referee: [Abstract] The headline claim that 'all twelve workspaces pass the completion gate, and all 158 recorded targets are matched with report coverage' is load-bearing for the central thesis, yet the same paragraph records variability in numerical fidelity and acceptance rules; without quantitative bounds on acceptable deviation or an independent fidelity metric, it remains unclear whether matched targets constitute accurate reconstruction of the original methods and results.

    Authors: The claim in the abstract is factual with respect to the workflow: every workspace satisfied the completion criteria, and every target had associated report coverage through the required provenance and comparison steps. The variability in numerical fidelity is explicitly noted to indicate that 'matched' refers to the presence of linked evidence and a comparison step, not necessarily to zero deviation. We do not provide quantitative bounds because the acceptance rules are part of the workflow's internal validation and vary by target type (e.g., error thresholds for MSE claims). We will revise the manuscript to add a short clarification in the abstract and evaluation section explaining that matching is determined by the workflow's validation checks rather than an external metric. This constitutes a partial revision. revision: partial

  2. Referee: [Abstract] Abstract and evaluation description: success is defined entirely by the agent's selection of targets, its own comparisons to the source claims, and satisfaction of the workflow's internal validation checks; this creates a risk that a workspace can complete while the generated implementation deviates from the paper's method or produces only loosely corresponding numerical outcomes, exactly the concern raised by the weakest assumption in the evaluation design.

    Authors: The design intentionally places the validation inside the workflow to create an auditable record of targets, evidence, and comparisons. The agent must record targets from the paper, generate evidence with provenance, perform comparisons, and pass the checks; completion is not granted by the agent's final message alone. While this does not eliminate the possibility of loose correspondence, the requirement for explicit links and report coverage makes deviations traceable. We agree that this is a limitation of the current evaluation and does not substitute for external validation. No revision is planned for this point as it reflects the stated scope of the work. revision: no

standing simulated objections not resolved
  • Demonstration of external validation or independent fidelity assessment of the generated replications.

Circularity Check

0 steps flagged

No circularity: success metrics tied to external paper claims, not internal definitions


The paper describes an empirical workflow evaluation on twelve runs across four independent scientific machine learning papers, reporting that all workspaces pass completion and all 158 targets match with report coverage. No equations, fitted parameters, or derivations appear in the provided text. The central claim rests on matching generated evidence to claims extracted from external source papers rather than any self-referential reduction, self-citation chain, or renaming of known results. The workflow's internal validation checks serve as an implementation detail for the evaluation protocol but do not force the reported success rate by construction, as the targets originate outside the workflow.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of the workflow; no free parameters are introduced, and the only background assumption is the agent's general capability to follow structured instructions from text.

axioms (1)
  • domain assumption Coding agents prompted with paper materials alone can reconstruct and execute the computational methods described in scientific ML papers.
    Invoked in the description of the workflow and the evaluation setup.

pith-pipeline@v0.9.1-grok · 5759 in / 1124 out tokens · 42837 ms · 2026-07-03T13:03:06.633636+00:00 · methodology

