pith. machine review for the scientific record.

arxiv: 2505.20662 · v4 · submitted 2025-05-27 · 💻 cs.AI


AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage

classification 💻 cs.AI
keywords: reproduction, code, autoreproduce, execution, fidelity, lineage
Original abstract

Efficient reproduction of research papers is pivotal to accelerating scientific progress. However, the increasing complexity of proposed methods often renders reproduction a labor-intensive endeavor requiring deep domain expertise. To address this, we introduce paper lineage, an algorithm that systematically mines implicit knowledge from the cited literature. This algorithm serves as the backbone of our proposed AutoReproduce, a multi-agent framework designed to autonomously reproduce experimental code in a complete, end-to-end manner. To ensure code executability, AutoReproduce incorporates a sampling-based unit-testing strategy for rapid validation. To assess reproduction capabilities, we introduce ReproduceBench, a benchmark featuring verified implementations, alongside comprehensive metrics for evaluating both reproduction and execution fidelity. Extensive evaluations on PaperBench and ReproduceBench demonstrate that AutoReproduce consistently surpasses existing baselines across all metrics, yielding substantial improvements in reproduction fidelity and final execution performance. The code is available at https://github.com/AI9Stars/AutoReproduce.

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

    cs.AI 2026-04 unverdicted novelty 7.0

LLMs predict the outcomes of real scientific experiments with 14-26% accuracy, comparable to human experts, but lack calibration in their prediction confidence, whereas humans demonstrate strong calibration.

  2. FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

    cs.AI 2026-04 conditional novelty 7.0

    FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.

  3. HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution

    cs.CL 2026-04 unverdicted novelty 6.0

    HiRAS introduces hierarchical multi-agent coordination for paper-to-code generation and experiment reproduction, claiming over 10% relative gains over prior state-of-the-art on a refined benchmark with reduced hallucination.