pith. machine review for the scientific record.

arxiv: 2605.12131 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Rollout Cards: A Reproducibility Standard for Agent Research

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:55 UTC · model grok-4.3

classification: 💻 cs.AI
keywords: reproducibility · agent research · rollout records · reporting rules · benchmarking · reinforcement learning · evaluation standards

The pith

Rollout records should replace reported scores as the unit of reproducibility in agent research, enabling re-grading under different rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that agent research suffers from reproducibility problems because papers report scores without providing the underlying rollout records that generated them. An audit of 50 repositories finds that none report failure rates and documents 37 cases where different reporting rules change the score for the same evidence. The authors introduce rollout cards: standardized bundles that include the full rollout records along with the specific views and rules applied to produce the reported numbers. Validation shows that re-applying different rules to preserved outputs can shift scores by up to 20.9 absolute percentage points and invert some model rankings. A sympathetic reader would care because this approach makes evaluations transparent and allows progress to be measured consistently as evaluation standards change.

Core claim

We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. Re-grading preserved benchmark outputs shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models.

What carries the argument

Rollout cards, which are publication bundles preserving the rollout record and declaring the views, reporting rules, and drops manifests used to compute reported scores.
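
The paper's concrete rollout-card schema is not reproduced on this page, so the sketch below is only a guess at what such a bundle might contain, assembled from the components named above (rollout record, views, reporting rules, drops manifest). Every class and field name here is illustrative, not the format shipped with Ergon.

    # Illustrative rollout-card bundle; field names are assumptions, not the Ergon schema.
    from dataclasses import dataclass, field
    from typing import Any

    @dataclass
    class RolloutStep:
        action: str                # a tool call or message emitted by the agent
        observation: Any           # full environment or tool response, preserved verbatim
        error: str | None = None   # error trace if the step failed

    @dataclass
    class RolloutRecord:
        task_id: str
        model: str
        steps: list[RolloutStep] = field(default_factory=list)
        outcome: str = "unknown"   # "success", "failure", "error", or "skipped"

    @dataclass
    class RolloutCard:
        records: list[RolloutRecord]       # the complete rollout record, nothing silently dropped
        views: dict[str, str]              # which slice of each rollout a reported metric inspects
        reporting_rules: dict[str, str]    # how per-run outcomes aggregate into headline numbers
        drops_manifest: list[str] = field(default_factory=list)  # excluded task_ids, with the exclusion on record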

If this is right

  • Four partial public releases in tool safety, multi-agent systems, theorem proving, and search allow new analyses not in the original reports.
  • Re-grading across short-answer, code-generation, and tool-use tasks reveals score changes of up to 20.9 points from rule variations alone (a minimal sketch of such re-grading follows this list).
  • Model rankings can invert when the same rollout data is scored under different rules.
  • Open-source implementation in Ergon and published rollout-card exports support standardized evaluation across tool use, software engineering, web interaction, multi-agent coordination, safety, and search.
  • None of the 50 audited repositories report how many runs failed, errored, or were skipped alongside headline scores.
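
A minimal re-grading sketch, assuming one hypothetical pair of reporting rules (count errored runs as failures versus drop them from the denominator) and fabricated outcome counts chosen only to show how a ranking can flip; neither the rules nor the numbers come from the paper.

    # Re-grade the same preserved outcomes under two hypothetical reporting rules.

    def score_errors_as_failures(outcomes: list[str]) -> float:
        # Rule A: every preserved run stays in the denominator; errored runs count as failures.
        return outcomes.count("success") / len(outcomes)

    def score_drop_errors(outcomes: list[str]) -> float:
        # Rule B: errored runs are silently dropped before the success rate is computed.
        kept = [o for o in outcomes if o != "error"]
        return kept.count("success") / len(kept) if kept else 0.0

    # Fabricated outcome lists for two hypothetical models on the same task set.
    preserved = {
        "model_a": ["success"] * 55 + ["failure"] * 45,                   # reliable, never errors
        "model_b": ["success"] * 45 + ["failure"] * 25 + ["error"] * 30,  # errors often, strong when it finishes
    }

    for rule_name, rule in [("errors-as-failures", score_errors_as_failures),
                            ("errors-dropped", score_drop_errors)]:
        scores = {model: round(rule(outcomes), 3) for model, outcomes in preserved.items()}
        ranking = sorted(scores, key=scores.get, reverse=True)
        print(f"{rule_name}: {scores} -> ranking {ranking}")

    # errors-as-failures: model_a 0.55 vs model_b 0.45; errors-dropped: model_b 0.643 vs model_a 0.55.
    # Same preserved evidence, a different reporting rule, an inverted ranking.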

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting rollout cards could reduce the need to re-run experiments when evaluation criteria evolve.
  • Storing rollout records might enable automated detection of inconsistencies in reported metrics.
  • This standard could extend to other domains like robotics where trajectory data is central to performance claims.
  • Future benchmarks might require rollout card submissions as a condition for publication.

Load-bearing premise

The full rollout records contain enough information to reconstruct and re-evaluate any reported score under alternative rules without needing additional hidden state or implementation details from the original system.

What would settle it

Finding a preserved rollout record where applying the declared reporting rule does not reproduce the original reported score, or where a new rule cannot be applied due to missing information in the record.
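
A sketch of that test under assumed interfaces: a card that exposes its preserved record to a callable reporting rule. The check re-applies the declared rule and flags either a score mismatch or a rule that cannot be evaluated from the preserved information; the function name and error convention are illustrative, not part of the paper's released tooling.

    # Illustrative consistency check; 'card', 'declared_rule', and the KeyError
    # convention for missing fields are assumptions made for this sketch only.

    def check_card(card, declared_rule, reported_score: float, tol: float = 1e-6):
        """Return None if the preserved record reproduces the reported score, else a discrepancy."""
        try:
            recomputed = declared_rule(card)              # re-apply the declared reporting rule
        except KeyError as missing:
            # The second disproof condition above: a rule that cannot be applied
            # because the record lacks the information it needs.
            return f"rule not applicable: record is missing {missing}"
        if abs(recomputed - reported_score) > tol:
            # The first disproof condition: the declared rule fails to reproduce the reported score.
            return f"declared rule yields {recomputed:.3f}, report claims {reported_score:.3f}"
        return None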

Figures

Figures reproduced from arXiv: 2605.12131 by Charlie Masters, Stefano V. Albrecht, Ziyuan Liu.

Figure 1: Public rollout releases contain analyses not reported by their original benchmark scores.
Figure 2: One sample from a rollout card, rendered by a reference viewer over a ResearchRubrics
Original abstract

Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publicly publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that agent research reproducibility is undermined by reliance on reported scores without preserving rollout records, as the same behavior can yield different scores under varying selection or reporting rules. A structured audit of 50 repositories finds none report failure/error/skip counts and documents 37 cases where rules alter success rates, costs, or timings for fixed evidence. It proposes rollout cards as bundles preserving full records plus declarations of views, rules, and drop manifests. Validation occurs in two settings: re-analysis of partial public releases across domains and re-grading of preserved outputs showing up to 20.9 pp score shifts with occasional frontier-model ranking inversions. A reference implementation is released in Ergon with public benchmark exports.

Significance. If the empirical results hold, this provides a practical, record-based standard that could meaningfully advance reproducibility practices in agent research, where tasks involve tool use, web interaction, and multi-agent coordination. Strengths include the open release of code and data exports, multi-domain validation, and emphasis on falsifiable rollout preservation over self-referential scores. The work addresses a timely gap as agentic systems scale.

major comments (3)
  1. Abstract: The structured audit of 50 repositories and the identification of 37 cases provide no methodological details on repository selection, search protocol, case identification criteria, or the exact re-grading procedure used to obtain the 20.9 pp figure. This is load-bearing for the central quantitative claims.
  2. Validation settings (second setting): The re-grading experiment that produces score shifts and ranking inversions assumes preserved outputs contain every datum required by alternative scoring functions (full tool responses, error traces, exact kept/dropped subsets). The manuscript does not demonstrate that the proposed rollout-card format captures these elements without re-execution or private state, leaving the headline result vulnerable.
  3. Reproducibility proposal: It is not shown that rollout cards suffice to reconstruct any reported score under arbitrary rules for tool-use and web-interaction tasks, where hidden environment responses or selection metadata may be required; the audit documents omissions but does not close this gap.
minor comments (1)
  1. Abstract: The four partial public releases used in the first validation setting are not named or linked, reducing immediate verifiability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas where our manuscript requires greater methodological transparency and clarification. We address each major comment point by point below, indicating where revisions will be made to the next version of the paper.

Point-by-point responses
  1. Referee: Abstract: The structured audit of 50 repositories and the identification of 37 cases provide no methodological details on repository selection, search protocol, case identification criteria, or the exact re-grading procedure used to obtain the 20.9 pp figure. This is load-bearing for the central quantitative claims.

    Authors: We agree that the manuscript currently lacks the necessary methodological details on how the 50 repositories were selected, the search protocol employed, the criteria used to identify the 37 cases, and the precise steps in the re-grading procedure yielding the 20.9 percentage point shifts. These elements are indeed central to the quantitative claims. In the revised manuscript, we will expand the relevant sections (primarily Methods and the description of the audit) to include: explicit criteria for repository selection (e.g., popularity via GitHub stars, citations, and relevance to agent benchmarks), the search terms and sources used, the identification criteria for cases where reporting rules alter outcomes, and a step-by-step account of the re-grading process with examples of the rules applied and how preserved outputs were processed. revision: yes

  2. Referee: Validation settings (second setting): The re-grading experiment that produces score shifts and ranking inversions assumes preserved outputs contain every datum required by alternative scoring functions (full tool responses, error traces, exact kept/dropped subsets). The manuscript does not demonstrate that the proposed rollout-card format captures these elements without re-execution or private state, leaving the headline result vulnerable.

    Authors: The referee correctly notes that the re-grading results rest on an assumption about the completeness of preserved outputs. The rollout-card format is designed to bundle full rollout records (including tool responses, error traces, and drop manifests), but the manuscript does not provide an explicit demonstration or verification that this format captures all required data for alternative scoring functions across the evaluated tasks without needing re-execution or private state. We will revise the Validation section to include the detailed rollout-card schema, concrete examples from the short-answer, code-generation, and tool-use tasks illustrating how the preserved elements enable re-grading, and an expanded limitations discussion addressing cases (such as certain web interactions) where additional metadata may still be required. revision: partial

  3. Referee: Reproducibility proposal: It is not shown that rollout cards suffice to reconstruct any reported score under arbitrary rules for tool-use and web-interaction tasks, where hidden environment responses or selection metadata may be required; the audit documents omissions but does not close this gap.

    Authors: We do not claim that rollout cards enable reconstruction of scores under arbitrary rules in every conceivable case, especially for tool-use and web-interaction tasks that may depend on hidden environment responses or private selection metadata not present in the rollout record. The manuscript positions rollout cards as a standard that preserves complete records and declares the specific views, rules, and drop manifests used, thereby addressing the documented omissions in existing repositories and enabling re-analysis where the preserved data suffices. The two validation settings illustrate this in practice. We will add clarifying language in the Reproducibility proposal and Discussion sections to delineate the scope and limitations, but we do not intend to alter the core proposal or add a universal proof, as that exceeds what the current evidence supports. revision: no

Circularity Check

0 steps flagged

No significant circularity: claims rest on external audit and re-grading

full rationale

The paper's central results derive from a structured audit of 50 public repositories (finding zero reports of failure counts) and re-grading experiments on preserved benchmark outputs that produce the 20.9 pp score shifts. These quantities are computed directly from existing rollout data under alternative reporting rules; they do not reduce to any quantity defined by the rollout-card proposal itself. The introduction of rollout cards is a definitional standardization step whose validation uses independent evidence rather than self-referential fitting, self-citation chains, or ansatz smuggling. No equations, fitted parameters, or uniqueness theorems appear in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim introduces the new concept of rollout cards and relies on empirical observations from an audit and re-grading experiments; no free parameters, mathematical axioms, or invented physical entities are used.

invented entities (1)
  • rollout card · no independent evidence
    purpose: A publication bundle that preserves the rollout record and declares the views, reporting rules, and drops manifests behind reported scores
    New artifact defined by the paper to serve as the unit of reproducibility

pith-pipeline@v0.9.0 · 5586 in / 1398 out tokens · 29281 ms · 2026-05-13T05:55:10.056608+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
