pith. machine review for the scientific record.

arxiv: 2605.12131 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Rollout Cards: A Reproducibility Standard for Agent Research

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:55 UTC · model grok-4.3

classification: 💻 cs.AI
keywords: reproducibility · agent research · rollout records · reporting rules · benchmarking · reinforcement learning · evaluation standards

The pith

Rollout records should replace reported scores as the unit of reproducibility in agent research, enabling re-grading under different rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that agent research suffers from reproducibility problems because papers report scores without providing the underlying rollout records that generated them. An audit of 50 repositories finds that none report failure rates and documents 37 cases where different reporting rules change the score for the same evidence. The authors introduce rollout cards: standardized bundles that include the full rollout records along with the specific views and rules applied to produce the reported numbers. Validation shows that re-applying different rules to preserved outputs can shift scores by up to 20.9 absolute percentage points and invert some model rankings. A sympathetic reader would care because this approach makes evaluations transparent and allows progress to be measured consistently as evaluation standards change.

Core claim

We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. Re-grading preserved benchmark outputs shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models.

What carries the argument

Rollout cards, which are publication bundles preserving the rollout record and declaring the views, reporting rules, and drops manifests used to compute reported scores.
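
The paper's concrete rollout-card schema is not reproduced on this page, so the sketch below is only a guess at what such a bundle might contain, assembled from the components named above (rollout record, views, reporting rules, drops manifest). Every class and field name here is illustrative, not the format shipped with Ergon.

    # Illustrative rollout-card bundle; field names are assumptions, not the Ergon schema.
    from dataclasses import dataclass, field
    from typing import Any

    @dataclass
    class RolloutStep:
        action: str                # a tool call or message emitted by the agent
        observation: Any           # full environment or tool response, preserved verbatim
        error: str | None = None   # error trace if the step failed

    @dataclass
    class RolloutRecord:
        task_id: str
        model: str
        steps: list[RolloutStep] = field(default_factory=list)
        outcome: str = "unknown"   # "success", "failure", "error", or "skipped"

    @dataclass
    class RolloutCard:
        records: list[RolloutRecord]       # the complete rollout record, nothing silently dropped
        views: dict[str, str]              # which slice of each rollout a reported metric inspects
        reporting_rules: dict[str, str]    # how per-run outcomes aggregate into headline numbers
        drops_manifest: list[str] = field(default_factory=list)  # excluded task_ids, with the exclusion on record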

If this is right

  • Four partial public releases in tool safety, multi-agent systems, theorem proving, and search allow new analyses not in the original reports.
  • Re-grading across short-answer, code-generation, and tool-use tasks reveals score changes of up to 20.9 points from rule variations alone (a minimal sketch of such re-grading follows this list).
  • Model rankings can invert when the same rollout data is scored under different rules.
  • Open-source implementation in Ergon and published rollout-card exports support standardized evaluation across tool use, software engineering, web interaction, multi-agent coordination, safety, and search.
  • None of the 50 audited repositories report how many runs failed, errored, or were skipped alongside headline scores.
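
A minimal re-grading sketch, assuming one hypothetical pair of reporting rules (count errored runs as failures versus drop them from the denominator) and fabricated outcome counts chosen only to show how a ranking can flip; neither the rules nor the numbers come from the paper.

    # Re-grade the same preserved outcomes under two hypothetical reporting rules.

    def score_errors_as_failures(outcomes: list[str]) -> float:
        # Rule A: every preserved run stays in the denominator; errored runs count as failures.
        return outcomes.count("success") / len(outcomes)

    def score_drop_errors(outcomes: list[str]) -> float:
        # Rule B: errored runs are silently dropped before the success rate is computed.
        kept = [o for o in outcomes if o != "error"]
        return kept.count("success") / len(kept) if kept else 0.0

    # Fabricated outcome lists for two hypothetical models on the same task set.
    preserved = {
        "model_a": ["success"] * 55 + ["failure"] * 45,                   # reliable, never errors
        "model_b": ["success"] * 45 + ["failure"] * 25 + ["error"] * 30,  # errors often, strong when it finishes
    }

    for rule_name, rule in [("errors-as-failures", score_errors_as_failures),
                            ("errors-dropped", score_drop_errors)]:
        scores = {model: round(rule(outcomes), 3) for model, outcomes in preserved.items()}
        ranking = sorted(scores, key=scores.get, reverse=True)
        print(f"{rule_name}: {scores} -> ranking {ranking}")

    # errors-as-failures: model_a 0.55 vs model_b 0.45; errors-dropped: model_b 0.643 vs model_a 0.55.
    # Same preserved evidence, a different reporting rule, an inverted ranking.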

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting rollout cards could reduce the need to re-run experiments when evaluation criteria evolve.
  • Storing rollout records might enable automated detection of inconsistencies in reported metrics.
  • This standard could extend to other domains like robotics where trajectory data is central to performance claims.
  • Future benchmarks might require rollout card submissions as a condition for publication.

Load-bearing premise

The full rollout records contain enough information to reconstruct and re-evaluate any reported score under alternative rules without needing additional hidden state or implementation details from the original system.

What would settle it

Finding a preserved rollout record where applying the declared reporting rule does not reproduce the original reported score, or where a new rule cannot be applied due to missing information in the record.
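
A sketch of that test under assumed interfaces: a card that exposes its preserved record to a callable reporting rule. The check re-applies the declared rule and flags either a score mismatch or a rule that cannot be evaluated from the preserved information; the function name and error convention are illustrative, not part of the paper's released tooling.

    # Illustrative consistency check; 'card', 'declared_rule', and the KeyError
    # convention for missing fields are assumptions made for this sketch only.

    def check_card(card, declared_rule, reported_score: float, tol: float = 1e-6):
        """Return None if the preserved record reproduces the reported score, else a discrepancy."""
        try:
            recomputed = declared_rule(card)              # re-apply the declared reporting rule
        except KeyError as missing:
            # The second disproof condition above: a rule that cannot be applied
            # because the record lacks the information it needs.
            return f"rule not applicable: record is missing {missing}"
        if abs(recomputed - reported_score) > tol:
            # The first disproof condition: the declared rule fails to reproduce the reported score.
            return f"declared rule yields {recomputed:.3f}, report claims {reported_score:.3f}"
        return None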

Figures

Figures reproduced from arXiv: 2605.12131 by Charlie Masters, Stefano V. Albrecht, Ziyuan Liu.

Figure 1: Public rollout releases contain analyses not reported by their original benchmark scores.
Figure 2: One sample from a rollout card, rendered by a reference viewer over a ResearchRubrics
Original abstract

Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publicly publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that agent research reproducibility is undermined by reliance on reported scores without preserving rollout records, as the same behavior can yield different scores under varying selection or reporting rules. A structured audit of 50 repositories finds none report failure/error/skip counts and documents 37 cases where rules alter success rates, costs, or timings for fixed evidence. It proposes rollout cards as bundles preserving full records plus declarations of views, rules, and drop manifests. Validation occurs in two settings: re-analysis of partial public releases across domains and re-grading of preserved outputs showing up to 20.9 pp score shifts with occasional frontier-model ranking inversions. A reference implementation is released in Ergon with public benchmark exports.

Significance. If the empirical results hold, this provides a practical, record-based standard that could meaningfully advance reproducibility practices in agent research, where tasks involve tool use, web interaction, and multi-agent coordination. Strengths include the open release of code and data exports, multi-domain validation, and emphasis on falsifiable rollout preservation over self-referential scores. The work addresses a timely gap as agentic systems scale.

major comments (3)
  1. Abstract: The structured audit of 50 repositories and the identification of 37 cases provide no methodological details on repository selection, search protocol, case identification criteria, or the exact re-grading procedure used to obtain the 20.9 pp figure. This is load-bearing for the central quantitative claims.
  2. Validation settings (second setting): The re-grading experiment that produces score shifts and ranking inversions assumes preserved outputs contain every datum required by alternative scoring functions (full tool responses, error traces, exact kept/dropped subsets). The manuscript does not demonstrate that the proposed rollout-card format captures these elements without re-execution or private state, leaving the headline result vulnerable.
  3. Reproducibility proposal: It is not shown that rollout cards suffice to reconstruct any reported score under arbitrary rules for tool-use and web-interaction tasks, where hidden environment responses or selection metadata may be required; the audit documents omissions but does not close this gap.
minor comments (1)
  1. Abstract: The four partial public releases used in the first validation setting are not named or linked, reducing immediate verifiability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas where our manuscript requires greater methodological transparency and clarification. We address each major comment point by point below, indicating where revisions will be made to the next version of the paper.

Point-by-point responses
  1. Referee: Abstract: The structured audit of 50 repositories and the identification of 37 cases provide no methodological details on repository selection, search protocol, case identification criteria, or the exact re-grading procedure used to obtain the 20.9 pp figure. This is load-bearing for the central quantitative claims.

    Authors: We agree that the manuscript currently lacks the necessary methodological details on how the 50 repositories were selected, the search protocol employed, the criteria used to identify the 37 cases, and the precise steps in the re-grading procedure yielding the 20.9 percentage point shifts. These elements are indeed central to the quantitative claims. In the revised manuscript, we will expand the relevant sections (primarily Methods and the description of the audit) to include: explicit criteria for repository selection (e.g., popularity via GitHub stars, citations, and relevance to agent benchmarks), the search terms and sources used, the identification criteria for cases where reporting rules alter outcomes, and a step-by-step account of the re-grading process with examples of the rules applied and how preserved outputs were processed. revision: yes

  2. Referee: Validation settings (second setting): The re-grading experiment that produces score shifts and ranking inversions assumes preserved outputs contain every datum required by alternative scoring functions (full tool responses, error traces, exact kept/dropped subsets). The manuscript does not demonstrate that the proposed rollout-card format captures these elements without re-execution or private state, leaving the headline result vulnerable.

    Authors: The referee correctly notes that the re-grading results rest on an assumption about the completeness of preserved outputs. The rollout-card format is designed to bundle full rollout records (including tool responses, error traces, and drop manifests), but the manuscript does not provide an explicit demonstration or verification that this format captures all required data for alternative scoring functions across the evaluated tasks without needing re-execution or private state. We will revise the Validation section to include the detailed rollout-card schema, concrete examples from the short-answer, code-generation, and tool-use tasks illustrating how the preserved elements enable re-grading, and an expanded limitations discussion addressing cases (such as certain web interactions) where additional metadata may still be required. revision: partial

  3. Referee: Reproducibility proposal: It is not shown that rollout cards suffice to reconstruct any reported score under arbitrary rules for tool-use and web-interaction tasks, where hidden environment responses or selection metadata may be required; the audit documents omissions but does not close this gap.

    Authors: We do not claim that rollout cards enable reconstruction of scores under arbitrary rules in every conceivable case, especially for tool-use and web-interaction tasks that may depend on hidden environment responses or private selection metadata not present in the rollout record. The manuscript positions rollout cards as a standard that preserves complete records and declares the specific views, rules, and drop manifests used, thereby addressing the documented omissions in existing repositories and enabling re-analysis where the preserved data suffices. The two validation settings illustrate this in practice. We will add clarifying language in the Reproducibility proposal and Discussion sections to delineate the scope and limitations, but we do not intend to alter the core proposal or add a universal proof, as that exceeds what the current evidence supports. revision: no

Circularity Check

0 steps flagged

No significant circularity: claims rest on external audit and re-grading

full rationale

The paper's central results derive from a structured audit of 50 public repositories (finding zero reports of failure counts) and re-grading experiments on preserved benchmark outputs that produce the 20.9 pp score shifts. These quantities are computed directly from existing rollout data under alternative reporting rules; they do not reduce to any quantity defined by the rollout-card proposal itself. The introduction of rollout cards is a definitional standardization step whose validation uses independent evidence rather than self-referential fitting, self-citation chains, or ansatz smuggling. No equations, fitted parameters, or uniqueness theorems appear in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim introduces the new concept of rollout cards and relies on empirical observations from an audit and re-grading experiments; no free parameters, mathematical axioms, or invented physical entities are used.

invented entities (1)
  • rollout card · no independent evidence
    purpose: A publication bundle that preserves the rollout record and declares the views, reporting rules, and drops manifests behind reported scores
    New artifact defined by the paper to serve as the unit of reproducibility

pith-pipeline@v0.9.0 · 5586 in / 1398 out tokens · 29281 ms · 2026-05-13T05:55:10.056608+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
