Rollout Cards: A Reproducibility Standard for Agent Research
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 05:55 UTC · model grok-4.3
The pith
Rollout records should replace reported scores as the unit of reproducibility in agent research, enabling re-grading under different rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. Re-grading preserved benchmark outputs shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models.
What carries the argument
Rollout cards, which are publication bundles preserving the rollout record and declaring the views, reporting rules, and drops manifests used to compute reported scores.
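To make the bundle concrete, here is a minimal sketch of what a rollout card could carry. The class and field names (`RolloutCard`, `views`, `reporting_rule`, `drops_manifest`) are illustrative assumptions modeled on the description above, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RolloutCard:
    """Illustrative rollout-card bundle; field names are hypothetical."""
    rollout_records: list[dict]   # full per-run records: outputs, tool calls, statuses
    views: list[str]              # which parts of each rollout the grader may see
    reporting_rule: str           # named convention used to compute the headline score
    drops_manifest: list[dict] = field(default_factory=list)  # excluded runs, with reasons

card = RolloutCard(
    rollout_records=[{"run_id": 1, "status": "success", "output": "42"}],
    views=["final_output"],
    reporting_rule="errors_count_as_failure",
)
```

The point of declaring `reporting_rule` and `drops_manifest` explicitly is that any reader can re-apply, or swap out, the convention behind a published number.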
If this is right
- Four partial public releases in tool safety, multi-agent systems, theorem proving, and search allow new analyses not in the original reports.
- Re-grading across short-answer, code-generation, and tool-use tasks reveals score changes of up to 20.9 points from rule variations alone.
- Model rankings can invert when the same rollout data is scored under different rules.
- Open-source implementation in Ergon and published rollout-card exports support standardized evaluation across tool use, software engineering, web interaction, multi-agent coordination, safety, and search.
- None of the 50 audited repositories report how many runs failed, errored, or were skipped alongside headline scores.
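The rule sensitivity described above can be illustrated with a toy re-grading pass over one fixed set of preserved runs. Both rules below are hypothetical stand-ins for the kinds of conventions the audit documents; the paper's 20.9-point figure comes from its real benchmarks, not from this toy.

```python
# Toy re-grading: the same preserved rollouts scored under two reporting rules.
rollouts = [
    {"run_id": 1, "status": "success"},
    {"run_id": 2, "status": "failure"},
    {"run_id": 3, "status": "error"},    # harness crash, task never graded
    {"run_id": 4, "status": "error"},
    {"run_id": 5, "status": "success"},
]

def score_errors_as_failures(runs):
    # Rule A: every run counts; errored runs score as failures.
    return sum(r["status"] == "success" for r in runs) / len(runs)

def score_errors_dropped(runs):
    # Rule B: errored runs are silently dropped from the denominator.
    kept = [r for r in runs if r["status"] != "error"]
    return sum(r["status"] == "success" for r in kept) / len(kept)

print(score_errors_as_failures(rollouts))  # Rule A: 2 successes / 5 runs = 0.4
print(score_errors_dropped(rollouts))      # Rule B: 2 successes / 3 kept runs, ~0.667
```

Identical evidence, two defensible conventions, a gap of roughly 27 percentage points: this is the mechanism by which rule variations alone move reported scores and can invert rankings.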
Where Pith is reading between the lines
- Adopting rollout cards could reduce the need to re-run experiments when evaluation criteria evolve.
- Storing rollout records might enable automated detection of inconsistencies in reported metrics.
- This standard could extend to other domains like robotics where trajectory data is central to performance claims.
- Future benchmarks might require rollout card submissions as a condition for publication.
Load-bearing premise
The full rollout records contain enough information to reconstruct and re-evaluate any reported score under alternative rules without needing additional hidden state or implementation details from the original system.
What would settle it
Finding a preserved rollout record where applying the declared reporting rule does not reproduce the original reported score, or where a new rule cannot be applied due to missing information in the record.
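That falsifier suggests a mechanical check: replay the declared rule over the preserved record and compare against the reported score. A minimal sketch, assuming a simple rule registry and a numeric tolerance; all names here are hypothetical, not the paper's interface.

```python
def verify_card(records, declared_rule, reported_score, rules, tol=1e-6):
    """Return True iff replaying the declared rule reproduces the reported score.

    `rules` maps rule names to scoring functions over the preserved records.
    A missing rule name models the second failure mode: the record lacks the
    information needed to apply that rule at all.
    """
    if declared_rule not in rules:
        return False  # rule cannot be applied to this record
    return abs(rules[declared_rule](records) - reported_score) <= tol

rules = {"mean_success": lambda rs: sum(r["ok"] for r in rs) / len(rs)}
records = [{"ok": True}, {"ok": False}, {"ok": True}, {"ok": True}]

assert verify_card(records, "mean_success", 0.75, rules)        # score reproduces
assert not verify_card(records, "mean_success", 0.9, rules)     # mismatch: falsifier found
assert not verify_card(records, "unknown_rule", 0.75, rules)    # rule not applicable
```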
Original abstract
Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publicly publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that agent research reproducibility is undermined by reliance on reported scores without preserving rollout records, as the same behavior can yield different scores under varying selection or reporting rules. A structured audit of 50 repositories finds none report failure/error/skip counts and documents 37 cases where rules alter success rates, costs, or timings for fixed evidence. It proposes rollout cards as bundles preserving full records plus declarations of views, rules, and drop manifests. Validation occurs in two settings: re-analysis of partial public releases across domains and re-grading of preserved outputs showing up to 20.9 pp score shifts with occasional frontier-model ranking inversions. A reference implementation is released in Ergon with public benchmark exports.
Significance. If the empirical results hold, this provides a practical, record-based standard that could meaningfully advance reproducibility practices in agent research, where tasks involve tool use, web interaction, and multi-agent coordination. Strengths include the open release of code and data exports, multi-domain validation, and emphasis on falsifiable rollout preservation over self-referential scores. The work addresses a timely gap as agentic systems scale.
major comments (3)
- Abstract: The structured audit of 50 repositories and the identification of 37 cases provide no methodological details on repository selection, search protocol, case identification criteria, or the exact re-grading procedure used to obtain the 20.9 pp figure. This is load-bearing for the central quantitative claims.
- Validation settings (second setting): The re-grading experiment that produces score shifts and ranking inversions assumes preserved outputs contain every datum required by alternative scoring functions (full tool responses, error traces, exact kept/dropped subsets). The manuscript does not demonstrate that the proposed rollout-card format captures these elements without re-execution or private state, leaving the headline result vulnerable.
- Reproducibility proposal: It is not shown that rollout cards suffice to reconstruct any reported score under arbitrary rules for tool-use and web-interaction tasks, where hidden environment responses or selection metadata may be required; the audit documents omissions but does not close this gap.
minor comments (1)
- Abstract: The four partial public releases used in the first validation setting are not named or linked, reducing immediate verifiability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where our manuscript requires greater methodological transparency and clarification. We address each major comment point by point below, indicating where revisions will be made to the next version of the paper.
Point-by-point responses
- Referee: Abstract: The structured audit of 50 repositories and the identification of 37 cases provide no methodological details on repository selection, search protocol, case identification criteria, or the exact re-grading procedure used to obtain the 20.9 pp figure. This is load-bearing for the central quantitative claims.
  Authors: We agree that the manuscript currently lacks the necessary methodological details on how the 50 repositories were selected, the search protocol employed, the criteria used to identify the 37 cases, and the precise steps in the re-grading procedure yielding the 20.9 percentage point shifts. These elements are indeed central to the quantitative claims. In the revised manuscript, we will expand the relevant sections (primarily Methods and the description of the audit) to include: explicit criteria for repository selection (e.g., popularity via GitHub stars, citations, and relevance to agent benchmarks), the search terms and sources used, the identification criteria for cases where reporting rules alter outcomes, and a step-by-step account of the re-grading process with examples of the rules applied and how preserved outputs were processed.
  Revision: yes
- Referee: Validation settings (second setting): The re-grading experiment that produces score shifts and ranking inversions assumes preserved outputs contain every datum required by alternative scoring functions (full tool responses, error traces, exact kept/dropped subsets). The manuscript does not demonstrate that the proposed rollout-card format captures these elements without re-execution or private state, leaving the headline result vulnerable.
  Authors: The referee correctly notes that the re-grading results rest on an assumption about the completeness of preserved outputs. The rollout-card format is designed to bundle full rollout records (including tool responses, error traces, and drop manifests), but the manuscript does not provide an explicit demonstration or verification that this format captures all required data for alternative scoring functions across the evaluated tasks without needing re-execution or private state. We will revise the Validation section to include the detailed rollout-card schema, concrete examples from the short-answer, code-generation, and tool-use tasks illustrating how the preserved elements enable re-grading, and an expanded limitations discussion addressing cases (such as certain web interactions) where additional metadata may still be required.
  Revision: partial
- Referee: Reproducibility proposal: It is not shown that rollout cards suffice to reconstruct any reported score under arbitrary rules for tool-use and web-interaction tasks, where hidden environment responses or selection metadata may be required; the audit documents omissions but does not close this gap.
  Authors: We do not claim that rollout cards enable reconstruction of scores under arbitrary rules in every conceivable case, especially for tool-use and web-interaction tasks that may depend on hidden environment responses or private selection metadata not present in the rollout record. The manuscript positions rollout cards as a standard that preserves complete records and declares the specific views, rules, and drop manifests used, thereby addressing the documented omissions in existing repositories and enabling re-analysis where the preserved data suffices. The two validation settings illustrate this in practice. We will add clarifying language in the Reproducibility proposal and Discussion sections to delineate the scope and limitations, but we do not intend to alter the core proposal or add a universal proof, as that exceeds what the current evidence supports.
  Revision: no
Circularity Check
No significant circularity: claims rest on external audit and re-grading
full rationale
The paper's central results derive from a structured audit of 50 public repositories (finding zero reports of failure counts) and re-grading experiments on preserved benchmark outputs that produce the 20.9 pp score shifts. These quantities are computed directly from existing rollout data under alternative reporting rules; they do not reduce to any quantity defined by the rollout-card proposal itself. The introduction of rollout cards is a definitional standardization step whose validation uses independent evidence rather than self-referential fitting, self-citation chains, or ansatz smuggling. No equations, fitted parameters, or uniqueness theorems appear in the derivation.
Axiom & Free-Parameter Ledger
invented entities (1)
- rollout card: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear
  Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We treat rollout records, not reported scores, as the unit of reproducibility for agent research."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
  Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "changing only the reporting rule can change reported scores by 20.9 absolute percentage points"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.