AgentSim: A Platform for Verifiable Agent-Trace Simulation
Pith reviewed 2026-05-07 12:34 UTC · model grok-4.3
The pith
AgentSim generates verifiable stepwise reasoning traces for RAG agents over document collections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentSim is an open-source platform that simulates RAG agents to generate verifiable traces of their reasoning over any document collection, using Corpus-Aware Seeding for trace diversity and Active Validation, which combines multi-model checks with human review of disagreements. The platform underpins the released Agent-Trace Corpus of more than 103,000 grounded reasoning steps across three IR benchmarks, with a 100% grounding rate on substantive answers, plus a comparative behavioral analysis of models.
What carries the argument
The combination of Corpus-Aware Seeding, which drives the agent to explore the document collection broadly, and the Active Validation pipeline, which routes model disagreements to human reviewers.
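The paper's abstract does not spell out how Corpus-Aware Seeding works, so the following is a hypothetical sketch of one way such a policy could spread seed documents across a corpus: the `cluster_of` mapping (an assumption, not from the paper) assigns each document to a topic cluster, and seeds are drawn round-robin so every cluster is represented before any repeats.

```python
import random
from collections import defaultdict

def corpus_aware_seed(doc_ids, cluster_of, n_seeds, rng=None):
    """Pick seed documents round-robin across topic clusters so every
    region of the corpus is represented before any cluster repeats.
    (Illustrative sketch; not the paper's actual algorithm.)"""
    rng = rng or random.Random(0)
    buckets = defaultdict(list)
    for doc in doc_ids:
        buckets[cluster_of[doc]].append(doc)
    for docs in buckets.values():
        rng.shuffle(docs)  # vary which document represents a cluster
    seeds, clusters, i = [], list(buckets), 0
    while len(seeds) < n_seeds and any(buckets.values()):
        cluster = clusters[i % len(clusters)]
        if buckets[cluster]:
            seeds.append(buckets[cluster].pop())
        i += 1
    return seeds
```

Any breadth-first policy of this shape would serve the stated goal: preventing the simulated agent from over-sampling one dense region of the document set.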
Load-bearing premise
The multi-model validation and human review process reliably identifies and corrects any ungrounded steps without missing subtle grounding failures or introducing new errors.
What would settle it
Manually inspecting a random sample of 100 traces from the ATC to check if every substantive claim is directly supported by the cited document sections.
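Such an audit could be partially mechanized before the manual pass. The sketch below assumes a hypothetical trace schema (`steps`, `substantive`, `claim`, `cited_passage` fields are not from the paper) and uses naive substring containment as a crude pre-filter for the human grounding judgment.

```python
import random

def audit_grounding(traces, n=100, rng=None):
    """Spot-check a random sample of traces: flag any substantive step
    whose claim text is not found verbatim in its cited passage.
    Substring containment is a crude stand-in for human judgment."""
    rng = rng or random.Random(0)
    sample = rng.sample(traces, min(n, len(traces)))
    failures = []
    for trace in sample:
        for step in trace["steps"]:
            if not step.get("substantive"):
                continue  # skip procedural steps (planning, formatting)
            if step["claim"].lower() not in step["cited_passage"].lower():
                failures.append((trace["id"], step["claim"]))
    return failures
```

Steps the pre-filter flags would then go to a human inspector; steps it passes still need spot-checking, since paraphrased claims defeat substring matching.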
Original abstract
Training trustworthy agentic LLMs requires data that shows the grounded reasoning process, not just the final answer. Existing datasets fall short: question-answering data is outcome-only, chain-of-thought data is not tied to specific documents, and web-agent datasets track interface actions rather than the core retrieval and synthesis steps of a RAG workflow. We introduce AgentSim, an open-source platform for simulating RAG agents. It generates verifiable, stepwise traces of agent reasoning over any document collection. AgentSim uses a policy to ensure the agent widely explores the document set. It combines a multi-model validation pipeline with an active human-in-the-loop process. This approach focuses human effort on difficult steps where models disagree. Using AgentSim, we construct and release the Agent-Trace Corpus (ATC), a large collection of grounded reasoning trajectories spanning three established IR benchmarks. We make three contributions: (1) the AgentSim platform with two mechanisms, Corpus-Aware Seeding and Active Validation, that improve trace diversity and quality; (2) the Agent-Trace Corpus (ATC), over 103,000 verifiable reasoning steps spanning three IR benchmarks, with 100% grounding rate on substantive answers; and (3) a comparative behavioral analysis revealing systematic differences in how state-of-the-art models approach information seeking. Platform, toolkit, and corpus are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AgentSim, an open-source platform for simulating RAG agents to produce verifiable, stepwise reasoning traces over document collections using Corpus-Aware Seeding and Active Validation mechanisms. It releases the Agent-Trace Corpus (ATC) comprising over 103,000 reasoning steps across three IR benchmarks with a claimed 100% grounding rate on substantive answers, and includes a comparative analysis of behavioral differences among state-of-the-art models in information-seeking tasks.
Significance. If the validation pipeline reliably enforces grounding, the work supplies a much-needed resource for training trustworthy agentic LLMs by providing document-tied reasoning trajectories rather than outcome-only or untethered CoT data. The public release of the platform, toolkit, and corpus is a clear strength that supports reproducibility and downstream research on verifiable agents.
major comments (2)
- [Abstract] The central claim of a '100% grounding rate on substantive answers' is load-bearing for both the ATC contribution and the behavioral analysis, yet the abstract gives neither an explicit definition of 'substantive answer' versus non-substantive steps nor the precise disagreement-resolution rule (unanimity, a majority threshold, or a human-override protocol) used in the multi-model validation pipeline with selective human-in-the-loop review.
- [Abstract] Validation pipeline description: without details on how grounding is verified at every substantive step or how model disagreements are resolved, it remains unclear whether the 100% rate results from exhaustive enforcement or from post-hoc filtering of traces; this directly affects the corpus's suitability for training agents on verifiably grounded trajectories.
minor comments (2)
- [Abstract] The abstract refers to 'three established IR benchmarks' without naming them; this should be stated explicitly for immediate clarity.
- [Abstract] The behavioral analysis is listed as a contribution but the abstract gives no quantitative summary or reference to specific figures/tables showing the systematic differences; adding a one-sentence overview would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for greater clarity in the abstract regarding key definitions and the validation process. We have revised the abstract to incorporate explicit definitions and a concise description of the pipeline, while preserving the manuscript's core claims.
Point-by-point responses
- Referee: [Abstract] The central claim of a '100% grounding rate on substantive answers' is load-bearing for both the ATC contribution and the behavioral analysis, yet the abstract gives neither an explicit definition of 'substantive answer' versus non-substantive steps nor the precise disagreement-resolution rule (unanimity, a majority threshold, or a human-override protocol) used in the multi-model validation pipeline with selective human-in-the-loop review.
Authors: We agree that the abstract requires these clarifications to support the central claim. In the revised manuscript, we define a substantive answer as any reasoning step that directly retrieves or synthesizes information from the source documents to advance the query resolution (as opposed to non-substantive procedural steps such as initial planning or final output formatting). The disagreement-resolution rule employs a majority threshold across the three validating models for automatic acceptance, with human override applied only in cases of unanimous disagreement or when the step is flagged as ambiguous during active validation. These details, drawn from the full description in Section 3.2, have been added to the abstract. revision: yes
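The resolution rule is only summarized in the rebuttal, so the following is one possible reading of it in code: each of the three validating models casts a boolean grounding verdict, a strict majority auto-accepts, and anything else (including an ambiguity flag or unanimous rejection) is routed to a human reviewer.

```python
def resolve_step(votes, flagged_ambiguous=False):
    """One reading of the disagreement-resolution rule sketched in the
    rebuttal. `votes` holds one boolean grounding verdict per validating
    model; a strict majority auto-accepts, while an ambiguity flag or
    anything short of a majority goes to a human reviewer."""
    if flagged_ambiguous:
        return "human_review"
    if sum(votes) * 2 > len(votes):  # strict majority approves
        return "accept"
    return "human_review"
```

Note that the rebuttal leaves one case underspecified: whether a 2-of-3 rejection is auto-rejected or human-reviewed; the sketch conservatively routes it to a human.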
- Referee: [Abstract] Validation pipeline description: without details on how grounding is verified at every substantive step or how model disagreements are resolved, it remains unclear whether the 100% rate results from exhaustive enforcement or from post-hoc filtering of traces; this directly affects the corpus's suitability for training agents on verifiably grounded trajectories.
Authors: The full manuscript (Section 3.3) specifies that grounding is verified at every substantive step through independent cross-checks by multiple models against the cited documents in the collection. Model disagreements are resolved exclusively via the active human-in-the-loop mechanism, which routes only disputed steps for human review while automatically accepting unanimous or majority-supported steps. The 100% grounding rate results from this exhaustive per-step enforcement applied to all generated traces; no post-generation filtering of validated traces occurs. We have expanded the abstract to briefly describe this verification and resolution process, confirming the corpus contains only exhaustively validated trajectories. revision: yes
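The per-step enforcement described here can be sketched as a gate on trace release; the callable signatures below are assumptions for illustration, not the platform's actual API.

```python
def validate_trace(trace, validators, human_review):
    """Per-step enforcement sketch: a trace enters the corpus only if
    every substantive step passes validation, so no post-hoc filtering
    of finished traces is needed. `validators` are callables returning
    a boolean grounding verdict; `human_review` adjudicates disputes."""
    for step in trace["steps"]:
        if not step.get("substantive"):
            continue
        votes = [check(step["claim"], step["cited_passage"])
                 for check in validators]
        if sum(votes) * 2 > len(votes):
            continue                  # majority-supported: auto-accept
        if not human_review(step):    # disputed: human adjudicates
            return False              # step fails, trace is not released
    return True
```

The key property this structure illustrates is the rebuttal's claim: the 100% grounding rate is a construction-time invariant of each released trace, not a selection effect over a larger pool.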
Circularity Check
No circularity: systems platform and corpus release with independent validation claims
full rationale
The paper introduces AgentSim as an open-source platform for simulating RAG agents and releases the ATC corpus of reasoning trajectories. Its claims center on the platform's mechanisms (Corpus-Aware Seeding and Active Validation) and the resulting corpus properties, including a reported 100% grounding rate achieved via multi-model validation plus selective human review. No equations, fitted parameters, predictions, or self-citations appear in the derivation chain; the outputs are defined directly by the described construction process without reduction to prior inputs or self-referential definitions. This is a standard self-contained systems and data-release contribution.
Reference graph
Works this paper leans on
- [1] Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, et al. 2025. Phi-4-reasoning technical report.
- [2] Leif Azzopardi, Timo Breuer, Björn Engelmann, Christin Kreutz, Sean MacAvaney, David Maxwell, Andrew Parry, Adam Roegiest, Xi Wang, and Saber Zerhoudi. 2024. SimIIR 3: A framework for the simulation of interactive and conversational information retrieval. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
- [3] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. MS MARCO: A human generated machine reading comprehension dataset.
- [4] Krisztian Balog, Nolwenn Bernard, Saber Zerhoudi, and ChengXiang Zhai. 2025. Theory and Toolkits for User Simulation in the Era of Generative AI: User Modeling, Synthetic Data Generation, and System Evaluation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 4138–4141.
- [5] Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. 2018. Elastic ChatNoir: Search engine for the ClueWeb and the Common Crawl. In European Conference on Information Retrieval. Springer, 820–824.
- [6] Alexander Bondarenko, Magdalena Wolska, Stefan Heindorf, Lukas Blübaum, Axel-Cyrille Ngonga Ngomo, Benno Stein, Pavel Braslavski, Matthias Hagen, and Martin Potthast. 2022. CausalQA: A benchmark for causal question answering. In Proceedings of the 29th International Conference on Computational Linguistics, 3296–3308.
- [8] Rafael Teixeira De Lima, Shubham Gupta, Cesar Berrospi Ramis, Lokesh Mishra, Michele Dolfi, Peter Staar, and Panagiotis Vagenas. 2025. Know your RAG: Dataset taxonomy and generation strategies for evaluating RAG systems. 39–57.
- [9] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, 28091–28114.
- [10] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. 2025. OpenThoughts: Data recipes for reasoning models.
- [11] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, 453–466.
- [12] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
- [13] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step.
- [14] Pleias. 2025. SYNTH: the new data frontier. https://pleias.fr/blog/blogsynth-the-new-data-frontier. Accessed: 2025-11-12.
- [15] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. ALFWorld: Aligning text and embodied environments for interactive learning.
- [17] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- [18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. 24824–24837.
- [19] Yexin Wu, Zhuosheng Zhang, and Hai Zhao. 2024. Mitigating misleading chain-of-thought reasoning with selective filtering. 11325–11340.
- [20] Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. 2024. Evaluation of retrieval-augmented generation: A survey. 102–120.
- [22] Saber Zerhoudi and Michael Granitzer. 2026. Beyond the Click: A Framework for Inferring Cognitive Traces in Search. In European Conference on Information Retrieval. Springer, 626–640.
- [23] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. 2023. WebArena: A realistic web environment for building autonomous agents.