AgentSim: A Platform for Verifiable Agent-Trace Simulation
Pith reviewed 2026-05-07 12:34 UTC · model grok-4.3
The pith
AgentSim generates verifiable stepwise reasoning traces for RAG agents over document collections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentSim is an open-source platform that simulates RAG agents to generate verifiable traces of their reasoning over any document collection, using Corpus-Aware Seeding for trace diversity and Active Validation, which combines multi-model checks with human review of disagreements. The platform underpins the released Agent-Trace Corpus of more than 103,000 grounded reasoning steps across three IR benchmarks, with a 100% grounding rate on substantive answers, plus a comparative behavioral analysis of models.
What carries the argument
The combination of Corpus-Aware Seeding, which drives the agent to explore the document collection broadly, and the Active Validation pipeline, which routes model disagreements to human reviewers.
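The paper's abstract does not spell out how Corpus-Aware Seeding works, so the following is a hypothetical sketch of one way such a policy could spread seed documents across a corpus: the `cluster_of` mapping (an assumption, not from the paper) assigns each document to a topic cluster, and seeds are drawn round-robin so every cluster is represented before any repeats.

```python
import random
from collections import defaultdict

def corpus_aware_seed(doc_ids, cluster_of, n_seeds, rng=None):
    """Pick seed documents round-robin across topic clusters so every
    region of the corpus is represented before any cluster repeats.
    (Illustrative sketch; not the paper's actual algorithm.)"""
    rng = rng or random.Random(0)
    buckets = defaultdict(list)
    for doc in doc_ids:
        buckets[cluster_of[doc]].append(doc)
    for docs in buckets.values():
        rng.shuffle(docs)  # vary which document represents a cluster
    seeds, clusters, i = [], list(buckets), 0
    while len(seeds) < n_seeds and any(buckets.values()):
        cluster = clusters[i % len(clusters)]
        if buckets[cluster]:
            seeds.append(buckets[cluster].pop())
        i += 1
    return seeds
```

Any breadth-first policy of this shape would serve the stated goal: preventing the simulated agent from over-sampling one dense region of the document set.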
Load-bearing premise
The multi-model validation and human review process reliably identifies and corrects any ungrounded steps without missing subtle grounding failures or introducing new errors.
What would settle it
Manually inspecting a random sample of 100 traces from the ATC to check if every substantive claim is directly supported by the cited document sections.
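Such an audit could be partially mechanized before the manual pass. The sketch below assumes a hypothetical trace schema (`steps`, `substantive`, `claim`, `cited_passage` fields are not from the paper) and uses naive substring containment as a crude pre-filter for the human grounding judgment.

```python
import random

def audit_grounding(traces, n=100, rng=None):
    """Spot-check a random sample of traces: flag any substantive step
    whose claim text is not found verbatim in its cited passage.
    Substring containment is a crude stand-in for human judgment."""
    rng = rng or random.Random(0)
    sample = rng.sample(traces, min(n, len(traces)))
    failures = []
    for trace in sample:
        for step in trace["steps"]:
            if not step.get("substantive"):
                continue  # skip procedural steps (planning, formatting)
            if step["claim"].lower() not in step["cited_passage"].lower():
                failures.append((trace["id"], step["claim"]))
    return failures
```

Steps the pre-filter flags would then go to a human inspector; steps it passes still need spot-checking, since paraphrased claims defeat substring matching.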
Original abstract
Training trustworthy agentic LLMs requires data that shows the grounded reasoning process, not just the final answer. Existing datasets fall short: question-answering data is outcome-only, chain-of-thought data is not tied to specific documents, and web-agent datasets track interface actions rather than the core retrieval and synthesis steps of a RAG workflow. We introduce AgentSim, an open-source platform for simulating RAG agents. It generates verifiable, stepwise traces of agent reasoning over any document collection. AgentSim uses a policy to ensure the agent widely explores the document set. It combines a multi-model validation pipeline with an active human-in-the-loop process. This approach focuses human effort on difficult steps where models disagree. Using AgentSim, we construct and release the Agent-Trace Corpus (ATC), a large collection of grounded reasoning trajectories spanning three established IR benchmarks. We make three contributions: (1) the AgentSim platform with two mechanisms, Corpus-Aware Seeding and Active Validation, that improve trace diversity and quality; (2) the Agent-Trace Corpus (ATC), over 103,000 verifiable reasoning steps spanning three IR benchmarks, with 100% grounding rate on substantive answers; and (3) a comparative behavioral analysis revealing systematic differences in how state-of-the-art models approach information seeking. Platform, toolkit, and corpus are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AgentSim, an open-source platform for simulating RAG agents to produce verifiable, stepwise reasoning traces over document collections using Corpus-Aware Seeding and Active Validation mechanisms. It releases the Agent-Trace Corpus (ATC) comprising over 103,000 reasoning steps across three IR benchmarks with a claimed 100% grounding rate on substantive answers, and includes a comparative analysis of behavioral differences among state-of-the-art models in information-seeking tasks.
Significance. If the validation pipeline reliably enforces grounding, the work supplies a much-needed resource for training trustworthy agentic LLMs by providing document-tied reasoning trajectories rather than outcome-only or untethered CoT data. The public release of the platform, toolkit, and corpus is a clear strength that supports reproducibility and downstream research on verifiable agents.
major comments (2)
- [Abstract] The central claim of a '100% grounding rate on substantive answers' is load-bearing for both the ATC contribution and the behavioral analysis, yet the abstract gives neither an explicit definition of 'substantive answer' versus non-substantive steps nor the precise disagreement-resolution rule (unanimity, a majority threshold, or a human-override protocol) used in the multi-model validation pipeline with selective human-in-the-loop review.
- [Abstract] Validation pipeline description: without details on how grounding is verified at every substantive step or how model disagreements are resolved, it remains unclear whether the 100% rate results from exhaustive enforcement or from post-hoc filtering of traces; this directly affects the corpus's suitability for training agents on verifiably grounded trajectories.
minor comments (2)
- [Abstract] The abstract refers to 'three established IR benchmarks' without naming them; this should be stated explicitly for immediate clarity.
- [Abstract] The behavioral analysis is listed as a contribution but the abstract gives no quantitative summary or reference to specific figures/tables showing the systematic differences; adding a one-sentence overview would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for greater clarity in the abstract regarding key definitions and the validation process. We have revised the abstract to incorporate explicit definitions and a concise description of the pipeline, while preserving the manuscript's core claims.
Point-by-point responses
- Referee: [Abstract] The central claim of a '100% grounding rate on substantive answers' is load-bearing for both the ATC contribution and the behavioral analysis, yet the abstract gives neither an explicit definition of 'substantive answer' versus non-substantive steps nor the precise disagreement-resolution rule (unanimity, a majority threshold, or a human-override protocol) used in the multi-model validation pipeline with selective human-in-the-loop review.
Authors: We agree that the abstract requires these clarifications to support the central claim. In the revised manuscript, we define a substantive answer as any reasoning step that directly retrieves or synthesizes information from the source documents to advance the query resolution (as opposed to non-substantive procedural steps such as initial planning or final output formatting). The disagreement-resolution rule employs a majority threshold across the three validating models for automatic acceptance, with human override applied only in cases of unanimous disagreement or when the step is flagged as ambiguous during active validation. These details, drawn from the full description in Section 3.2, have been added to the abstract. revision: yes
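The resolution rule is only summarized in the rebuttal, so the following is one possible reading of it in code: each of the three validating models casts a boolean grounding verdict, a strict majority auto-accepts, and anything else (including an ambiguity flag or unanimous rejection) is routed to a human reviewer.

```python
def resolve_step(votes, flagged_ambiguous=False):
    """One reading of the disagreement-resolution rule sketched in the
    rebuttal. `votes` holds one boolean grounding verdict per validating
    model; a strict majority auto-accepts, while an ambiguity flag or
    anything short of a majority goes to a human reviewer."""
    if flagged_ambiguous:
        return "human_review"
    if sum(votes) * 2 > len(votes):  # strict majority approves
        return "accept"
    return "human_review"
```

Note that the rebuttal leaves one case underspecified: whether a 2-of-3 rejection is auto-rejected or human-reviewed; the sketch conservatively routes it to a human.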
- Referee: [Abstract] Validation pipeline description: without details on how grounding is verified at every substantive step or how model disagreements are resolved, it remains unclear whether the 100% rate results from exhaustive enforcement or from post-hoc filtering of traces; this directly affects the corpus's suitability for training agents on verifiably grounded trajectories.
Authors: The full manuscript (Section 3.3) specifies that grounding is verified at every substantive step through independent cross-checks by multiple models against the cited documents in the collection. Model disagreements are resolved exclusively via the active human-in-the-loop mechanism, which routes only disputed steps for human review while automatically accepting unanimous or majority-supported steps. The 100% grounding rate results from this exhaustive per-step enforcement applied to all generated traces; no post-generation filtering of validated traces occurs. We have expanded the abstract to briefly describe this verification and resolution process, confirming the corpus contains only exhaustively validated trajectories. revision: yes
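The per-step enforcement described here can be sketched as a gate on trace release; the callable signatures below are assumptions for illustration, not the platform's actual API.

```python
def validate_trace(trace, validators, human_review):
    """Per-step enforcement sketch: a trace enters the corpus only if
    every substantive step passes validation, so no post-hoc filtering
    of finished traces is needed. `validators` are callables returning
    a boolean grounding verdict; `human_review` adjudicates disputes."""
    for step in trace["steps"]:
        if not step.get("substantive"):
            continue
        votes = [check(step["claim"], step["cited_passage"])
                 for check in validators]
        if sum(votes) * 2 > len(votes):
            continue                  # majority-supported: auto-accept
        if not human_review(step):    # disputed: human adjudicates
            return False              # step fails, trace is not released
    return True
```

The key property this structure illustrates is the rebuttal's claim: the 100% grounding rate is a construction-time invariant of each released trace, not a selection effect over a larger pool.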
Circularity Check
No circularity: systems platform and corpus release with independent validation claims
full rationale
The paper introduces AgentSim as an open-source platform for simulating RAG agents and releases the ATC corpus of reasoning trajectories. Its claims center on the platform's mechanisms (Corpus-Aware Seeding and Active Validation) and the resulting corpus properties, including a reported 100% grounding rate achieved via multi-model validation plus selective human review. No equations, fitted parameters, predictions, or self-citations appear in the derivation chain; the outputs are defined directly by the described construction process without reduction to prior inputs or self-referential definitions. This is a standard self-contained systems and data-release contribution.
Reference graph
Works this paper leans on
- [1] Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, et al. 2025. Phi-4-reasoning technical report.
- [2] Leif Azzopardi, Timo Breuer, Björn Engelmann, Christin Kreutz, Sean MacAvaney, David Maxwell, Andrew Parry, Adam Roegiest, Xi Wang, and Saber Zerhoudi. 2024. SimIIR 3: A framework for the simulation of interactive and conversational information retrieval. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
- [3] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. MS MARCO: A human generated machine reading comprehension dataset.
- [4] Krisztian Balog, Nolwenn Bernard, Saber Zerhoudi, and ChengXiang Zhai. 2025. Theory and Toolkits for User Simulation in the Era of Generative AI: User Modeling, Synthetic Data Generation, and System Evaluation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 4138–4141.
- [5] Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. 2018. Elastic ChatNoir: Search engine for the ClueWeb and the Common Crawl. In European Conference on Information Retrieval. Springer, 820–824.
- [6] Alexander Bondarenko, Magdalena Wolska, Stefan Heindorf, Lukas Blübaum, Axel-Cyrille Ngonga Ngomo, Benno Stein, Pavel Braslavski, Matthias Hagen, and Martin Potthast. 2022. CausalQA: A benchmark for causal question answering. In Proceedings of the 29th International Conference on Computational Linguistics, 3296–3308.
- [8] Rafael Teixeira De Lima, Shubham Gupta, Cesar Berrospi Ramis, Lokesh Mishra, Michele Dolfi, Peter Staar, and Panagiotis Vagenas. 2025. Know your RAG: Dataset taxonomy and generation strategies for evaluating RAG systems. 39–57.
- [9] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, 28091–28114.
- [10] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. 2025. OpenThoughts: Data recipes for reasoning models.
- [11] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, 453–466.
- [12] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
- [13] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step.
- [14] Pleias. 2025. SYNTH: the new data frontier. https://pleias.fr/blog/blogsynth-the-new-data-frontier. Accessed: 2025-11-12.
- [15] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. ALFWorld: Aligning text and embodied environments for interactive learning.
- [17] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- [18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. 24824–24837.
- [19] Yexin Wu, Zhuosheng Zhang, and Hai Zhao. 2024. Mitigating misleading chain-of-thought reasoning with selective filtering. 11325–11340.
- [20] Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. 2024. Evaluation of retrieval-augmented generation: A survey. 102–120.
- [22] Saber Zerhoudi and Michael Granitzer. 2026. Beyond the Click: A Framework for Inferring Cognitive Traces in Search. In European Conference on Information Retrieval. Springer, 626–640.
- [23] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. 2023. WebArena: A realistic web environment for building autonomous agents.