WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

Hengrui Gu; Kaixiong Zhou; Xiaotian Han

arxiv: 2606.02908 · v1 · pith:EJWNYHTDnew · submitted 2026-06-01 · 💻 cs.CL · cs.AI

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

Hengrui Gu , Xiaotian Han , Kaixiong Zhou This is my paper

Pith reviewed 2026-06-28 14:16 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords multi-turn agentstrajectory synthesiswrite-read intensiveevidence burdenagent trainingSFT datauser intent inferencetool use

0 comments

The pith

WRIT synthesizes trajectories stressing both write count and evidence burden so a 4B model beats GPT-5.1 no-think on multi-turn agent benchmarks with 2K examples and lower token use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-turn agents must infer incomplete intents, gather missing details via dialogue and tools, then execute actions. Existing synthesis pipelines mainly lengthen tasks by chaining writes, but the paper argues that individual write decisions can themselves demand heavy prior reading and comparison. WRIT therefore generates tasks high on both the number of writes and the evidence load per write, diversifies user instructions, and runs executable simulations to produce full trajectories. Training on the resulting 2K trajectories internalizes evidence-grounded behavior so that a compact model outperforms a larger baseline while spending fewer tokens at inference time.

Core claim

WRIT first creates write-intensive and read-heavy tasks, then diversifies user behavior instructions, and finally simulates agent-user interactions in an executable environment to yield complete training trajectories. The resulting data trains agents for both longer sequential execution and robust decision making under high information load. A 4B model trained on only 2K such trajectories outperforms GPT-5.1 no-think on τ²-bench while substantially reducing inference-time token usage.

What carries the argument

WRIT pipeline that generates tasks along two axes (write-decision count and per-decision evidence burden), diversifies instructions, and simulates interactions to produce trajectories.

If this is right

Agents learn to gather and compare substantial read-tool evidence before committing to write arguments.
Compact supervised fine-tuning data can encode part of what would otherwise require expensive test-time reasoning.
Inference token consumption drops because evidence-grounded decisions become internalized behavior.
Performance gains appear on benchmarks that test multi-turn intent inference and tool use under incomplete information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-axis synthesis could be applied to single-turn agents if read burden is isolated as the dominant variable.
Domains with high-stakes decisions under partial information, such as medical or legal assistants, might benefit from analogous read-write balancing.
If simulation fidelity is high, the approach points toward cheaper data pipelines that reduce dependence on frontier-model test-time compute for agent deployment.

Load-bearing premise

The simulated agent-user interactions produce trajectories whose distribution matches real user behavior and tool responses sufficiently for performance gains to transfer.

What would settle it

If a 4B model trained on WRIT trajectories fails to outperform the GPT-5.1 no-think baseline when evaluated on a held-out collection of real human multi-turn dialogues, the transfer claim does not hold.

Figures

Figures reproduced from arXiv: 2606.02908 by Hengrui Gu, Kaixiong Zhou, Xiaotian Han.

**Figure 1.** Figure 1: Overview of the WRIT pipeline. gold write-action sequence Agold, and a gold final state sgold. This subsection focuses entirely on task synthesis; the simulation that turns tasks into trajectories is introduced later in Section 3.3. We control task complexity through following two branches. 3.1.1 Write-intensive task synthesis. This branch synthesizes trajectories that cover the core write operations of t… view at source ↗

**Figure 2.** Figure 2: Passk curves for Qwen3-4B-Instruct-2507. The horizontal axis indexes k = 1, 2, 3, 4. contrast, WRIT outperforms GPT-5.1 no-think on both domains, improving the average Pass1 from 62.80 to 67.99, while using fewer output tokens. This suggests that our synthesized trajectories transfer part of the required evidence-gathering and policy-following behavior into the model parameters through SFT, allowing a sm… view at source ↗

**Figure 3.** Figure 3: Passk degradation curves for the full-size dataset comparison on τ 2 -bench using Qwen3-4B-Instruct-2507. The horizontal axis indexes k = 1, 2, 3, 4; panel titles report the number of evaluated tasks. Unlike the controlled 2K-budget main comparison, this setting trains each dataset at its available full scale, including APIGen-MT-5K, Simia-90K, CoVe-12K, AReaL-2K, and WRIT-2K. D Read-Heavy Subsets in τ 2 -… view at source ↗

**Figure 4.** Figure 4: Passk degradation curves on τ 2 -bench for Llama-3.1-8B-Instruct. The horizontal axis indexes k = 1, 2, 3, 4; panel titles report the number of evaluated tasks. 1 2 3 4 20 40 60 Pass k (%) Retail Full (n=114) APIGen-MT Simia CoVe AReaL WRIT 1 2 3 4 20 40 60 Pass k (%) Airline Full (n=50) APIGen-MT Simia CoVe AReaL WRIT 1 2 3 4 20 40 60 Pass k (%) Retail Hard (n=62) APIGen-MT Simia CoVe AReaL WRIT 1 2 3 4 0… view at source ↗

**Figure 5.** Figure 5: Passk degradation curves on τ 2 -bench for Qwen2.5-14B-Instruct. The horizontal axis indexes k = 1, 2, 3, 4; panel titles report the number of evaluated tasks. use. The base models are used for supervised finetuning and evaluation of tool-using agents, and the τ 2 -bench environments are used as benchmark settings for evaluating multi-turn user-facing taskcompletion agents. Public baseline datasets are u… view at source ↗

**Figure 6.** Figure 6: Passk curves for the ablation study on τ 2 -bench using Qwen3-4B-Instruct-2507. The horizontal axis indexes k = 1, 2, 3, 4; panel titles report the number of evaluated tasks. Domain # Tasks Task IDs Retail 62 2, 3, 4, 5, 8, 9, 19, 20, 21, 23, 24, 25, 26, 27, 29, 30, 31, 32, 35, 36, 37, 38, 45, 49, 53, 54, 55, 58, 62, 63, 64, 66, 68, 70, 71, 74, 76, 79, 81, 82, 83, 84, 85, 86, 87, 90, 91, 93, 94, 95, 98, 99… view at source ↗

read the original abstract

Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, tool calls, etc. Synthesizing sufficiently complex trajectory has become a central route to train agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution. We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (\uline{W}rite-\uline{R}ead \uline{I}ntensive \uline{T}rajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WRIT frames trajectory synthesis around both decision count and per-decision evidence load, which is a clean extension of prior composition work, but the headline result rests on an unevaluated simulation step and an abstract with zero experimental details.

read the letter

The main thing to know is that this paper adds a second axis to synthetic trajectory generation for agents: alongside the number of write decisions, it varies how much read-tool evidence each decision requires. The pipeline generates tasks along both axes, diversifies user instructions, and runs them in an executable simulator to produce training data. With 2K trajectories they report a 4B model beating GPT-5.1 no-think on τ²-bench while cutting inference tokens.

What is actually new is the explicit contrast with write-intensive methods and the claim that evidence burden per decision is a distinct source of difficulty that composition alone does not train. The motivation is reasonable: a single tool call can be hard if the agent must first gather and compare information from multiple reads.

The soft spot is that none of this can be checked from the given text. The abstract states the performance result but supplies no baseline list, no description of how τ²-bench was run, no statistics on the generated trajectories, and no evidence that the simulator was calibrated against real user logs or tool responses. The distributional-match assumption flagged in the stress-test note therefore sits unaddressed, and it is load-bearing for the transfer claim.

This is for groups training compact models on multi-turn agent tasks who already work with synthetic data. A reader focused on data-generation pipelines would get a usable framing even if the numbers need verification. The work is coherent on its own terms and the claim is specific enough to test, so it deserves a serious referee once the methods and results sections are expanded.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes WRIT, a pipeline to synthesize multi-turn agent trajectories along two axes: number of write decisions and evidence burden per decision. It generates write-intensive and read-heavy tasks, diversifies user behavior instructions, and simulates complete agent-user interactions in an executable environment to produce training data. The central claim is that fine-tuning a 4B model on only 2K such trajectories yields better performance than GPT-5.1 no-think on τ²-bench while reducing inference-time token usage.

Significance. If the empirical result holds after proper validation, the work would show that compact, targeted SFT data can convert expensive test-time reasoning into efficient learned behavior for evidence-grounded multi-turn agents, offering a practical route to improve agent performance without scaling model size or inference compute.

major comments (2)

[Abstract] Abstract: the performance claim that a 4B model trained on 2K WRIT trajectories outperforms GPT-5.1 no-think on τ²-bench is presented with no experimental details, baseline comparisons, statistical tests, run counts, or description of how τ²-bench was executed, so the central empirical result cannot be evaluated.
[WRIT pipeline description] WRIT pipeline (final simulation step): the headline result requires that simulated trajectories reproduce the joint distribution of user intents, repairs, and tool responses in τ²-bench, yet no calibration against real logs, sampling from observed distributions, or matching of trajectory statistics (turn length, tool-call entropy) is reported; this assumption is load-bearing for transfer.

minor comments (1)

[Abstract] The underlining used for the WRIT acronym expansion may not render reliably across formats; consider standard emphasis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We respond to each major point below, indicating planned revisions where the manuscript can be strengthened.

read point-by-point responses

Referee: [Abstract] Abstract: the performance claim that a 4B model trained on 2K WRIT trajectories outperforms GPT-5.1 no-think on τ²-bench is presented with no experimental details, baseline comparisons, statistical tests, run counts, or description of how τ²-bench was executed, so the central empirical result cannot be evaluated.

Authors: We agree that the abstract presents the headline result concisely and omits the supporting experimental details. The full description of the evaluation protocol, including how τ²-bench was executed, all baselines, run counts, and statistical tests, appears in Section 4 and Appendix C. We will revise the abstract to include a brief clause referencing the evaluation setup so readers can locate the supporting evidence immediately. revision: yes
Referee: [WRIT pipeline description] WRIT pipeline (final simulation step): the headline result requires that simulated trajectories reproduce the joint distribution of user intents, repairs, and tool responses in τ²-bench, yet no calibration against real logs, sampling from observed distributions, or matching of trajectory statistics (turn length, tool-call entropy) is reported; this assumption is load-bearing for transfer.

Authors: The final simulation step generates trajectories inside an executable environment whose task structure is defined by the same write-decision and evidence-burden axes used to create the benchmark tasks. While we did not report explicit calibration to external user logs or quantitative matching of statistics such as tool-call entropy, the generation procedure is intentionally aligned with τ²-bench requirements. We will add a short limitations paragraph in Section 3.3 discussing this design choice and its implications for transfer. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical synthesis and benchmarking pipeline is self-contained

full rationale

The paper describes a data-generation pipeline (task generation, instruction diversification, environment simulation) followed by SFT and external benchmarking on τ²-bench. No equations, fitted parameters, or self-citations are invoked to derive the headline performance numbers; the 2K-trajectory result is an observed training outcome, not a quantity forced by construction from the synthesis steps themselves. The distributional-match assumption is an unvalidated modeling choice but does not create a definitional or self-referential reduction inside the reported claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated premise that the generated tasks and simulated interactions are sufficiently realistic to produce transferable agent behavior; no free parameters, new entities, or additional axioms are visible in the abstract.

axioms (1)

domain assumption Synthesizing trajectories with controlled complexity improves downstream agent performance on benchmarks
Implicit foundation for the entire synthesis pipeline and the reported training result.

pith-pipeline@v0.9.1-grok · 5825 in / 1328 out tokens · 35845 ms · 2026-06-28T14:16:26.156150+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 24 canonical work pages · 15 internal anchors

[1]

Advances in Neural Information Processing Systems , volume=

Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay , author=. Advances in Neural Information Processing Systems , volume=
[2]

arXiv preprint arXiv:2511.01824 , year=

Simulating Environments with Reasoning Models for Agent Training , author=. arXiv preprint arXiv:2511.01824 , year=

work page arXiv
[3]

arXiv preprint arXiv:2603.01940 , year=

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification , author=. arXiv preprint arXiv:2603.01940 , year=

work page arXiv
[4]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

^2 -Bench: Evaluating Conversational Agents in a Dual-Control Environment , author=. arXiv preprint arXiv:2506.07982 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Effective Red-Teaming of Policy-Adherent Agents , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=. 2025 , publisher=

2025
[6]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author=. arXiv preprint arXiv:2406.12045 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents

Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents , author=. arXiv preprint arXiv:2601.20144 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP , pages=

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , author=. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP , pages=. 2018 , doi=

2018
[9]

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=

SQuAD: 100,000+ Questions for Machine Comprehension of Text , author=. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=. 2016 , doi=

2016
[10]

arXiv preprint arXiv:2507.22034 , year=

UserBench: An Interactive Gym Environment for User-Centric Agents , author=. arXiv preprint arXiv:2507.22034 , year=

work page arXiv
[11]

Zhao, Weikang and Wang, Xili and Ma, Chengdi and Kong, Lingbin and Yang, Zhaohua and Tuo, Mingxiang and Shi, Xiaowei and Zhai, Yitao and Cai, Xunliang , journal=
[12]

Qin, Tian and Bai, Felix and Hu, Ting-Yao and Vemulapalli, Raviteja and Koppula, Hema Swetha and Xu, Zhiyang and Jin, Bowen and Cemri, Mert and Lu, Jiarui and Wang, Zirui and Cao, Meng , journal=
[13]

Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks

Beyond Itinerary Planning: A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks , author=. arXiv preprint arXiv:2512.22673 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

arXiv preprint arXiv:2510.18170 , year=

AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI , author=. arXiv preprint arXiv:2510.18170 , year=

work page arXiv
[15]

Xu, Zhangchen and Soria, Adriana Meza and Tan, Shawn and Roy, Anurag and Agrawal, Ashish Sunil and Poovendran, Radha and Panda, Rameswar , journal=
[16]

Zeng, Xingshan and Liu, Weiwen and Wang, Lingzhi and Li, Liangyou and Mi, Fei and Wang, Yasheng and Shang, Lifeng and Jiang, Xin and Liu, Qun , journal=
[17]

Wang, Zhenting and Chang, Qi and Patel, Hemani and Biju, Shashank and Wu, Cheng-En and Liu, Quan and Ding, Aolin and Rezazadeh, Alireza and Shah, Ankit and Bao, Yujia and Siow, Eugene , journal=
[18]

SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

Burdisso, Sergio and Baroudi, S. arXiv preprint arXiv:2506.10622 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

arXiv preprint arXiv:2504.04736 , year=

Synthetic Data Generation and Multi-Step Reinforcement Learning for Reasoning and Tool Use , author=. arXiv preprint arXiv:2504.04736 , year=

work page arXiv
[20]

arXiv preprint arXiv:2601.22607 , year=

From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents , author=. arXiv preprint arXiv:2601.22607 , year=

work page arXiv
[21]

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration , author=. arXiv preprint arXiv:2604.02869 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Lu, Jiarui and Holleis, Thomas and Zhang, Yizhe and Aumayer, Bernhard and Nan, Feng and Bai, Felix and Ma, Shuang and Ma, Shen and Li, Mengyu and Yin, Guoli and Wang, Zirui and Pang, Ruoming , journal=
[23]

and Kapanipathi, Pavan , journal=

Basu, Kinjal and Abdelaziz, Ibrahim and Kate, Kiran and Agarwal, Mayank and Crouse, Maxwell and Rizk, Yara and Bradford, Kelsey and Munawar, Asim and Kumaravel, Sadhana and Goyal, Saurabh and Wang, Xin and Lastras, Luis A. and Kapanipathi, Pavan , journal=
[24]

Chen, Chen and Hao, Xinlong and Liu, Weiwen and Huang, Xu and Zeng, Xingshan and Yu, Shuai and Li, Dexun and Wang, Shuai and Gan, Weinan and Huang, Yuefeng and Liu, Wulong and Wang, Xinzhi and Lian, Defu and Yin, Baoqun and Wang, Yasheng and Liu, Wu , journal=
[25]

ReAct: Synergizing Reasoning and Acting in Language Models

React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Advances in neural information processing systems , volume=

Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=
[27]

arXiv preprint arXiv:2509.13311 , year=

Towards General Agentic Intelligence via Environment Scaling , author=. arXiv preprint arXiv:2509.13311 , year=

work page arXiv
[28]

Gorilla: Large Language Model Connected with Massive APIs

Gorilla: Large Language Model Connected with Massive APIs , author=. arXiv preprint arXiv:2305.15334 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Li, Minghao and Song, Feifan and Yu, Bowen and Yu, Haiyang and Li, Zhoujun and Huang, Fei and Li, Yongbin , booktitle=
[30]

Qin, Yujia and Liang, Shihao and Ye, Yining and Zhu, Kunlun and Yan, Lan and Lu, Yaxi and Lin, Yankai and Cong, Xin and Tang, Xiangru and Qian, Bill and Zhao, Sihan and Tian, Runchu and Xie, Ruobing and Zhou, Jie and Gerstein, Mark and Li, Dahai and Liu, Zhiyuan and Sun, Maosong , booktitle=
[31]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , year=

Budzianowski, Pawe. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , year=

2018
[32]

Proceedings of the AAAI Conference on Artificial Intelligence , year=

Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=
[33]

Taskmaster-1: Toward a realistic and diverse dialog dataset , author=. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages=

2019
[34]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Identifying the Risks of LM Agents with an LM-Emulated Sandbox , author=. arXiv preprint arXiv:2309.15817 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Agent-SafetyBench: Evaluating the Safety of LLM Agents

Agent-safetybench: Evaluating the safety of llm agents , author=. arXiv preprint arXiv:2412.14470 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

Toolalpaca: Generalized tool learning for language models with 3000 simulated cases , author=. arXiv preprint arXiv:2306.05301 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

arXiv preprint arXiv:2407.03502 , year=

Agentinstruct: Toward generative teaching with agentic flows , author=. arXiv preprint arXiv:2407.03502 , year=

work page arXiv
[38]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[40]

The Llama 3 Herd of Models

The Llama 3 Herd of Models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=. 2024 , url=

2024
[42]

International Conference on Learning Representations , volume=

Webarena: A realistic web environment for building autonomous agents , author=. International Conference on Learning Representations , volume=
[43]

Koh, Jing Yu and Lo, Robert and Jang, Lawrence and Duvvur, Vikram and Lim, Ming Chong and Huang, Po-Yao and Neubig, Graham and Zhou, Shuyan and Salakhutdinov, Ruslan and Fried, Daniel , journal=
[44]

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Workarena: How capable are web agents at solving common knowledge work tasks? , author=. arXiv preprint arXiv:2403.07718 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Appworld: A controllable world of apps and people for benchmarking interactive coding agents , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[46]

and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal=

Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal=
[47]

and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik and Press, Ofir , journal=

Yang, John and Jimenez, Carlos E. and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik and Press, Ofir , journal=
[48]

Findings of the Association for Computational Linguistics: EMNLP 2024 , year=

Multi-trait User Simulation with Adaptive Decoding for Conversational Task Assistants , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , year=

2024
[49]

is_unstable

Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models , author=. arXiv preprint arXiv:2502.12813 , year=

work page arXiv
[50]

Computer Speech & Language , volume=

Prompting Large Language Models for User Simulation in Task-Oriented Dialogue Systems , author=. Computer Speech & Language , volume=. 2025 , publisher=

2025
[51]

IEEE Transactions on Computational Social Systems , volume=

Are Current Task-Oriented Dialogue Systems Able to Satisfy Impolite Users? , author=. IEEE Transactions on Computational Social Systems , volume=. 2025 , doi=

2025

[1] [1]

Advances in Neural Information Processing Systems , volume=

Apigen-mt: Agentic pipeline for multi-turn data generation via simulated agent-human interplay , author=. Advances in Neural Information Processing Systems , volume=

[2] [2]

arXiv preprint arXiv:2511.01824 , year=

Simulating Environments with Reasoning Models for Agent Training , author=. arXiv preprint arXiv:2511.01824 , year=

work page arXiv

[3] [3]

arXiv preprint arXiv:2603.01940 , year=

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification , author=. arXiv preprint arXiv:2603.01940 , year=

work page arXiv

[4] [4]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

^2 -Bench: Evaluating Conversational Agents in a Dual-Control Environment , author=. arXiv preprint arXiv:2506.07982 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Effective Red-Teaming of Policy-Adherent Agents , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=. 2025 , publisher=

2025

[6] [6]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author=. arXiv preprint arXiv:2406.12045 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents

Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents , author=. arXiv preprint arXiv:2601.20144 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP , pages=

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , author=. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP , pages=. 2018 , doi=

2018

[9] [9]

Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=

SQuAD: 100,000+ Questions for Machine Comprehension of Text , author=. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=. 2016 , doi=

2016

[10] [10]

arXiv preprint arXiv:2507.22034 , year=

UserBench: An Interactive Gym Environment for User-Centric Agents , author=. arXiv preprint arXiv:2507.22034 , year=

work page arXiv

[11] [11]

Zhao, Weikang and Wang, Xili and Ma, Chengdi and Kong, Lingbin and Yang, Zhaohua and Tuo, Mingxiang and Shi, Xiaowei and Zhai, Yitao and Cai, Xunliang , journal=

[12] [12]

Qin, Tian and Bai, Felix and Hu, Ting-Yao and Vemulapalli, Raviteja and Koppula, Hema Swetha and Xu, Zhiyang and Jin, Bowen and Cemri, Mert and Lu, Jiarui and Wang, Zirui and Cao, Meng , journal=

[13] [13]

Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks

Beyond Itinerary Planning: A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks , author=. arXiv preprint arXiv:2512.22673 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

arXiv preprint arXiv:2510.18170 , year=

AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI , author=. arXiv preprint arXiv:2510.18170 , year=

work page arXiv

[15] [15]

Xu, Zhangchen and Soria, Adriana Meza and Tan, Shawn and Roy, Anurag and Agrawal, Ashish Sunil and Poovendran, Radha and Panda, Rameswar , journal=

[16] [16]

Zeng, Xingshan and Liu, Weiwen and Wang, Lingzhi and Li, Liangyou and Mi, Fei and Wang, Yasheng and Shang, Lifeng and Jiang, Xin and Liu, Qun , journal=

[17] [17]

Wang, Zhenting and Chang, Qi and Patel, Hemani and Biju, Shashank and Wu, Cheng-En and Liu, Quan and Ding, Aolin and Rezazadeh, Alireza and Shah, Ankit and Bao, Yujia and Siow, Eugene , journal=

[18] [18]

SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

Burdisso, Sergio and Baroudi, S. arXiv preprint arXiv:2506.10622 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

arXiv preprint arXiv:2504.04736 , year=

Synthetic Data Generation and Multi-Step Reinforcement Learning for Reasoning and Tool Use , author=. arXiv preprint arXiv:2504.04736 , year=

work page arXiv

[20] [20]

arXiv preprint arXiv:2601.22607 , year=

From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents , author=. arXiv preprint arXiv:2601.22607 , year=

work page arXiv

[21] [21]

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration , author=. arXiv preprint arXiv:2604.02869 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Lu, Jiarui and Holleis, Thomas and Zhang, Yizhe and Aumayer, Bernhard and Nan, Feng and Bai, Felix and Ma, Shuang and Ma, Shen and Li, Mengyu and Yin, Guoli and Wang, Zirui and Pang, Ruoming , journal=

[23] [23]

and Kapanipathi, Pavan , journal=

Basu, Kinjal and Abdelaziz, Ibrahim and Kate, Kiran and Agarwal, Mayank and Crouse, Maxwell and Rizk, Yara and Bradford, Kelsey and Munawar, Asim and Kumaravel, Sadhana and Goyal, Saurabh and Wang, Xin and Lastras, Luis A. and Kapanipathi, Pavan , journal=

[24] [24]

Chen, Chen and Hao, Xinlong and Liu, Weiwen and Huang, Xu and Zeng, Xingshan and Yu, Shuai and Li, Dexun and Wang, Shuai and Gan, Weinan and Huang, Yuefeng and Liu, Wulong and Wang, Xinzhi and Lian, Defu and Yin, Baoqun and Wang, Yasheng and Liu, Wu , journal=

[25] [25]

ReAct: Synergizing Reasoning and Acting in Language Models

React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Advances in neural information processing systems , volume=

Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=

[27] [27]

arXiv preprint arXiv:2509.13311 , year=

Towards General Agentic Intelligence via Environment Scaling , author=. arXiv preprint arXiv:2509.13311 , year=

work page arXiv

[28] [28]

Gorilla: Large Language Model Connected with Massive APIs

Gorilla: Large Language Model Connected with Massive APIs , author=. arXiv preprint arXiv:2305.15334 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

Li, Minghao and Song, Feifan and Yu, Bowen and Yu, Haiyang and Li, Zhoujun and Huang, Fei and Li, Yongbin , booktitle=

[30] [30]

Qin, Yujia and Liang, Shihao and Ye, Yining and Zhu, Kunlun and Yan, Lan and Lu, Yaxi and Lin, Yankai and Cong, Xin and Tang, Xiangru and Qian, Bill and Zhao, Sihan and Tian, Runchu and Xie, Ruobing and Zhou, Jie and Gerstein, Mark and Li, Dahai and Liu, Zhiyuan and Sun, Maosong , booktitle=

[31] [31]

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , year=

Budzianowski, Pawe. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , year=

2018

[32] [32]

Proceedings of the AAAI Conference on Artificial Intelligence , year=

Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=

[33] [33]

Taskmaster-1: Toward a realistic and diverse dialog dataset , author=. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages=

2019

[34] [34]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Identifying the Risks of LM Agents with an LM-Emulated Sandbox , author=. arXiv preprint arXiv:2309.15817 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[35] [35]

Agent-SafetyBench: Evaluating the Safety of LLM Agents

Agent-safetybench: Evaluating the safety of llm agents , author=. arXiv preprint arXiv:2412.14470 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

Toolalpaca: Generalized tool learning for language models with 3000 simulated cases , author=. arXiv preprint arXiv:2306.05301 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

arXiv preprint arXiv:2407.03502 , year=

Agentinstruct: Toward generative teaching with agentic flows , author=. arXiv preprint arXiv:2407.03502 , year=

work page arXiv

[38] [38]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[39] [39]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[40] [40]

The Llama 3 Herd of Models

The Llama 3 Herd of Models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=. 2024 , url=

2024

[42] [42]

International Conference on Learning Representations , volume=

Webarena: A realistic web environment for building autonomous agents , author=. International Conference on Learning Representations , volume=

[43] [43]

Koh, Jing Yu and Lo, Robert and Jang, Lawrence and Duvvur, Vikram and Lim, Ming Chong and Huang, Po-Yao and Neubig, Graham and Zhou, Shuyan and Salakhutdinov, Ruslan and Fried, Daniel , journal=

[44] [44]

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Workarena: How capable are web agents at solving common knowledge work tasks? , author=. arXiv preprint arXiv:2403.07718 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Appworld: A controllable world of apps and people for benchmarking interactive coding agents , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[46] [46]

and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal=

Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal=

[47] [47]

and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik and Press, Ofir , journal=

Yang, John and Jimenez, Carlos E. and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik and Press, Ofir , journal=

[48] [48]

Findings of the Association for Computational Linguistics: EMNLP 2024 , year=

Multi-trait User Simulation with Adaptive Decoding for Conversational Task Assistants , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , year=

2024

[49] [49]

is_unstable

Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models , author=. arXiv preprint arXiv:2502.12813 , year=

work page arXiv

[50] [50]

Computer Speech & Language , volume=

Prompting Large Language Models for User Simulation in Task-Oriented Dialogue Systems , author=. Computer Speech & Language , volume=. 2025 , publisher=

2025

[51] [51]

IEEE Transactions on Computational Social Systems , volume=

Are Current Task-Oriented Dialogue Systems Able to Satisfy Impolite Users? , author=. IEEE Transactions on Computational Social Systems , volume=. 2025 , doi=

2025