Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents
Pith reviewed 2026-05-16 11:11 UTC · model grok-4.3
The pith
Fine-tuning lightweight LLMs on successful tool trajectories from a synthesis pipeline boosts performance on complex user intents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By first collecting valid tool-call trajectories via multi-turn exploration and then adapting them into controlled user tasks, the Trajectory2Task approach produces data that supports both evaluation and fine-tuning. Using successful trajectories from this process to fine-tune lightweight LLMs results in consistent improvements across ambiguous, changing, and infeasible intent conditions, along with stronger generalization to unseen tool-use domains.
What carries the argument
The Trajectory2Task pipeline, which conducts multi-turn exploration to produce valid tool-call trajectories and then converts them into verifiable user-facing tasks with controlled intent adaptations.
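A minimal sketch of the two-stage shape described above, under stated assumptions: the class names, the `writes` flag, and the rule that a trajectory is valid when every call succeeded are all hypothetical and not taken from the paper. It only illustrates why deriving tasks from already-valid trajectories leaves each task with a known expected outcome.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str     # hypothetical tool name
    args: dict
    ok: bool      # did the (hypothetical) environment accept the call?
    writes: bool  # does the call change environment state?


@dataclass
class Trajectory:
    calls: list = field(default_factory=list)

    def is_valid(self) -> bool:
        # Assumption: valid means non-empty and every call succeeded.
        return bool(self.calls) and all(c.ok for c in self.calls)


def trajectory_to_task(traj: Trajectory) -> dict:
    """Derive a task whose expected write actions are known in advance."""
    write_calls = [c for c in traj.calls if c.writes]
    return {
        "expected_actions": [(c.name, c.args) for c in write_calls],
        # Placeholder user-facing text; a real pipeline would generate prose.
        "reason_for_call": "; ".join(f"{c.name} {c.args}" for c in write_calls),
    }


def build_tasks(explored: list) -> list:
    # Stage 1 output is filtered for validity, then stage 2 converts each one.
    return [trajectory_to_task(t) for t in explored if t.is_valid()]
```

Because the expected actions come from a trajectory that already ran successfully, the resulting task can be checked mechanically rather than by opinion.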
If this is right
- Fine-tuned lightweight LLMs achieve higher success rates on tasks with ambiguous user intents.
- Models show improved ability to manage requests whose intent changes across conversation turns.
- Performance rises on infeasible requests that violate policy constraints.
- The same models generalize better to tool-use domains absent from the fine-tuning data.
- Closed-loop evaluation becomes feasible at scale because every generated task remains verifiable.
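One way to picture what "verifiable" could mean in a closed loop is to compare final environment state after the agent's calls against the state produced by a reference trajectory. The sketch below assumes that reading. The toy database layout, the `cancel_order` tool, and the `verify` helper are hypothetical and are not the paper's checker.

```python
import copy


def apply_calls(db: dict, calls: list) -> dict:
    """Replay (name, args) tool calls on a copy of a toy database."""
    state = copy.deepcopy(db)
    for name, args in calls:
        if name == "cancel_order":  # hypothetical write tool
            state["orders"][args["order_id"]]["status"] = "cancelled"
        # Any other tool is treated as read-only and leaves state unchanged.
    return state


def verify(db: dict, reference_calls: list, agent_calls: list) -> bool:
    """Pass when the agent reaches the same final state as the reference."""
    return apply_calls(db, reference_calls) == apply_calls(db, agent_calls)
```

Checking final effects rather than the exact call sequence lets an agent take a longer path, such as extra lookups, and still pass.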
Where Pith is reading between the lines
- The pipeline could be looped with deployed agents so that real interactions continuously supply new trajectories for further fine-tuning.
- Similar trajectory-to-task conversion might apply to training agents in other interactive settings such as web navigation or code debugging.
- If the synthetic data matches real distributions closely enough, it could reduce the volume of human-annotated examples needed for robust agent development.
Load-bearing premise
The synthesized trajectories and controlled intent adaptations accurately capture the distribution and difficulty of real-world complex user intents without introducing artifacts that inflate measured improvements.
What would settle it
Testing the fine-tuned models on a held-out collection of actual customer conversations with ambiguous or policy-constrained requests and finding no performance gain or worse generalization than the base models would falsify the central claim.
Original abstract
Tool-calling agents are increasingly deployed in real-world customer-facing workflows. Yet most studies on tool-calling agents focus on idealized settings with general, fixed, and well-specified tasks. In real-world applications, user requests are often (1) ambiguous, (2) changing over time, or (3) infeasible due to policy constraints, and training and evaluation data covering these diverse, complex interaction patterns remain scarce. To bridge the gap, we present Trajectory2Task, a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios: ambiguous intent, changing intent, and infeasible intent. The pipeline first conducts multi-turn exploration to produce valid tool-call trajectories. It then converts these trajectories into user-facing tasks with controlled intent adaptations. This process yields verifiable tasks that support closed-loop evaluation and training. We benchmark seven state-of-the-art LLMs on the generated complex user scenario tasks and observe frequent failures. Finally, using successful trajectories obtained from task rollouts, we fine-tune lightweight LLMs and find consistent improvements across all three conditions, along with better generalization to unseen tool-use domains, indicating stronger tool-calling ability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Trajectory2Task, a pipeline that first rolls out multi-turn tool-call trajectories via exploration and then applies controlled intent adaptations to convert them into verifiable user-facing tasks covering ambiguous, changing, and infeasible intents. It benchmarks seven LLMs on the resulting tasks (reporting frequent failures), then fine-tunes lightweight LLMs on successful trajectories and claims consistent improvements across all three conditions plus better generalization to unseen tool-use domains.
Significance. If the synthesized tasks faithfully capture the distribution of real-world complex intents without systematic artifacts, the work would offer a scalable, closed-loop method for generating training data and evaluation benchmarks, addressing a clear gap in tool-calling research that has focused on idealized fixed tasks.
major comments (1)
- [Abstract] The central claim of 'consistent improvements across all three conditions, along with better generalization to unseen tool-use domains' is presented without any quantitative metrics, baselines, statistical tests, or details on verification procedures. These details are load-bearing for interpreting the fine-tuning results as evidence of stronger tool-calling ability rather than adaptation to pipeline-specific patterns.
minor comments (1)
- The description of the 'controlled intent adaptations' step would benefit from concrete examples or pseudocode to clarify how ambiguity, change, and infeasibility are introduced while preserving verifiability.
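To make the referee's request concrete, here is a hedged sketch of what three adaptations might look like on a task dictionary. The field names (`known_info`, `unknown_info`, `expected_actions`, `turns`, `infeasible_reason`, `actions_should_not_be_taken`) and the transformation rules are illustrative assumptions, not the authors' implementation. The point is that each adaptation rewrites the user side of the task while keeping an explicit, checkable target.

```python
def make_ambiguous(task: dict, withheld: str) -> dict:
    """Hide one piece of information so the agent has to ask for it."""
    out = dict(task)
    out["known_info"] = [k for k in task["known_info"] if k != withheld]
    out["unknown_info"] = task.get("unknown_info", []) + [withheld]
    out["intent_type"] = "ambiguous"
    return out  # expected_actions unchanged, so the check still applies


def make_changing(task: dict, initial_request: str) -> dict:
    """User opens with a different request, then switches to the real one."""
    out = dict(task)
    out["turns"] = [initial_request, task["reason_for_call"]]
    out["intent_type"] = "changing"
    return out  # only the final intent determines expected_actions


def make_infeasible(task: dict, reason: str) -> dict:
    """Mark the request as disallowed: the right outcome is no write at all."""
    out = dict(task)
    out["actions_should_not_be_taken"] = list(task["expected_actions"])
    out["expected_actions"] = []
    out["infeasible_reason"] = reason
    out["intent_type"] = "infeasible"
    return out
```

In this sketch the ambiguous and changing variants keep the original expected actions, while the infeasible variant inverts them into a forbidden list, so all three stay mechanically checkable.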
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the abstract. We address the concern point-by-point below and will revise the manuscript to improve clarity.
- Referee: [Abstract] The central claim of 'consistent improvements across all three conditions, along with better generalization to unseen tool-use domains' is presented without any quantitative metrics, baselines, statistical tests, or details on verification procedures, which is load-bearing for interpreting the fine-tuning results as evidence of stronger tool-calling ability rather than adaptation to pipeline-specific patterns.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. The main paper reports these details in Sections 4.2 and 5 (Tables 2-4), where we compare fine-tuned models against the original base LLMs using the same evaluation protocol. In the revised abstract we will add the specific average improvements (approximately +12-25% absolute success rate across the three conditions) and note the generalization lift on held-out tool domains. Baselines are the untuned models under identical prompting; we will also reference the verification procedure (automated trajectory validity checks plus human review of 200 samples for intent fidelity) already described in Section 3.2. We did not run formal statistical tests in the original submission but will add paired t-test p-values in the revision to address this directly.
  Revision: yes
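For readers unfamiliar with the test the simulated authors promise, the sketch below computes a paired t statistic from matched base and fine-tuned scores using only the standard library. The function name and the idea of pairing per evaluation split are assumptions, and any numbers passed in would be the reader's own; nothing here reproduces results from the paper.

```python
import math
import statistics


def paired_t_statistic(base: list, tuned: list) -> tuple:
    """Return (t, degrees_of_freedom) for matched score pairs."""
    if len(base) != len(tuned) or len(base) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [t - b for b, t in zip(base, tuned)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1
```

Turning the statistic into a p-value requires the t distribution's CDF, which the standard library does not provide; `scipy.stats.ttest_rel` is one commonly used option. With only three intent conditions as pairs the test would have very little power, so pairing at the task or domain level may be more informative.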
Circularity Check
No significant circularity in derivation or results
The paper describes an empirical pipeline that first generates trajectories via multi-turn rollouts, applies controlled intent adaptations to create verifiable tasks, benchmarks LLMs on those tasks, and then fine-tunes on successful trajectories with evaluation on held-out rollouts. No equations, fitted parameters, or self-citations are presented as load-bearing for the central claims. The reported improvements and generalization are measured outcomes on independently generated evaluation data rather than quantities defined by or forced from the training inputs themselves, so the derivation chain remains self-contained.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
- DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
  DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.