Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks

Lide Tan; Lu Xu; Xiang Cheng; Xiangwen Zhang; Xin Li; Yong Liu; Yulan Hu; Zheng Pan

arxiv: 2512.22673 · v3 · submitted 2025-12-27 · 💻 cs.AI

Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Lide Tan, Zheng Pan, Xin Li, Yong Liu

Pith reviewed 2026-05-16 18:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords travel planning · LLM agents · benchmark · multi-turn conversation · tool use · preference elicitation · capability boundaries

The pith

TravelBench evaluates LLMs on realistic multi-turn travel planning, tool use, and boundary recognition using real-world data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TravelBench to better match actual travel planning needs by collecting authentic user queries, preferences, and tools. It splits the evaluation into three parts: single-turn tasks for solving problems alone, multi-turn for drawing out hidden preferences through conversation, and unsolvable cases to test when agents should admit limits. Tests on current models show they handle some parts well but not others, and checks confirm the benchmark gives consistent results. A sandbox with ten fixed travel tools makes runs repeatable without live API calls.

Core claim

We propose TravelBench, a benchmark for truly real-world travel planning. We collect user queries, user preferences, and tools from real scenarios, and construct three subtasks -- Single-Turn, Multi-Turn, and Unsolvable -- to evaluate agents' three core capabilities in real settings: (1) solving problems independently, (2) interacting with users to elicit implicit preferences, and (3) recognizing the capability boundaries. To enable stable tool invocation and reproducible evaluation, we cache real tool-call results and build a sandbox environment which integrates ten travel-related tools. We evaluate multiple LLMs on TravelBench and find that even advanced models exhibit imbalanced performance across different capabilities.

What carries the argument

The three subtasks Single-Turn, Multi-Turn, and Unsolvable, supported by a sandbox environment with ten cached travel-related tools for stable and reproducible evaluation.
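To make the reproducibility argument concrete, here is a minimal sketch of a cache-backed tool sandbox. It is not the authors' implementation: the class name, the key scheme, and the optional fallback hook (standing in for something like a tool simulator on cache misses) are all assumptions made for illustration.

```python
import json
from typing import Any, Callable, Dict, Optional


def cache_key(tool_name: str, params: Dict[str, Any]) -> str:
    """Build a deterministic key from the tool name and its arguments."""
    # sort_keys makes the key independent of argument order.
    encoded = json.dumps(params, sort_keys=True, ensure_ascii=False)
    return f"{tool_name}:{encoded}"


class CachedToolSandbox:
    """Replays recorded tool results instead of calling live services (hypothetical)."""

    def __init__(
        self, fallback: Optional[Callable[[str, Dict[str, Any]], str]] = None
    ) -> None:
        self._cache: Dict[str, str] = {}
        # Assumed hook for cache misses, e.g. a simulator that imitates the tool.
        self._fallback = fallback

    def record(self, tool_name: str, params: Dict[str, Any], result: str) -> None:
        """Store a real tool-call result for later replay."""
        self._cache[cache_key(tool_name, params)] = result

    def call(self, tool_name: str, params: Dict[str, Any]) -> str:
        """Return the cached result, or fall back and memoize on a miss."""
        key = cache_key(tool_name, params)
        if key in self._cache:
            return self._cache[key]
        if self._fallback is None:
            raise KeyError(f"no cached result for {key}")
        result = self._fallback(tool_name, params)
        self._cache[key] = result  # keep repeated runs consistent
        return result
```

Because identical calls always return identical results, two evaluation runs of the same agent see the same tool outputs, which is the property the sandbox is meant to provide.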

Load-bearing premise

The collected real-world queries, preferences, and cached tool results sufficiently represent the full range of practical travel planning problems and that the three subtasks adequately capture the core agent capabilities.

What would settle it

A new set of travel queries where multiple advanced models achieve uniformly high performance across all three subtasks in the provided sandbox would contradict the reported imbalance.
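One way to make this test checkable is to define "imbalance" as the spread between a model's best and worst subtask scores. The sketch below assumes that definition; the function names, the threshold, and the maximum gap are illustrative choices, not values from the paper.

```python
from typing import Dict


def capability_gap(scores: Dict[str, float]) -> float:
    """Spread between the best and worst subtask score for one model."""
    return max(scores.values()) - min(scores.values())


def is_uniformly_high(
    scores: Dict[str, float], threshold: float, max_gap: float
) -> bool:
    """True when every subtask clears the threshold and the spread stays small."""
    return min(scores.values()) >= threshold and capability_gap(scores) <= max_gap


# Made-up numbers for illustration only; these are not results from the paper.
example = {"single_turn": 80.0, "multi_turn": 60.0, "unsolvable": 50.0}
print(capability_gap(example))                 # 30.0
print(is_uniformly_high(example, 70.0, 10.0))  # False
```

Under this reading, the reported imbalance would be contradicted if several advanced models satisfied `is_uniformly_high` on fresh queries for reasonable settings of the two parameters.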

Figures

Figures reproduced from arXiv: 2512.22673 by Lide Tan, Lu Xu, Xiang Cheng, Xiangwen Zhang, Xin Li, Yong Liu, Yulan Hu, Zheng Pan.

**Figure 1.** Overview of TravelBench. The user is simulated by an LLM with a user profile and contextual information, while the agent is given the same context and access to external tools. We define three settings: Single-turn, where the agent may perform multi-step tool use without interacting with the user; Multi-turn, where the agent may both use tools and conduct multi-round dialogue to request missing information…

**Figure 2.** Distribution of LLM scores across different…

**Figure 3.** Distribution plot of the average number of…

**Figure 4.** Distribution of reasoning steps in the Single…

**Figure 5.** Distribution of interaction turns in the Multi…

**Figure 6.** Data distribution across three sub-tasks: (a) Single-turn, (b) Multi-turn, and (c) Unsolvable tasks. Each…

**Figure 7.** Distribution of cached tool calls in our sandbox environment.

**Figure 8.** A representative instance of our data. It showcases the raw query (preserving colloquial grammar), the…

**Figure 9.** Prompt template for profile refinement, preference summarization, and privacy desensitization.

**Figure 10.** Prompt Template for Unsolvability Determination

**Figure 11.** Prompt Template for Travel Assistant Multi-Turn Subtask

**Figure 12.** Prompt Template for User Simulator

**Figure 13.** Prompt Template for Travel Assistant Single-Turn Subtask

**Figure 14.** Prompt Template for Travel Assistant Unsolvable Subtask

**Figure 15.** Prompt Template for Tool Simulator

**Figure 16.** LLM-Judge Prompt Template for Single-Turn Subtask

**Figure 17.** LLM-Judge Prompt Template for Multi-Turn Subtask

**Figure 18.** Prompt Template for Meta LLM Judge

read the original abstract

Travel planning is a natural real-world task to test large language models' (LLMs) planning and tool-use abilities. Although prior work has studied LLM performance on travel planning, existing settings still differ from real-world needs, mainly due to limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and a lack of evaluation of agents' capability boundaries. To mitigate these gaps, we propose TravelBench, a benchmark for truly real-world travel planning. We collect user queries, user preferences, and tools from real scenarios, and construct three subtasks -- Single-Turn, Multi-Turn, and Unsolvable -- to evaluate agents' three core capabilities in real settings: (1) solving problems independently, (2) interacting with users to elicit implicit preferences, and (3) recognizing the capability boundaries. To enable stable tool invocation and reproducible evaluation, we cache real tool-call results and build a sandbox environment which integrates ten travel-related tools, enabling agents to combine these tools to solve most practical travel planning problems. We evaluate multiple LLMs on TravelBench and find that even advanced models exhibit imbalanced performance across different capabilities. Our further systematic verification demonstrates the stability of the proposed benchmark. TravelBench provides a practical and reproducible benchmark to advance research on LLM agents for real-world travel planning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TravelBench adds multi-turn preference elicitation and unsolvable-case detection to travel agent benchmarks using real queries and a cached-tool sandbox, but its representativeness claim rests on limited coverage details.

read the letter

The main thing to know is that this paper builds TravelBench around three subtasks for LLM agents in travel planning: single-turn independent solving, multi-turn interaction to draw out implicit preferences, and recognizing when a query falls outside the agent's capabilities. They pull queries and preferences from real scenarios, cache outputs from ten travel tools, and run stability checks to support reproducible evaluation. This setup moves past the single-turn itinerary focus in earlier work and shows uneven model performance across the three areas.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces TravelBench as a benchmark for real-world multi-turn travel planning tasks that require tool use. It gathers queries, preferences, and tools from actual scenarios to create three subtasks: Single-Turn for independent solving, Multi-Turn for preference elicitation, and Unsolvable for boundary recognition. A sandbox environment with ten pre-cached travel tools supports reproducible agent evaluations. The authors evaluate several LLMs, observe imbalanced capabilities, and verify the benchmark's stability through systematic checks.

Significance. TravelBench addresses important limitations in existing travel planning benchmarks by incorporating multi-turn interactions and explicit boundary testing. The construction from real-world data and the provision of a stable sandbox with cached tool results are notable strengths that enhance reproducibility. If the dataset proves representative, this benchmark could serve as a standard for assessing and improving LLM agents in practical planning scenarios, particularly by revealing imbalances in model capabilities.

major comments (1)

[Data Collection] The central claim that TravelBench is a 'truly real-world' benchmark depends on the representativeness of the collected user queries, preferences, and cached tool results. However, no quantitative metrics are reported on aspects such as query diversity, geographic distribution, preference complexity, or comparisons to external travel planning corpora. This omission is load-bearing because the observed imbalanced performance across subtasks could be influenced by dataset-specific characteristics rather than reflecting general agent capabilities.

minor comments (2)

[Abstract] The abstract mentions data collection 'from real scenarios' and 'systematic verification' but provides limited specifics on filtering processes or exact metric definitions used in the subtasks.
[Evaluation] Clarify how the success criteria for the Unsolvable subtask are operationalized to ensure they accurately measure boundary recognition.
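As one possible operationalization of the Unsolvable criterion, the sketch below scores boundary recognition from the `[Unsolved]` marker that the Unsolvable-subtask prompt asks the agent to output. The two metrics and all function names are assumptions for illustration; the paper's actual scoring rule may differ.

```python
from typing import Dict, List

UNSOLVED_MARKER = "[Unsolved]"  # marker named in the Unsolvable-subtask prompt


def flags_unsolvable(response: str) -> bool:
    """True if the agent's final response declares the query unsolvable."""
    return UNSOLVED_MARKER in response


def boundary_metrics(responses: List[str], is_unsolvable: List[bool]) -> Dict[str, float]:
    """Recall on truly unsolvable queries and false-refusal rate on solvable ones."""
    if len(responses) != len(is_unsolvable):
        raise ValueError("responses and labels must have the same length")
    flagged = [flags_unsolvable(r) for r in responses]
    n_unsolvable = sum(is_unsolvable)
    n_solvable = len(is_unsolvable) - n_unsolvable
    hits = sum(f and u for f, u in zip(flagged, is_unsolvable))
    false_refusals = sum(f and not u for f, u in zip(flagged, is_unsolvable))
    return {
        "unsolvable_recall": hits / n_unsolvable if n_unsolvable else 0.0,
        "false_refusal_rate": false_refusals / n_solvable if n_solvable else 0.0,
    }
```

Reporting both numbers guards against a degenerate agent that outputs the marker for every query, which would score perfect recall while refusing all solvable requests.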

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on TravelBench. The point on data representativeness is well-taken, and we address it directly below.

read point-by-point responses

Referee: [Data Collection] The central claim that TravelBench is a 'truly real-world' benchmark depends on the representativeness of the collected user queries, preferences, and cached tool results. However, no quantitative metrics are reported on aspects such as query diversity, geographic distribution, preference complexity, or comparisons to external travel planning corpora. This omission is load-bearing because the observed imbalanced performance across subtasks could be influenced by dataset-specific characteristics rather than reflecting general agent capabilities.

Authors: We agree that explicit quantitative metrics would strengthen the representativeness claim. The original manuscript describes collection from real user scenarios and APIs but does not include diversity statistics or external comparisons. In the revision we will add a dedicated subsection (or appendix) reporting: query topic distribution, geographic coverage (cities and countries represented), preference complexity (average constraints per query), and a qualitative/quantitative comparison to prior travel-planning corpora. We will also include an analysis showing that capability imbalances persist across data subsets and model families, reducing the likelihood that results are purely dataset artifacts. These additions directly address the load-bearing concern. revision: yes
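The statistics promised in this response are simple to compute once each instance carries a topic label and a list of constraints. The sketch below assumes hypothetical `topic` and `constraints` fields; these are not the benchmark's documented schema.

```python
from collections import Counter
from statistics import mean
from typing import Any, Dict, List


def topic_distribution(instances: List[Dict[str, Any]]) -> Dict[str, float]:
    """Share of instances per topic label (hypothetical 'topic' field)."""
    counts = Counter(inst["topic"] for inst in instances)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}


def mean_constraints(instances: List[Dict[str, Any]]) -> float:
    """Average number of constraints per query (hypothetical 'constraints' field)."""
    return mean(len(inst["constraints"]) for inst in instances)


# Toy instances for illustration only; not drawn from the benchmark.
toy = [
    {"topic": "routing", "constraints": ["avoid tolls"]},
    {"topic": "routing", "constraints": ["by train", "before noon", "under 500 CNY"]},
    {"topic": "dining", "constraints": ["vegetarian"]},
    {"topic": "dining", "constraints": ["near hotel", "open late", "quiet"]},
]
print(topic_distribution(toy))  # {'routing': 0.5, 'dining': 0.5}
print(mean_constraints(toy))    # 2
```

Running the same two functions on each data subset would also support the per-subset imbalance analysis the authors describe.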

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper constructs TravelBench by collecting queries, preferences, and tool results from external real scenarios, then defines three subtasks (Single-Turn, Multi-Turn, Unsolvable) to probe independent solving, preference elicitation, and boundary recognition. Evaluation of LLMs and stability verification follow directly from running agents in the sandbox environment. No equations, fitted parameters, or self-citations reduce any claim to the inputs by construction; the benchmark rests on external data rather than internal re-derivation or renaming of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of collected real-world data and the sufficiency of the ten tools; no free parameters are fitted to produce the benchmark itself.

axioms (2)

domain assumption: Real user queries and implicit preferences collected from actual scenarios can be faithfully represented in a benchmark without significant loss of realism.
Invoked to support the claim of 'truly real-world' coverage.
domain assumption: The ten travel-related tools, when combined, enable agents to solve most practical travel planning problems.
Stated directly in the abstract as justification for the sandbox design.

pith-pipeline@v0.9.0 · 5567 in / 1306 out tokens · 45184 ms · 2026-05-16T18:47:37.899187+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
cs.CL 2026-05 unverdicted novelty 7.0

TransitLM is a large-scale dataset and benchmark for training LLMs to generate structurally valid map-free transit routes from origin-destination pairs.

TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents
cs.CY 2026-05 unverdicted novelty 7.0

TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 2 Pith papers · 3 internal anchors

[1]

Juhyun Oh, Eunsu Kim, and Alice Oh

Deeptravel: An end-to-end agentic reinforcement learning framework for autonomous travel planning agents. arXiv preprint arXiv:2509.21842. Juhyun Oh, Eunsu Kim, and Alice Oh. 2025. Flex-travelplanner: A benchmark for flexible planning with language agents. arXiv preprint arXiv:2506.04649. OpenAI. 2025. GPT-5.1. https://openai.com/zh-Hans-CN/index/...

[2]

COMPASS: Benchmarking Constrained Optimization in LLM Agents

Compass: A multi-turn benchmark for tool-mediated planning & preference optimization. arXiv preprint arXiv:2510.07043. Yincen Qu, Huan Xiao, Feng Li, Gregory Li, Hui Zhou, Xiangying Dai, and Xiaoru Dai. 2025. Tripscore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation. arXiv preprint arXiv:2510.09011. Jie-Jing Shao, Bo-W...

[3]

Kimi K2: Open Agentic Intelligence

Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534. Kaimin Wang, Yuanzhe Shen, Changze Lv, Xiaoqing Zheng, and Xuanjing Huang. 2025. TripTailor: A real-world benchmark for personalized travel planning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9705–9723, Vienna, Austria. Association for Computational ...

[4]

Qwen3 Technical Report

Qwen3 technical report. arXiv preprint arXiv:2505.09388. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations. Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, C...

[5]

map_search_places: A large-coverage POI retrieval tool that supports nationwide search in China. It can search a wide range of place types (e.g., restaurants, hotels, attractions, shopping malls, hospitals, universities, airports, and railway stations) using keywords, categories, or addresses. It supports nearby search with a configurable radius, admin...

[6]

Each category illustrates the data flow from the Original pool to the Filtered subset and the final Sampled set used in our experiments

map_compute_routes: A routing tool that com- [text interrupted by Figure 6 plot residue: panels (a) Single-Turn Task and (b) Multi-Turn Task; categories App_Int, Discovery, Dyn_Info, Plan_Dec, Rules; axis Instance Count (Log Scale)] ...

[7]

find a coffee shop that is close to my route

map_search_along_route: Searches for POIs along a planned route within a user-specified corridor. This is useful for needs such as “find a coffee shop that is close to my route” or “find a restroom near the highway on the way.” The tool first plans a base route and then returns candidate POIs that lie within the buffer region, together with detailed POI metadata

[8]

It provides three strategies: balanced (overall best trade-off), minimize maximum distance (fairness-oriented), and minimize total distance (efficiency-oriented)

map_search_central_places: Recommends convenient meeting locations for multiple participants by optimizing spatial centrality. It provides three strategies: balanced (overall best trade-off), minimize maximum distance (fairness-oriented), and minimize total distance (efficiency-oriented). This supports realistic coordination scenarios (e.g., choosing a ...

[9]

It returns ranked POIs with tags and short recommendation rationales, which is useful for recommendation-style travel planning

map_search_ranking_list: Retrieves curated local ranking lists for a given region and category (e.g., top-rated local eateries or popular attractions). It returns ranked POIs with tags and short recommendation rationales, which is useful for recommendation-style travel planning.

Travel & Transportation Tools

[10]

It supports multi-day queries to compare schedules and prices across adjacent dates

travel_search_flights: Searches domestic flight options between two cities. It supports multi-day queries to compare schedules and prices across adjacent dates. The tool returns structured flight information such as flight number, airline, departure/arrival time, aircraft type, and price ranges

[11]

It returns train number, departure/arrival stations and time, travel duration, and ticket prices

travel_search_trains: Queries train and high-speed rail schedules between cities, also supporting multi-day comparisons. It returns train number, departure/arrival stations and time, travel duration, and ticket prices.

Weather Tools

[12]

Table 7: Field definitions for a TravelBench instance

weather_current_conditions: Retrieves real-time weather conditions for a specified location, including temperature, feels-like temperature, weather phenomena, wind direction/speed, and Air Quality Index (AQI).

Table 7: Field definitions for a TravelBench instance. Field: Description. trace_id: A unique identifier for each instance. time: The timestamp of ...

[13]

Information Retrieval Tools

weather_forecast_days: Provides multi-day forecasts (up to 5 days) for a location, supporting both single-date and date-range queries.

Information Retrieval Tools

[14]

B.2 Tool-Cache Distribution Figure 7 shows the distribution of cached tool responses in the sandbox, built from the 1,100 benchmark instances

web_search: Performs open-domain web search for information beyond the scope of spatio-temporal tools, such as general facts, recent news, local regulations, and travel policies.

B.2 Tool-Cache Distribution. Figure 7 shows the distribution of cached tool responses in the sandbox, built from the 1,100 benchmark instances. The cache is dominated by POI...

[15]

It explicitly defines which information may be kept and which must be removed

Figure 9 shows the prompt used for user-profile de-identification. It explicitly defines which information may be kept and which must be removed. For personal background details, the prompt instructs the model to replace them with broad, non-identifying descriptions. We also provide an example to guide the model’s decisions, aiming to preserve user pre...

[16]

It specifies a step-by-step analysis procedure and provides an example for each outcome, helping the model make correct feasibility judgments for complex queries

Figure 10 shows the prompt for query feasibility determination. It specifies a step-by-step analysis procedure and provides an example for each outcome, helping the model make correct feasibility judgments for complex queries

[17]

Figure 11 shows the prompt for the multi-turn assistant. The agent is instructed to solve the task on its own whenever possible, ask the user questions only when key information is missing, avoid requesting the user to take actions outside the dialogue, and follow tool-use rules

[18]

Figure 12 shows the prompt for the user simulator. It enforces that the simulator replies strictly based on the provided user_profile, without inventing additional preferences, and defines clear conditions for ending the conversation

[19]

The agent is instructed to solve the task without asking clarification questions, and to follow tool-use rules

Figure 13 shows the prompt for the single-turn assistant. The agent is instructed to solve the task without asking clarification questions, and to follow tool-use rules

[20]

It is derived from the single-turn assistant prompt, with an explicit rule specifying when to output [Unsolved]

Figure 14 shows the prompt for handling infeasible queries. It is derived from the single-turn assistant prompt, with an explicit rule specifying when to output [Unsolved]

[21]

The model is instructed to follow the provided examples and generate tool outputs that are realistic and consistent in format

Figure 15 shows the prompt for the tool simulator. The model is instructed to follow the provided examples and generate tool outputs that are realistic and consistent in format

[22]

The judge first performs structured reasoning and then assigns comprehensive scores under three dimensions

Figure 16 shows the prompt for judging single-turn trajectories. The judge first performs structured reasoning and then assigns comprehensive scores under three dimensions

[23]

reason-then-score

Figure 17 shows the prompt for judging multi-turn trajectories. It extends the single-turn judging prompt by adding a user-interaction dimension, and evaluates trajectories under four dimensions with the same “reason-then-score” structure

[24]

trace_id

Figure 18 shows the prompt for the meta-judge. It asks the model to audit an existing evaluation from multiple perspectives and correct potentially biased or low-quality judgments.

An Example of Our Data (JSON Format): "trace_id": "212d7e0f17612735295674131d099a", "time": "2025-10-24 10:38:49.885", "query": "Um I I'm so sleepy I'm dying I'll j...

[25]

Extract and summarize basic information and interest preferences - Basic information: - Allowed to keep and output: resident city, administrative district code, and whether the user owns a car. - You may additionally *randomly* enrich the profile with a small amount of **broad, non-identifying** background (e.g., household size structure, lifestyl...

[26]

- Only generalized place categories or city-level descriptions may be retained

Strictly remove personal sensitive information (must delete/generalize) - Remove any fields/content that can precisely identify a person or location, including but not limited to: - Street addresses, building/unit numbers, community/compound names, road names, latitude/longitude coordinates, license plate numbers, employer/company names, etc....

[27]

Go to that mall

Allowed information boundary - Allowed: resident city and administrative district code (example: "Beijing | 110105") - Allowed: whether the user owns a car - Allowed: current location (as an immediate activity location, not treated as long-term residence privacy; still avoid unit/building/community-level details) - Allowed: place-category preferenc...

[28]

meals/rest/attractions/accommodation,

Only do what the user asked: stay strictly focused on the user's original query and any clearly provided follow-up requirements. Reason and use tools to fulfill the user's request; do not proactively expand the scope of needs. - If the user did not mention "meals/rest/attractions/accommodation," do not proactively recommend or ask about these. ...

[29]

progress

Be problem-solving oriented: every turn must make "progress" (obtain key information or produce usable results). Avoid vague advice and long re-statements

[30]

Use context before asking: never ask for information that can be obtained directly from [context]

[31]

cannot call tools / cannot produce an executable result / the user intent is unclear

Minimal questioning: only ask when you "cannot call tools / cannot produce an executable result / the user intent is unclear."

[32]

for a better experience

Stay on-topic / no scope expansion: do not add dimensions "for a better experience" (e.g., budget, taste preferences, itinerary intensity, nearby attractions) unless they directly determine the result of the current task

[33]

No repetition / no bombardment: if the user has already answered or clearly has no preference, do not ask the same dimension again; do not repeat the same process more than once

[34]

check it yourself / open an app / click a link / call / compare prices / search on a map

Converge quickly: once you have provided an executable result (directly navigable / bookable / clear next steps), stop further questioning and extra suggestions. [Assumptions About User Capability] - The user has no ability to operate tools / search / place orders: do not ask the user to "check it yourself / open an app / click a link / call / c...

[35]

Missing information would make the result non-executable or highly likely to be wrong (e.g., the user's wording is unclear); and

[36]

It cannot be resolved via context or reasonable defaults; and

[37]

plausible-sounding but unqueried

The question is directly related to the user's original query (e.g., their preference relevant to the query). Otherwise, asking is prohibited. [Tool Usage Requirements] - Use tools whenever possible: as long as the information is sufficient and there is a usable tool that can reduce uncertainty / increase truthfulness (flight/train schedules and ...

[38]

Identity and perspective: always speak as the "user"; do not refer to yourself as an AI/model/system; do not explain or mention any rules / profile sources

[39]

user profile

Faithfulness: your needs, preferences, budget, timing, transportation modes, destination inclinations, and preferences for food/accommodation/activities, etc. may only come from the ["user profile"]. Do not add settings outside the profile or infer anything on your own

[40]

natural spoken language

If it is not mentioned, it is unknown: - If the assistant asks about information/preferences/constraints that are not included in the profile, you must answer in "natural spoken language" that you do not know, and you must not add specific preferences or hard constraints, e.g., "I don't have any particular preference / anything is fine / you can ...

[41]

go check / go place an order / open some app / click a link / search it yourself

No tool capability: - You do not have any ability to search / compare prices / place orders / grab tickets / open links / search maps / call by phone. - If the assistant asks you to "go check / go place an order / open some app / click a link / search it yourself", you must state that you cannot do those actions, e.g., "I can't operate those on my side; just...

[42]

Natural dialogue: respond concisely and colloquially like a real user; when necessary, ask follow-up clarification questions that are directly related to the current plan

[43]

Consistency: once you state some information based on the profile (such as dates, budget, preferences), you must not contradict yourself later, unless the profile itself allows changes

[44]

repeated confirmations / repeated restatements / back-and-forth pleasantries

Forced convergence and ending (important): you must proactively avoid "repeated confirmations / repeated restatements / back-and-forth pleasantries". - When the travel assistant has already provided an executable plan (for example, clearly specifying: transportation / route / train or flight / hotel options / store name and address and next steps) ...

[45]

Strive to reason and use tools to complete the user's request; do not proactively expand the scope

Only do what the user asks: strictly focus on the user's original query and any explicitly added requirements in follow-up messages. Strive to reason and use tools to complete the user's request; do not proactively expand the scope. - If the user does not mention 'meals/rest/attractions/accommodation', do not proactively recommend or ask about ...

[46]

Avoid vague suggestions and long rephrasing

Be solution-oriented: every turn must produce 'progress' (obtain key information or produce a usable result). Avoid vague suggestions and long rephrasing

[48]

Do not go off-topic / do not expand: do not add new dimensions for a 'better experience' (such as budget, taste preferences, trip intensity, nearby attractions, etc.) unless it directly determines the result of the current task

[49]

travel assistant

Converge promptly: once you have provided an executable result (can navigate directly / can book / clear next step), stop immediately and do not continue asking or extending suggestions. [Assumptions About User's Ability] - The user has no ability to operate tools / search / place orders: do not ask the user to 'check it yourself / open an app / cli...

[50]

meals/rest/attractions/accommodation

Only do what the user asks: strictly focus on the user's original query and any explicitly added requirements in follow-up messages. Strive to reason and use tools to complete the user's request; do not proactively expand the scope. - If you believe there is no clear intent / key context information is missing / relevant tools are missing (whether d...

[51]

progress

Be solution-oriented: every turn must produce "progress" (obtain key information or produce a usable result). Avoid vague suggestions and long rephrasing

[52]

You may not ask the user questions: make every effort to obtain information from [context], or rely on tools to get what is necessary

[53]

better experience

Do not go off-topic / do not expand: do not add new dimensions for a "better experience" (such as budget, taste preferences, trip intensity, nearby attractions, etc.) unless it directly determines the result of the current task

[54]

check it yourself / open an app / click a link / call / compare prices / search on a map

Converge promptly: once you have provided an executable result (can navigate directly / can book / clear next step), stop immediately and do not continue asking or extending suggestions. [Assumptions About User's Ability] - The user has no ability to operate tools / search / place orders: do not ask the user to "check it yourself / open an app / clic...

work page
[55]

Based on the provided real examples, understand the tool ' s output format and content characteristics

work page
[56]

Generate reasonable simulated results based on the input parameters

work page
[57]

Ensure the output format is consistent with the examples

work page
[58]

The generated content must conform to the tool ' s business logic and real −world scenarios

work page
[59]

Directly return the simulated result ; do not add any extra notes , explanations , or markdown formatting

work page
[60]

Please generate reasonable simulated results based on the tool definition and parameters

Do not return a JSON wrapper; directly return the content that the tool itself should return EXAMPLES_SECTION_TEMPLATE = Below are {num_examples} real invocation examples for reference : SINGLE_EXAMPLE_TEMPLATE = Example {index}: Input parameters : {params} Output result : { result } NO_EXAMPLES_TEMPLATE = Note: No historical examples were found for {tool...

work page
[61]

Similar invocation parameters should produce similar simulated results

Be sure to refer to the real invocation examples; some information may come directly from the examples provided to you. Similar invocation parameters should produce similar simulated results

work page
[62]

The content must conform to the tool ' s business logic and real −world scenarios

work page
[63]

If the result is a list type , generate several reasonable entries

work page
[64]

Numerical values must be within reasonable ranges

work page
[65]

must comply with the constraints in the parameters

Times, dates , etc . must comply with the constraints in the parameters

work page
[66]

## Evaluation Objective Based on the given dialogue content , analyze the model's response along the following three dimensions:

Directly return the result content ; do not add any explanations or formatting wrappers Figure 15: Prompt Template for Tool Simulator LLM-Judge Prompt Template for Single-Turn Subtask ## Task Description Conduct a **response quality evaluation ** for a dialogue that involves tool usage, assessing the model from three core capability dimensions. ## Evaluat...

work page
[67]

**Tool Usage and Planning Capability ** − Whether the model fully understands the relationship between the user ' s request and the available toolset ; whether the tool − calling trajectory is clear , reasonable , and accurate ; and whether the tool parameters are filled in appropriately and correctly

work page
[68]

**Summarization and Extraction Capability ** − After obtaining the user ' s query and the tool function ' s returned response , whether the model can selectively extract the most critical information (such as required function parameters) based on the available and historical information , while avoiding fabricating facts or inventing data

work page
[69]

locate first / search first , then conclude

**Final Answer Description and Presentation Capability ** − After completing planning and receiving tool return results , whether the final answer presents the information relevant to the user ' s needs clearly , accurately , and concisely . ## Core Mandatory Constraints You must treat the following as **primary inspection items throughout all three evalu...

work page
[70]

estimated arrival at 15:47

The following information types **must come from tools or context **, and must not be estimated based on common knowledge or memory: − Precise times (e.g ., " estimated arrival at 15:47", " takes 6 minutes") ; − Distances , mileage, congestion length ; − Prices or fees ( taxi fare , ticket prices , airfare , tolls , etc .) ; − Real−time or date− specific ...

work page
[71]

helpful

If a tool does not return certain data , but the assistant still provides seemingly " helpful " concrete values (e.g ., "a taxi costs about 12−15 Yuan", "today is 22 Celsius and sunny"), this should be considered hallucination , not a bonus. 30

work page
[72]

fabricated parameters

Geographic / location − related hard constraints ( applicable to POI, administrative region , nearby, range/ radius searches , etc .) : − Key parameters such as center −point coordinates , radius , administrative region , and city must come from: explicit user input / existing context / tool returns ; otherwise they are considered " fabricated parameters ...

work page
[73]

**Tool Usage and Planning Ability ** − Whether the model fully understands the relationship between user needs and the toolset ; whether the tool invocation trajectory is clear , reasonable , and accurate ; whether the planning of tool calls is correct ; and whether tool parameters are filled in appropriately and correctly

work page
[74]

**Summarization and Extraction Ability ** − After obtaining the user ' s query and tool function responses , whether the model can selectively extract the most important information (e.g ., required function parameters) based on available and historical information , avoiding arbitrary fabrication of facts or data

work page
[75]

**Final Answer Description and Presentation Ability ** − After completing planning and receiving tool results , whether the final response clearly , accurately , and concisely presents content relevant to the user ' s needs, and whether it appropriately interacts with or provides feedback to the user (e.g ., requesting more precise information or suggesti...

work page
[76]

locate / search before concluding

**User Interaction and Follow−up Ability ** − When information is insufficient , ambiguous, or tool results are abnormal, whether the model can ask necessary and high−value questions with minimal user disruption ; whether it prioritizes inference or tool usage to supplement information ; and whether follow−up questions stay aligned with the user ' s origi...

work page
[77]

estimated arrival at 15:47

The following information types must come from tools or context and must not be estimated from common knowledge or memory: * Precise times (e.g ., " estimated arrival at 15:47", " takes 6 minutes") ; * Distances , mileage, congestion lengths ; * Prices or costs ( taxi fares , tickets , airfares , tolls , etc .) ; * Real−time or date− specific weather, tem...

work page
[78]

helpful

If tools do not return certain data , but the assistant provides seemingly " helpful " specific numbers (e. g ., " taxi costs about 12−15 Yuan", "today is sunny, 22 Celsius ") , this should be treated as hallucination , not a bonus. 34

work page
[79]

within / outside X km

Geographic/ location hard constraints ( applicable to POI, administrative areas , nearby/ radius searches ) : * Center coordinates , radius , administrative areas , cities must come from explicit user input , existing context , or tool responses ; otherwise , they are considered fabricated parameters . * It is not allowed to assert "within / outside X km"...

work page
[80]

**Scoring Accuracy**: Whether the ratings for each dimension genuinely reflect the model's actual performance in the conversation ; whether there is any obvious overestimation or underestimation . 37

work page
[81]

**Reasoning and Evidence Chain**: Whether the evaluation rationale is logically clear, traceable, and grounded in key evidence (conversation turns / tool calls / tool outputs), rather than vague or generic judgments.

Showing first 80 references.

[1] Deeptravel: An end-to-end agentic reinforcement learning framework for autonomous travel planning agents. arXiv preprint arXiv:2509.21842. Juhyun Oh, Eunsu Kim, and Alice Oh. 2025. Flex-travelplanner: A benchmark for flexible planning with language agents. arXiv preprint arXiv:2506.04649. OpenAI. 2025. GPT-5.1. https://openai.com/zh-Hans-CN/index/...

[2] Compass: A multi-turn benchmark for tool-mediated planning & preference optimization. arXiv preprint arXiv:2510.07043. Yincen Qu, Huan Xiao, Feng Li, Gregory Li, Hui Zhou, Xiangying Dai, and Xiaoru Dai. 2025. Tripscore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation. arXiv preprint arXiv:2510.09011. Jie-Jing Shao, Bo-W...

[3] Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534. Kaimin Wang, Yuanzhe Shen, Changze Lv, Xiaoqing Zheng, and Xuanjing Huang. 2025. TripTailor: A real-world benchmark for personalized travel planning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9705–9723, Vienna, Austria. Association for Computational ...

[4] Qwen3 technical report. arXiv preprint arXiv:2505.09388. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations. Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, C...

[5] map_search_places: A large-coverage POI retrieval tool that supports nationwide search in China. It can search a wide range of place types (e.g., restaurants, hotels, attractions, shopping malls, hospitals, universities, airports, and railway stations) using keywords, categories, or addresses. It supports nearby search with a configurable radius, admin...

[6] map_compute_routes: A routing tool that com...

[Figure: instance counts (log scale) per category (App_Int, Discovery, Dyn_Info, Plan_Dec, Rules), with panels (a) Single-Turn Task, (b) Multi-Turn Task, ... Each category illustrates the data flow from the Original pool to the Filtered subset and the final Sampled set used in our experiments.]

[7] map_search_along_route: Searches for POIs along a planned route within a user-specified corridor. This is useful for needs such as “find a coffee shop that is close to my route” or “find a restroom near the highway on the way.” The tool first plans a base route and then returns candidate POIs that lie within the buffer region, together with detailed POI metadata.

[8] map_search_central_places: Recommends convenient meeting locations for multiple participants by optimizing spatial centrality. It provides three strategies: balanced (overall best trade-off), minimize maximum distance (fairness-oriented), and minimize total distance (efficiency-oriented). This supports realistic coordination scenarios (e.g., choosing a ...
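The three strategies named for map_search_central_places can be illustrated with a small sketch. This is not the tool's implementation: the function name `pick_meeting_place`, the strategy keys, the planar Euclidean distance, and the equal weighting used for "balanced" are all assumptions made for illustration.

```python
import math


def pick_meeting_place(participants, candidates, strategy="balanced"):
    """Pick the candidate point that minimizes a strategy-specific cost.

    participants, candidates: lists of (x, y) tuples on a plane (an
    assumption; a real tool would use road or geodesic distances).
    """
    def cost(candidate):
        distances = [math.dist(p, candidate) for p in participants]
        if strategy == "min_max":      # fairness-oriented
            return max(distances)
        if strategy == "min_total":    # efficiency-oriented
            return sum(distances)
        if strategy == "balanced":     # hypothetical 50/50 blend of both
            return 0.5 * max(distances) + 0.5 * sum(distances) / len(distances)
        raise ValueError(f"unknown strategy: {strategy}")

    return min(candidates, key=cost)


participants = [(0.0, 0.0), (0.0, 0.0), (10.0, 0.0)]
candidates = [(0.0, 0.0), (5.0, 0.0)]
print(pick_meeting_place(participants, candidates, "min_total"))
print(pick_meeting_place(participants, candidates, "min_max"))
```

With two participants at the origin and one far away, the efficiency-oriented strategy favors the origin while the fairness-oriented strategy favors the midpoint, which is the trade-off the tool description points to.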

[9] map_search_ranking_list: Retrieves curated local ranking lists for a given region and category (e.g., top-rated local eateries or popular attractions). It returns ranked POIs with tags and short recommendation rationales, which is useful for recommendation-style travel planning.

Travel & Transportation Tools

[10] travel_search_flights: Searches domestic flight options between two cities. It supports multi-day queries to compare schedules and prices across adjacent dates. The tool returns structured flight information such as flight number, airline, departure/arrival time, aircraft type, and price ranges.

[11] travel_search_trains: Queries train and high-speed rail schedules between cities, also supporting multi-day comparisons. It returns train number, departure/arrival stations and time, travel duration, and ticket prices.

Weather Tools

[12] weather_current_conditions: Retrieves real-time weather conditions for a specified location, including temperature, feels-like temperature, weather phenomena, wind direction/speed, and Air Quality Index (AQI).

Table 7: Field definitions for a TravelBench instance.
Field | Description
trace_id | A unique identifier for each instance.
time | The timestamp of ...

[13] weather_forecast_days: Provides multi-day forecasts (up to 5 days) for a location, supporting both single-date and date-range queries.

Information Retrieval Tools

[14] web_search: Performs open-domain web search for information beyond the scope of spatio-temporal tools, such as general facts, recent news, local regulations, and travel policies.

B.2 Tool-Cache Distribution

Figure 7 shows the distribution of cached tool responses in the sandbox, built from the 1,100 benchmark instances. The cache is dominated by POI...
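A minimal sketch of how a sandbox could serve cached tool responses is given below. The class name `ToolSandbox`, the key format, the parameter names in the example, and the fallback to a simulator on a cache miss are assumptions for illustration; the paper's actual cache layout is not shown in this section.

```python
import json


def cache_key(tool_name, params):
    """Build a deterministic key from the tool name and its parameters."""
    return tool_name + "|" + json.dumps(params, sort_keys=True, ensure_ascii=False)


class ToolSandbox:
    def __init__(self, cache, simulator):
        self.cache = dict(cache)    # key -> cached real tool response
        self.simulator = simulator  # hypothetical fallback: (tool_name, params) -> str

    def call(self, tool_name, params):
        key = cache_key(tool_name, params)
        if key in self.cache:
            return self.cache[key]
        # Cache miss: ask the simulator, then store the result so that
        # repeating the same call returns the same response.
        result = self.simulator(tool_name, params)
        self.cache[key] = result
        return result


sandbox = ToolSandbox(
    {cache_key("weather_current_conditions", {"location": "Beijing"}): "cached response"},
    lambda tool_name, params: "simulated response",
)
print(sandbox.call("weather_current_conditions", {"location": "Beijing"}))
```

Sorting the parameter keys before serialization means that two calls with the same arguments in a different order hit the same cache entry.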

[15] Figure 9 shows the prompt used for user-profile de-identification. It explicitly defines which information may be kept and which must be removed. For personal background details, the prompt instructs the model to replace them with broad, non-identifying descriptions. We also provide an example to guide the model’s decisions, aiming to preserve user pre...

[16] Figure 10 shows the prompt for query feasibility determination. It specifies a step-by-step analysis procedure and provides an example for each outcome, helping the model make correct feasibility judgments for complex queries.

[17] Figure 11 shows the prompt for the multi-turn assistant. The agent is instructed to solve the task on its own whenever possible, ask the user questions only when key information is missing, avoid requesting the user to take actions outside the dialogue, and follow tool-use rules.

[18] Figure 12 shows the prompt for the user simulator. It enforces that the simulator replies strictly based on the provided user_profile, without inventing additional preferences, and defines clear conditions for ending the conversation.

[19] Figure 13 shows the prompt for the single-turn assistant. The agent is instructed to solve the task without asking clarification questions, and to follow tool-use rules.

[20] Figure 14 shows the prompt for handling infeasible queries. It is derived from the single-turn assistant prompt, with an explicit rule specifying when to output [Unsolved].

[21] Figure 15 shows the prompt for the tool simulator. The model is instructed to follow the provided examples and generate tool outputs that are realistic and consistent in format.

[22] Figure 16 shows the prompt for judging single-turn trajectories. The judge first performs structured reasoning and then assigns comprehensive scores under three dimensions.

[23] Figure 17 shows the prompt for judging multi-turn trajectories. It extends the single-turn judging prompt by adding a user-interaction dimension, and evaluates trajectories under four dimensions with the same “reason-then-score” structure.

[24] Figure 18 shows the prompt for the meta-judge. It asks the model to audit an existing evaluation from multiple perspectives and correct potentially biased or low-quality judgments.

An Example of Our Data (JSON Format)
"trace_id": "212d7e0f17612735295674131d099a", "time": "2025-10-24 10:38:49.885", "query": "Um I I'm so sleepy I'm dying I'll j...

[25] Extract and summarize basic information and interest preferences
- Basic information:
- Allowed to keep and output: resident city, administrative district code, and whether the user owns a car.
- You may additionally *randomly* enrich the profile with a small amount of **broad, non-identifying** background (e.g., household size structure, lifestyl...

[26] Strictly remove personal sensitive information (must delete / generalize)
- Remove any fields / content that can precisely identify a person or location, including but not limited to:
- Street addresses, building / unit numbers, community/compound names, road names, latitude / longitude coordinates, license plate numbers, employer/company names, etc....
- Only generalized place categories or city-level descriptions may be retained

[27] Allowed information boundary
- Allowed: resident city and administrative district code (example: "Beijing | 110105")
- Allowed: whether the user owns a car
- Allowed: current location (as an immediate activity location, not treated as long-term residence privacy; still avoid unit / building / community-level details)
- Allowed: place-category preferenc... "Go to that mall" ...

[28] Only do what the user asked: stay strictly focused on the user's original query and any clearly provided follow-up requirements. Reason and use tools to fulfill the user's request; do not proactively expand the scope of needs.
- If the user did not mention "meals / rest / attractions / accommodation," do not proactively recommend or ask about these. ...

[29] Be problem-solving oriented: every turn must make "progress" (obtain key information or produce usable results). Avoid vague advice and long re-statements.

[30] Use context before asking: never ask for information that can be obtained directly from [context].

[31] Minimal questioning: only ask when you "cannot call tools / cannot produce an executable result / the user intent is unclear."

[32] Stay on-topic / no scope expansion: do not add dimensions "for a better experience" (e.g., budget, taste preferences, itinerary intensity, nearby attractions) unless they directly determine the result of the current task.

[33] No repetition / no bombardment: if the user has already answered or clearly has no preference, do not ask the same dimension again; do not repeat the same process more than once.

[34] Converge quickly: once you have provided an executable result (directly navigable / bookable / clear next steps), stop further questioning and extra suggestions.
[Assumptions About User Capability]
- The user has no ability to operate tools / search / place orders: do not ask the user to "check it yourself / open an app / click a link / call / compare prices / search on a map" ...

[35] Missing information would make the result non-executable or highly likely to be wrong (e.g., the user's wording is unclear); and

[36] It cannot be resolved via context or reasonable defaults; and

[37] The question is directly related to the user's original query (e.g., their preference relevant to the query). Otherwise, asking is prohibited.
[Tool Usage Requirements]
- Use tools whenever possible: as long as the information is sufficient and there is a usable tool that can reduce uncertainty / increase truthfulness (flight / train schedules and ... "plausible-sounding but unqueried" ...
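The three conditions in items [35] to [37] are joined by "and", so a clarification question is permitted only when all of them hold. A minimal sketch of that gate follows; the class and field names are hypothetical and chosen only to mirror the wording of the prompt.

```python
from dataclasses import dataclass


@dataclass
class QuestionCandidate:
    # [35] Missing information would make the result non-executable or likely wrong.
    blocks_executable_result: bool
    # [36] negated: the gap CAN be filled from context or a reasonable default.
    resolvable_from_context_or_defaults: bool
    # [37] The question relates directly to the user's original query.
    related_to_original_query: bool


def may_ask_user(q: QuestionCandidate) -> bool:
    """Return True only if all three prompt conditions are satisfied."""
    return (
        q.blocks_executable_result
        and not q.resolvable_from_context_or_defaults
        and q.related_to_original_query
    )


# A budget question that does not block the result must not be asked.
print(may_ask_user(QuestionCandidate(False, False, True)))  # False
```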

[38] Identity and perspective: always speak as the "user"; do not refer to yourself as an AI/model/system; do not explain or mention any rules / profile sources.

[39] Faithfulness: your needs, preferences, budget, timing, transportation modes, destination inclinations, and preferences for food/accommodation/activities, etc. may only come from the ["user profile"]. Do not add settings outside the profile or infer anything on your own.

[40] If it is not mentioned, it is unknown:
- If the assistant asks about information / preferences / constraints that are not included in the profile, you must answer in "natural spoken language" that you do not know, and you must not add specific preferences or hard constraints, e.g., "I don't have any particular preference / anything is fine / you can ...

[41] No tool capability:
- You do not have any ability to search / compare prices / place orders / grab tickets / open links / search maps / call by phone.
- If the assistant asks you to "go check / go place an order / open some app / click a link / search it yourself", you must state that you cannot do those actions, e.g., "I can't operate those on my side; just...

[42] Natural dialogue: respond concisely and colloquially like a real user; when necessary, ask follow-up clarification questions that are directly related to the current plan.

[43] Consistency: once you state some information based on the profile (such as dates, budget, preferences), you must not contradict yourself later, unless the profile itself allows changes.

[44] Forced convergence and ending (important): you must proactively avoid "repeated confirmations / repeated restatements / back-and-forth pleasantries".
- When the travel assistant has already provided an executable plan (for example, clearly specifying: transportation / route / train or flight / hotel options / store name and address and next steps) ...

[45] Only do what the user asks: strictly focus on the user's original query and any explicitly added requirements in follow-up messages. Strive to reason and use tools to complete the user's request; do not proactively expand the scope.
- If the user does not mention 'meals / rest / attractions / accommodation', do not proactively recommend or ask about ...

[46] Be solution-oriented: every turn must produce 'progress' (obtain key information or produce a usable result). Avoid vague suggestions and long rephrasing.

[48] Do not go off-topic / do not expand: do not add new dimensions for a 'better experience' (such as budget, taste preferences, trip intensity, nearby attractions, etc.) unless it directly determines the result of the current task.

[49] Converge promptly: once you have provided an executable result (can navigate directly / can book / clear next step), stop immediately and do not continue asking or extending suggestions.
[Assumptions About User's Ability]
- The user has no ability to operate tools / search / place orders: do not ask the user to 'check it yourself / open an app / cli... "travel assistant" ...

[50] Only do what the user asks: strictly focus on the user's original query and any explicitly added requirements in follow-up messages. Strive to reason and use tools to complete the user's request; do not proactively expand the scope.
- If you believe there is no clear intent / key context information is missing / relevant tools are missing (whether d... "meals / rest / attractions / accommodation" ...

[51] Be solution-oriented: every turn must produce "progress" (obtain key information or produce a usable result). Avoid vague suggestions and long rephrasing.

[52] You may not ask the user questions: make every effort to obtain information from [context], or rely on tools to get what is necessary.

[53] Do not go off-topic / do not expand: do not add new dimensions for a "better experience" (such as budget, taste preferences, trip intensity, nearby attractions, etc.) unless it directly determines the result of the current task.

[54] Converge promptly: once you have provided an executable result (can navigate directly / can book / clear next step), stop immediately and do not continue asking or extending suggestions.
[Assumptions About User's Ability]
- The user has no ability to operate tools / search / place orders: do not ask the user to "check it yourself / open an app / click a link / call / compare prices / search on a map" ...

[55] Based on the provided real examples, understand the tool's output format and content characteristics.

[56] Generate reasonable simulated results based on the input parameters.

[57] Ensure the output format is consistent with the examples.

[58] The generated content must conform to the tool's business logic and real-world scenarios.

[59] Directly return the simulated result; do not add any extra notes, explanations, or markdown formatting.

[60] Do not return a JSON wrapper; directly return the content that the tool itself should return.
EXAMPLES_SECTION_TEMPLATE = Below are {num_examples} real invocation examples for reference:
SINGLE_EXAMPLE_TEMPLATE = Example {index}: Input parameters: {params} Output result: {result}
NO_EXAMPLES_TEMPLATE = Note: No historical examples were found for {tool...
Please generate reasonable simulated results based on the tool definition and parameters
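The templates above suggest how the examples section of the tool-simulator prompt might be assembled. The sketch below is an assumption-laden illustration: the function name `build_examples_section` is hypothetical, the line breaks inside the templates are a guess, and because NO_EXAMPLES_TEMPLATE is truncated in the source, the placeholder name `{tool_name}` and the end of that sentence are stand-ins rather than the original wording.

```python
import json

EXAMPLES_SECTION_TEMPLATE = "Below are {num_examples} real invocation examples for reference:"
SINGLE_EXAMPLE_TEMPLATE = "Example {index}:\nInput parameters: {params}\nOutput result: {result}"
# Truncated in the source; placeholder name and ending are assumptions.
NO_EXAMPLES_TEMPLATE = "Note: No historical examples were found for {tool_name}."


def build_examples_section(tool_name, examples):
    """Render cached (params, result) pairs as few-shot examples.

    examples: list of (params_dict, result_text) tuples taken from real
    invocations of the same tool.
    """
    if not examples:
        return NO_EXAMPLES_TEMPLATE.format(tool_name=tool_name)
    parts = [EXAMPLES_SECTION_TEMPLATE.format(num_examples=len(examples))]
    for index, (params, result) in enumerate(examples, start=1):
        parts.append(
            SINGLE_EXAMPLE_TEMPLATE.format(
                index=index,
                params=json.dumps(params, ensure_ascii=False, sort_keys=True),
                result=result,
            )
        )
    return "\n".join(parts)


print(build_examples_section("web_search", [({"query": "visa policy"}, "result text")]))
```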

[61] Be sure to refer to the real invocation examples; some information may come directly from the examples provided to you. Similar invocation parameters should produce similar simulated results.

[62] The content must conform to the tool's business logic and real-world scenarios.

[63] If the result is a list type, generate several reasonable entries.

[64] Numerical values must be within reasonable ranges.

[65] Times, dates, etc. must comply with the constraints in the parameters.

[66] Directly return the result content; do not add any explanations or formatting wrappers.

Figure 15: Prompt Template for Tool Simulator

LLM-Judge Prompt Template for Single-Turn Subtask

## Task Description
Conduct a **response quality evaluation** for a dialogue that involves tool usage, assessing the model from three core capability dimensions.

## Evaluation Objective
Based on the given dialogue content, analyze the model's response along the following three dimensions:

[67] **Tool Usage and Planning Capability** - Whether the model fully understands the relationship between the user's request and the available toolset; whether the tool-calling trajectory is clear, reasonable, and accurate; and whether the tool parameters are filled in appropriately and correctly.

[68] **Summarization and Extraction Capability** - After obtaining the user's query and the tool function's returned response, whether the model can selectively extract the most critical information (such as required function parameters) based on the available and historical information, while avoiding fabricating facts or inventing data.

[69] **Final Answer Description and Presentation Capability** - After completing planning and receiving tool return results, whether the final answer presents the information relevant to the user's needs clearly, accurately, and concisely.

## Core Mandatory Constraints
You must treat the following as **primary inspection items throughout all three evalu... "locate first / search first, then conclude" ...

[70] The following information types **must come from tools or context**, and must not be estimated based on common knowledge or memory:
- Precise times (e.g., "estimated arrival at 15:47", "takes 6 minutes");
- Distances, mileage, congestion length;
- Prices or fees (taxi fare, ticket prices, airfare, tolls, etc.);
- Real-time or date-specific ...

[71] If a tool does not return certain data, but the assistant still provides seemingly "helpful" concrete values (e.g., "a taxi costs about 12-15 Yuan", "today is 22 Celsius and sunny"), this should be considered hallucination, not a bonus.
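The grounding rule in items [70] and [71] is applied by an LLM judge in the benchmark. As a rough illustration of the idea only, a simple literal check can flag numeric values in an answer that never appear in any tool output or context string. The function name `ungrounded_numbers` and the regular expression are assumptions, and this heuristic is far weaker than the judge described in the prompt.

```python
import re

# Matches integers, decimals such as 8.2, and clock times such as 15:47.
NUMBER = re.compile(r"\d+(?:[.:]\d+)?")


def ungrounded_numbers(answer, sources):
    """Return numeric tokens in `answer` that occur in none of `sources`.

    sources: list of strings (tool outputs and dialogue context).
    Matching is literal, so "12" and "12.0" count as different tokens.
    """
    grounded = set()
    for text in sources:
        grounded.update(NUMBER.findall(text))
    return [token for token in NUMBER.findall(answer) if token not in grounded]


answer = "Estimated arrival at 15:47; a taxi costs about 12 Yuan."
sources = ["route result: arrival_time=15:47, distance_km=8.2"]
print(ungrounded_numbers(answer, sources))  # ['12']
```

In this example the arrival time is backed by a tool output, while the taxi fare is not, which is the kind of seemingly helpful value the prompt tells the judge to treat as hallucination.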

[71] [72]

fabricated parameters

Geographic / location − related hard constraints ( applicable to POI, administrative region , nearby, range/ radius searches , etc .) : − Key parameters such as center −point coordinates , radius , administrative region , and city must come from: explicit user input / existing context / tool returns ; otherwise they are considered " fabricated parameters ...

work page

[72] [73]

**Tool Usage and Planning Ability ** − Whether the model fully understands the relationship between user needs and the toolset ; whether the tool invocation trajectory is clear , reasonable , and accurate ; whether the planning of tool calls is correct ; and whether tool parameters are filled in appropriately and correctly

work page

[73] [74]

**Summarization and Extraction Ability ** − After obtaining the user ' s query and tool function responses , whether the model can selectively extract the most important information (e.g ., required function parameters) based on available and historical information , avoiding arbitrary fabrication of facts or data

**Final Answer Description and Presentation Ability** - After completing planning and receiving tool results, whether the final response clearly, accurately, and concisely presents content relevant to the user's needs, and whether it appropriately interacts with or provides feedback to the user (e.g., requesting more precise information or suggesti...

**User Interaction and Follow-up Ability** - When information is insufficient, ambiguous, or tool results are abnormal, whether the model can ask necessary and high-value questions with minimal user disruption; whether it prioritizes inference or tool usage to supplement information; and whether follow-up questions stay aligned with the user's origi...

The following information types must come from tools or context and must not be estimated from common knowledge or memory:
* Precise times (e.g., "estimated arrival at 15:47", "takes 6 minutes");
* Distances, mileage, congestion lengths;
* Prices or costs (taxi fares, tickets, airfares, tolls, etc.);
* Real-time or date-specific weather, tem...

If tools do not return certain data, but the assistant provides seemingly "helpful" specific numbers (e.g., "taxi costs about 12-15 Yuan", "today is sunny, 22 Celsius"), this should be treated as hallucination, not a bonus.

Geographic/location hard constraints (applicable to POI, administrative areas, nearby/radius searches):
* Center coordinates, radius, administrative areas, cities must come from explicit user input, existing context, or tool responses; otherwise, they are considered fabricated parameters.
* It is not allowed to assert "within / outside X km"...
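
The provenance requirement for geographic parameters can be sketched as a simple trace check. This is a rough illustration under stated assumptions, not the benchmark's implementation: the argument names in `GEO_KEYS`, the function `fabricated_geo_params`, and the sample values are all hypothetical, and the evidence (user turns, context, earlier tool responses) is assumed to be available as plain strings.

```python
from typing import Any

# Hypothetical argument names; the benchmark's actual tool schema may differ.
GEO_KEYS = ("center", "radius", "city", "region")


def fabricated_geo_params(call_args: dict[str, Any], evidence: list[str]) -> list[str]:
    """Return geographic argument names whose values appear in no evidence text."""
    flagged = []
    for key in GEO_KEYS:
        if key not in call_args:
            continue
        value = str(call_args[key])
        # A value counts as grounded only if it occurs verbatim in some evidence.
        if not any(value in text for text in evidence):
            flagged.append(key)
    return flagged


# Made-up evidence and tool-call arguments, for illustration only.
evidence = [
    "Find cafes near West Lake in Hangzhou",
    '{"name": "West Lake", "location": "120.1485,30.2421"}',
]
args = {"center": "120.1485,30.2421", "radius": 3000, "city": "Hangzhou"}
flagged = fabricated_geo_params(args, evidence)  # the radius was never stated
```

Verbatim substring matching will miss legitimately derived values (for example, a radius the user gave in kilometers but the tool expects in meters), so a flagged parameter should be read as a candidate for inspection.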

**Scoring Accuracy**: Whether the ratings for each dimension genuinely reflect the model's actual performance in the conversation; whether there is any obvious overestimation or underestimation.

**Reasoning and Evidence Chain**: Whether the evaluation rationale is logically clear, traceable, and grounded in key evidence (conversation turns / tool calls / tool outputs), rather than vague or generic judgments.
