arxiv: 2510.02837 · v2 · pith:OLAS332T · submitted 2025-10-03 · 💻 cs.AI · cs.CL

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park

Pith reviewed 2026-05-18 10:57 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords tool-augmented agents · reasoning trajectories · reference-free evaluation · evidence bank · multi-dimensional evaluation · meta-evaluation · LLM agents

The pith

TRACE uses an evidence bank built from prior steps to evaluate the full reasoning trajectories of tool-augmented agents across multiple dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tool-augmented agents face complex requests, yet evaluation often stops at whether the final answer matches, ignoring the path taken. The paper introduces TRACE to judge those full paths on efficiency, hallucination, and adaptivity without needing a complete reference trajectory. It does so by building an evidence bank that collects relevant information from the agent's earlier steps. A new dataset of trajectories with known flaws and expert labels shows that small open-source models can run TRACE successfully. When the framework is applied to agents solving actual tasks, it brings to light observations that were not visible before.

Core claim

This work introduces TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

What carries the argument

The evidence bank, which accumulates knowledge from preceding steps to support reference-free scoring of reasoning quality along dimensions such as efficiency, hallucination, and adaptivity.

If this is right

Evaluation of tool-augmented agents can now account for process qualities in addition to final-answer accuracy.
Costly annotation of all possible ground-truth trajectories is no longer required for multi-dimensional assessment.
Small open-source LLMs become viable for performing reliable trajectory evaluations.
New observations about agent performance on tool tasks become accessible through automated analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Trajectory-level feedback could be used to train or refine agents more effectively than outcome-only rewards.
The evidence-bank idea might extend to evaluating step-by-step reasoning in other LLM applications where complete references are unavailable.
Developers could integrate TRACE into agent benchmarks to track improvements in reasoning habits over time.

Load-bearing premise

The reference-free evidence bank built from preceding steps can reliably capture multi-dimensional aspects of reasoning quality such as efficiency, hallucination, and adaptivity without access to exhaustive ground-truth trajectories.

What would settle it

Human raters scoring the same trajectories on efficiency, hallucination, and adaptivity and finding low correlation with the scores produced by TRACE would show that the framework does not accurately evaluate the trajectories.
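
The check described above can be sketched as a rank correlation between human ratings and TRACE scores for the same trajectories. This is a minimal sketch, not the paper's evaluation code: the choice of Spearman correlation, the function names, and the score values below are all assumptions made for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five trajectories (illustrative values only).
human_scores = [1, 2, 3, 4, 5]
trace_scores = [0.1, 0.3, 0.2, 0.8, 0.9]
rho = spearman(human_scores, trace_scores)
```

A value of `rho` near zero across a large, representative sample would be the kind of outcome that counts against the framework; which threshold counts as "low" is a judgment the page does not specify.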

Figures

Figures reproduced from arXiv: 2510.02837 by Chanyoung Park, Dongha Lee, Sangwu Park, Sein Kim, Wonjoong Kim, Yeonjun In.

**Figure 2.** Tool outputs are stored in the evidence bank, which is used to detect hallucinations in each …

**Figure 3.** Time Efficiency Comparison of LLM Evaluators using TRACE on Meta-GTA dataset. We find that most LLM models can effectively evaluate the efficiency, hallucination, and adaptivity of tool-augmented agents without relying on ground-truth trajectories when using TRACE. Notably, this demonstrates that evaluation is sufficiently achievable with open-source models, avoiding the evaluation costs associated with …

**Figure 4.** Model accuracy based on the number of tokens used and dialogue turns. In this section, we analyze token usage as a potential cause of performance degradation in tool-augmented agents …

**Figure 5.** Case study: Both agents are correct but trajectory efficiency is different in GPT-4.1 and …

**Figure 6.** Case study: Both agents (GPT-4.1 and Qwen-72B) are incorrect but GPT-4.1 trajectory …

**Figure 7.** Prompt for generating multiple ground-truth paths.

**Figure 8.** Prompt for generating inefficiency in the Meta evaluation dataset.

**Figure 9.** Prompt for generating hallucinations in the Meta evaluation dataset.

**Figure 10.** Prompt for generating adaptivity in the Meta evaluation dataset.

**Figure 11.** Prompt for validating augmented multiple ground-truth paths.

**Figure 12.** Prompt for validating augmented inefficiency in the Meta evaluation dataset.

**Figure 13.** Prompt for validating augmented hallucinations in the Meta evaluation dataset.

**Figure 14.** Prompt for validating augmented adaptivity in the Meta evaluation dataset.

**Figure 15.** TRACE Prompt for evaluating inefficiency.

**Figure 16.** TRACE Prompt for evaluating hallucinations.

**Figure 17.** TRACE Prompt for evaluating adaptivity.

**Figure 18.** Prompt provided to LLM for the trajectory generation.

**Figure 19.** Formatting prompt for correcting LLM outputs.

Original abstract

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivity. The most straightforward method for evaluation is to compare an agent's trajectory with the ground-truth, but annotating all valid ground-truth trajectories is prohibitively expensive. In this manner, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TRACE gives a reference-free way to score full agent trajectories on efficiency, hallucination and adaptivity, but the evidence bank from prior steps risks missing propagated errors.

The main thing here is that they built TRACE to judge tool-augmented agent paths on more than just the final answer. They accumulate an evidence bank from the agent's own earlier steps and use it to rate efficiency, hallucination, and adaptivity without needing every possible ground-truth trajectory. They also released a meta-evaluation dataset with labeled flawed trajectories to test the approach, and they report that small open-source models can run the evaluation.

Referee Report

1 major / 3 minor

Summary. The paper introduces TRACE, a reference-free framework for multi-dimensional evaluation of tool-augmented LLM reasoning trajectories. TRACE builds an evidence bank by accumulating knowledge from an agent's preceding steps to score dimensions including efficiency, hallucination, and adaptivity, avoiding the need for exhaustive ground-truth trajectory annotations. A new meta-evaluation dataset containing diverse and flawed trajectories, each with multi-faceted performance labels, is created to validate the approach. Experiments show that TRACE achieves accurate evaluations even when using small open-source LLMs as evaluators. The framework is further applied to trajectories from agents solving tool-augmented tasks, yielding previously unreported observations and insights.

Significance. If the central results hold, TRACE offers a practical, scalable alternative to answer-matching or full ground-truth comparison for assessing agent trajectories in complex tool-use settings. The meta-evaluation dataset constitutes a reusable contribution for future work on trajectory quality, and the application to real agents surfaces actionable insights about efficiency and error patterns. The demonstration that small open-source models suffice for evaluation lowers barriers to adoption.

major comments (1)

§3 (Framework): The evidence bank is constructed solely from the agent's own preceding steps without external verification. For hallucination assessment this creates a direct risk that an erroneous fact introduced at step t is retained as valid evidence for steps t+1 onward, causing the evaluator to rate the overall trajectory as coherent rather than penalizing the initial error. The meta-evaluation dataset contains labeled flawed trajectories, yet the paper does not report a targeted ablation or error-propagation analysis that isolates cases where early hallucinations affect later evidence-bank entries. Because the central claim is that TRACE 'accurately evaluates complex trajectories' in a reference-free manner, this gap is load-bearing and requires either additional experiments or explicit qualification of the method's robustness limits.
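
The propagation risk raised in this comment can be illustrated with a toy model. This is a sketch under loud assumptions: claims are plain strings, "support" is exact set membership rather than an LLM judgment, and the `trust_thoughts` flag is a hypothetical switch that decides whether the agent's own assertions enter the bank. Whether TRACE admits agent thoughts into its evidence bank is not established on this page; the Figure 2 caption mentions tool outputs only.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Step:
    thought_claims: Set[str]  # facts the agent asserts in its thought
    tool_facts: Set[str]      # facts returned by the tool call at this step


def flag_hallucinations(steps: List[Step], trust_thoughts: bool) -> List[int]:
    """Return the 1-based indices of steps whose thought asserts a claim
    that is absent from the bank built from earlier steps."""
    bank: Set[str] = set()
    flagged: List[int] = []
    for t, step in enumerate(steps, start=1):
        if not step.thought_claims <= bank:  # some claim is unsupported
            flagged.append(t)
        bank |= step.tool_facts
        if trust_thoughts:
            # Hypothetical policy: the agent's own claims become evidence.
            bank |= step.thought_claims
    return flagged


unsupported = "population of paris is 5 million"  # never returned by any tool
steps = [
    Step(thought_claims=set(), tool_facts={"paris is the capital of france"}),
    Step(thought_claims={unsupported}, tool_facts=set()),
    Step(thought_claims={unsupported}, tool_facts=set()),
]

with_propagation = flag_hallucinations(steps, trust_thoughts=True)
tool_outputs_only = flag_hallucinations(steps, trust_thoughts=False)
```

In this toy, admitting thoughts into the bank flags step 2 only, because the unsupported claim then counts as evidence at step 3. Restricting the bank to tool outputs flags both steps. An error-propagation ablation of the kind the referee requests would measure the same gap with the real evaluator.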

minor comments (3)

Abstract: The statement 'our results confirm that TRACE accurately evaluates...' is presented without any numerical summary (e.g., accuracy, correlation, or inter-annotator agreement figures). Adding one or two key quantitative results would make the abstract self-contained.
§4 (Meta-evaluation dataset): The process by which the multi-faceted performance labels were assigned (human annotators, external knowledge sources, or LLM-assisted) is not described in sufficient detail to allow readers to assess potential label bias relative to TRACE's reference-free operation.
Figure 2 / Table 1: Axis labels and legend entries use abbreviations (e.g., 'Eff', 'Hall') without an explicit key in the caption; this reduces immediate readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comment on the evidence bank construction and error propagation below, and have revised the manuscript to incorporate additional analysis and qualifications as described.

Referee: §3 (Framework): The evidence bank is constructed solely from the agent's own preceding steps without external verification. For hallucination assessment this creates a direct risk that an erroneous fact introduced at step t is retained as valid evidence for steps t+1 onward, causing the evaluator to rate the overall trajectory as coherent rather than penalizing the initial error. The meta-evaluation dataset contains labeled flawed trajectories, yet the paper does not report a targeted ablation or error-propagation analysis that isolates cases where early hallucinations affect later evidence-bank entries. Because the central claim is that TRACE 'accurately evaluates complex trajectories' in a reference-free manner, this gap is load-bearing and requires either additional experiments or explicit qualification of the method's robustness limits.

Authors: We appreciate the referee highlighting this important consideration for our reference-free approach. We agree that constructing the evidence bank exclusively from preceding agent steps introduces a plausible risk of propagating early hallucinations, which could affect coherence assessments if not properly mitigated. At the same time, the meta-evaluation dataset was explicitly designed to include diverse flawed trajectories (with expert multi-faceted labels covering hallucination cases), and the strong alignment between TRACE scores and these labels (Section 4) provides empirical support that the framework does not simply overlook such errors. To directly respond to the concern, the revised manuscript now includes a targeted error-propagation ablation. We isolated trajectories with labeled early hallucinations, systematically varied the evidence bank contents, and measured effects on hallucination, efficiency, and adaptivity scores. Results show that while limited propagation occurs, the holistic prompting of the evaluator LLM combined with cross-dimensional scoring enables penalization of initial errors. We have added this analysis as a new subsection in Section 4.3, updated relevant tables/figures, and added explicit qualifications regarding robustness limits in the discussion and conclusion. These changes strengthen rather than undermine the central claim.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in TRACE derivation chain

The paper introduces TRACE as a new reference-free framework that accumulates an evidence bank from an agent's preceding steps to score efficiency, hallucination, and adaptivity, then validates the approach on a separately developed meta-evaluation dataset containing independently labeled flawed trajectories. No equations, fitted parameters, or performance metrics are shown to reduce by construction to quantities defined from the framework's own outputs or from self-citation chains. The central claim of accurate evaluation with small open-source LLMs rests on comparison against the external labels rather than tautological reuse of the evidence bank itself, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5701 in / 1068 out tokens · 35770 ms · 2026-05-18T10:57:58.058810+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "The central component of the TRACE framework is the evidence bank, denoted as E. ... At each step t=1,2,3..., the agent generates a new piece of evidence, e_t ... E_t = {e_1, ..., e_t}"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

unclear: relation between the paper passage and the cited Recognition theorem.

Paper passage: "A thought is considered a hallucination if it contains information or makes assumptions that cannot be substantiated by the contents of the evidence bank from the previous steps, E_{t-1}."
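
The two quoted passages describe a simple loop: at step t the thought is checked against E_{t-1}, and then e_t is added to obtain E_t. The sketch below shows that loop only. It is not the authors' implementation: `EvidenceBank`, `score_trajectory`, and `digits_judge` are hypothetical names, and `digits_judge` is a deliberately crude stand-in for the LLM evaluator. It treats a thought as supported when every number it mentions already appears in the bank.

```python
from typing import Callable, List

# (thought, evidence so far) -> True if the thought is supported
Judge = Callable[[str, List[str]], bool]


class EvidenceBank:
    """Append-only store of evidence; after t additions it holds E_t."""

    def __init__(self) -> None:
        self._items: List[str] = []

    def add(self, evidence: str) -> None:
        self._items.append(evidence)

    def snapshot(self) -> List[str]:
        return list(self._items)


def score_trajectory(thoughts: List[str], evidences: List[str],
                     judge: Judge) -> List[bool]:
    """Return one flag per step; True means the thought at step t is not
    substantiated by E_{t-1}."""
    bank = EvidenceBank()
    verdicts: List[bool] = []
    for thought, evidence in zip(thoughts, evidences):
        verdicts.append(not judge(thought, bank.snapshot()))  # uses E_{t-1}
        bank.add(evidence)                                    # now E_t
    return verdicts


def digits_judge(thought: str, evidence: List[str]) -> bool:
    """Toy stand-in for an LLM judge: every number in the thought must
    appear somewhere in the evidence text."""
    numbers = [tok.strip(".,;") for tok in thought.split() if tok[:1].isdigit()]
    joined = " ".join(evidence)
    return all(num in joined for num in numbers)


thoughts = [
    "I should look up the ticket price.",
    "The ticket costs 40 dollars; I will multiply by two with the calculator.",
    "The total is 90 dollars.",
]
evidences = [
    "search: ticket price is 40 dollars",
    "calculator: 2 * 40 = 80",
    "final answer drafted",
]
verdicts = score_trajectory(thoughts, evidences, digits_judge)
```

Only the third thought is flagged here, because 90 appears in no earlier evidence. A real evaluator would replace `digits_judge` with a model call, which is where the small open-source LLMs mentioned in the abstract would come in.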

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Counterfactual Trace Auditing of LLM Agent Skills
cs.AI · 2026-05 · unverdicted · novelty 7.0

CTA framework detects 522 skill influence patterns in LLM agent traces across 49 tasks where average pass rate shifts only +0.3%, exposing evaluation gaps in behavioral effects like template copying and excess planning.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774. · internal anchor
[2] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567. · internal anchor
[3] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166. · internal anchor
[4] Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, and Qing Li. Multi-modal agent tuning: Building a vlm-driven agent for efficient tool usage. arXiv preprint arXiv:2412.15606.
[5] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. ToRA: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452. · internal anchor
[6] Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, et al. Metatool benchmark for large language models: Deciding whether to use tools and which to use. arXiv preprint arXiv:2310.03128.
[7] Yeonjun In, Wonjoong Kim, Kanghoon Yoon, Sungchul Kim, Mehrab Tanjim, Kibum Kim, and Chanyoung Park. Is safety standard same for everyone? user-specific safety evaluation of large language models. arXiv preprint arXiv:2502.15086.
[8] Takyoung Kim, Janvijay Singh, Shuhaib Mehri, Emre Can Acikgoz, Sagnik Mukherjee, Nimet Beyza Bozdag, Sumuk Shashidhar, Gokhan Tur, and Dilek Hakkani-Tür. Pipa: A unified evaluation protocol for diagnosing interactive planning agents. arXiv preprint arXiv:2505.01592.
[9] Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, and Chris Tanner. No free labels: Limitations of llm-as-a-judge without human grounding. arXiv preprint arXiv:2503.05061.
[10] Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, et al. Mmedagent: Learning to use medical tools with multi-modal agent. arXiv preprint arXiv:2407.02483, 2024a. Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. Salad-bench: A hierarchical and comp...
[11] Elliot Nelson, Georgios Kollias, Payel Das, Subhajit Chaudhury, and Soham Dan. Needle in the haystack for memory based large language models. arXiv preprint arXiv:2407.01437.
[12] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789. · internal anchor
[13] Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. From exploration to mastery: Enabling llms to master tools via self-driven interactions. arXiv preprint arXiv:2410.08197.
[14] Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Ren Kan, Dongsheng Li, and Deqing Yang. Easytool: Enhancing llm-based agents with concise tool instruction. arXiv preprint arXiv:2401.06201.
[15] Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al. Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems. arXiv preprint arXiv:2505.00212.
[16] Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, et al. A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4314–4325, 2024a. Yuxiang Zhang, Jing ...
[17] Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. Agent-as-a-judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934.
[18] A COMPLETE RELATED WORK. A.1 TOOL-AUGMENTED LLM AGENT. The capabilities of LLMs have been significantly extended by integrating external tools, giving rise to tool-augmented agents (Zhao et al., 2023). Foundational to this paradigm is the ability of LLMs to generate intermediate reasoning steps. Early work on Chain-of-Thought (CoT) p...
[19] ... demonstrated the ability to facilitate LLMs in using over 16,000 real-world APIs, showcasing remarkable generalization in tool use. Others have focused on creating specialized agents for specific domains that demand high precision, such as mathematical problem-solving with TORA (Gou et al., 2023), medical task assistance with MMedAgent (Li et al., 2024a...
[20] ... and other vision-language model-driven agents that can interpret and act upon visual information (Gao et al., 2024). As complexity has grown, efforts have also been made to improve the efficiency of tool interaction itself, through methods like providing concise tool instructions (Yuan et al.,
[21] ... or enabling models to self-improve tool documentation (Qu et al., 2024). A.2 EVALUATION OF TOOL-AUGMENTED AGENTS. The rapid development of complex, multi-step agents necessitates robust and comprehensive evaluation benchmarks. A number of benchmarks have been proposed to assess agent capabilities across different tasks. For example, GAIA (Mialon et al.,
[22] ... and m&m's (Ma et al., 2024), for instance, are constrained by primarily validating an agent's trajectory against a single, pre-defined ground-truth sequence. This approach not only penalizes agents that discover alternative, valid solution paths but also scales poorly as tool complexity increases, as manually enumerating all possible correct paths is comp...
[23] To offer a clear and comprehensive context for our experimental results, we present a summary of their key statistics in Table
Model name (used version), grouped by API:
OpenAI API: GPT-5-mini (gpt-5-mini-2025-08-07), GPT-4.1 (gpt-4.1-2025-04-14), o3-mini (o3-mini-2025-01-31), GPT-4o (gpt-4o-2024-11-20)
Claude API: Claude-sonnet-4 (claude-sonnet-4-20250514)
Gemini API: Gemini-2.5-pro (gemini-2.5-pro-preview-06-05)
TogetherAI API: Llama-3.1-8B-Instruct (meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo), Llama-3.3-70B-Instruct (m...
[24] C.1 HYPERPARAMETER SETTING. To ensure experimental reproducibility, we set the temperature to 0 and fixed the max tokens at 4096 for all trials. Additionally, the maximum number of action turns per query was limited to 10; any query exceeding this limit was automatically treated as a failure. C.2 GENERATION DETAILS. Building upon a prior study (Wang et al., 2...
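
The hyperparameter excerpt in entry [24] (temperature 0, max tokens 4096, at most 10 action turns, with overruns counted as failures) can be sketched as a small run harness. This is a sketch for illustration only: `RunConfig`, `run_query`, and the `agent_step` callable are hypothetical names, and the excerpt does not say how the authors enforced the limit in code.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class RunConfig:
    # Values taken from the C.1 excerpt; the field names are assumptions.
    temperature: float = 0.0
    max_tokens: int = 4096
    max_action_turns: int = 10


# (turn number, config) -> final answer, or None if the agent needs another turn
AgentStep = Callable[[int, RunConfig], Optional[str]]


def run_query(agent_step: AgentStep,
              config: RunConfig = RunConfig()) -> Tuple[Optional[str], int]:
    """Call agent_step once per turn until it returns an answer or the turn
    budget is exhausted. Returns (answer, turns_used); an answer of None
    means the query is treated as a failure."""
    for turn in range(1, config.max_action_turns + 1):
        answer = agent_step(turn, config)
        if answer is not None:
            return answer, turn
    return None, config.max_action_turns


# Hypothetical agents: one finishes on turn 3, one never finishes.
finished = run_query(lambda turn, cfg: "done" if turn == 3 else None)
timed_out = run_query(lambda turn, cfg: None)
```

Keeping the limits in one frozen config object keeps every trial on identical settings, which matches the reproducibility goal stated in the excerpt.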