pith. the verified trust layer for science. sign in

arxiv: 2510.02837 · v2 · pith:OLAS332Tnew · submitted 2025-10-03 · 💻 cs.AI · cs.CL

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Pith reviewed 2026-05-18 10:57 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords tool-augmented agentsreasoning trajectoriesreference-free evaluationevidence bankmulti-dimensional evaluationmeta-evaluationLLM agents
0
0 comments X p. Extension
Add this Pith Number to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{OLAS332T}

Prints a linked pith:OLAS332T badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

TRACE uses an evidence bank built from prior steps to evaluate the full reasoning trajectories of tool-augmented agents across multiple dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tool-augmented agents face complex requests, yet evaluation often stops at whether the final answer matches, ignoring the path taken. The paper introduces TRACE to judge those full paths on efficiency, hallucination, and adaptivity without needing a complete reference trajectory. It does so by building an evidence bank that collects relevant information from the agent's earlier steps. A new dataset of trajectories with known flaws and expert labels shows that small open-source models can run TRACE successfully. When the framework is applied to agents solving actual tasks, it brings to light observations that were not visible before.

Core claim

This work introduces TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their 2

What carries the argument

The evidence bank, which accumulates knowledge from preceding steps to support reference-free scoring of reasoning quality along dimensions such as efficiency, hallucination, and adaptivity.

If this is right

  • Evaluation of tool-augmented agents can now account for process qualities in addition to final-answer accuracy.
  • Costly annotation of all possible ground-truth trajectories is no longer required for multi-dimensional assessment.
  • Small open-source LLMs become viable for performing reliable trajectory evaluations.
  • New observations about agent performance on tool tasks become accessible through automated analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Trajectory-level feedback could be used to train or refine agents more effectively than outcome-only rewards.
  • The evidence-bank idea might extend to evaluating step-by-step reasoning in other LLM applications where complete references are unavailable.
  • Developers could integrate TRACE into agent benchmarks to track improvements in reasoning habits over time.

Load-bearing premise

The reference-free evidence bank built from preceding steps can reliably capture multi-dimensional aspects of reasoning quality such as efficiency, hallucination, and adaptivity without access to exhaustive ground-truth trajectories.

What would settle it

Human raters scoring the same trajectories on efficiency, hallucination, and adaptivity and finding low correlation with the scores produced by TRACE would show that the framework does not accurately evaluate the trajectories.

Figures

Figures reproduced from arXiv: 2510.02837 by Chanyoung Park, Dongha Lee, Sangwu Park, Sein Kim, Wonjoong Kim, Yeonjun In.

Figure 1
Figure 1. Figure 1: An example of agents returning the same answer through different trajectories given the [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Tool outputs are stored in the evidence bank, which is used to detect hallucinations in each [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Time Efficiency Comparison of LLM Evaluators using TRACE on Meta-GTA dataset. We find that most LLM models can effectively evalu￾ate the efficiency, hallucination, and adaptivity of tool￾augmented agents without relying on ground-truth trajec￾tories when using TRACE. Notably, this demonstrates that evaluation is sufficiently achievable with open-source models, avoiding the evaluation costs associated with … view at source ↗
Figure 4
Figure 4. Figure 4: Model accuracy based on the number of tokens used and dialogue turns. In this section, we analyze token usage as a potential cause of performance degradation in tool-augmented agents [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Case study: Both agents are correct but trajectory efficiency is different in GPT-4.1 and [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Case study: Both agents (GPT-4.1 and Qwen-72B) are incorrect but GPT-4.1 trajectory [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Prompt for generating multiple ground-truth paths. [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompt for generating inefficiency in the Meta evaluation dataset. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prompt for generating hallucinations in the Meta evaluation dataset. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prompt for generating adaptivity in the Meta evaluation dataset. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prompt for validating augmented multiple ground-truth paths. [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Prompt for validating augmented inefficiency in the Meta evaluation dataset. [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Prompt for validating augmented hallucinations in the Meta evaluation dataset. [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Prompt for validating augmented adaptivity in the Meta evaluation dataset. [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: TRACE Prompt for evaluating inefficiency. [PITH_FULL_IMAGE:figures/full_fig_p027_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: TRACE Prompt for evaluating hallucinations. [PITH_FULL_IMAGE:figures/full_fig_p028_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: TRACE Prompt for evaluating adaptivity. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Prompt provided to LLM for the trajectory generation. [PITH_FULL_IMAGE:figures/full_fig_p030_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Formatting prompt for correcting LLM outputs. [PITH_FULL_IMAGE:figures/full_fig_p030_19.png] view at source ↗
read the original abstract

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivity. The most straightforward method for evaluation is to compare an agent's trajectory with the ground-truth, but annotating all valid ground-truth trajectories is prohibitively expensive. In this manner, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces TRACE, a reference-free framework for multi-dimensional evaluation of tool-augmented LLM reasoning trajectories. TRACE builds an evidence bank by accumulating knowledge from an agent's preceding steps to score dimensions including efficiency, hallucination, and adaptivity, avoiding the need for exhaustive ground-truth trajectory annotations. A new meta-evaluation dataset containing diverse and flawed trajectories, each with multi-faceted performance labels, is created to validate the approach. Experiments show that TRACE achieves accurate evaluations even when using small open-source LLMs as evaluators. The framework is further applied to trajectories from agents solving tool-augmented tasks, yielding previously unreported observations and insights.

Significance. If the central results hold, TRACE offers a practical, scalable alternative to answer-matching or full ground-truth comparison for assessing agent trajectories in complex tool-use settings. The meta-evaluation dataset constitutes a reusable contribution for future work on trajectory quality, and the application to real agents surfaces actionable insights about efficiency and error patterns. The demonstration that small open-source models suffice for evaluation lowers barriers to adoption.

major comments (1)
  1. §3 (Framework): The evidence bank is constructed solely from the agent's own preceding steps without external verification. For hallucination assessment this creates a direct risk that an erroneous fact introduced at step t is retained as valid evidence for steps t+1 onward, causing the evaluator to rate the overall trajectory as coherent rather than penalizing the initial error. The meta-evaluation dataset contains labeled flawed trajectories, yet the paper does not report a targeted ablation or error-propagation analysis that isolates cases where early hallucinations affect later evidence-bank entries. Because the central claim is that TRACE 'accurately evaluates complex trajectories' in a reference-free manner, this gap is load-bearing and requires either additional experiments or explicit qualification of the method's robustness limits.
minor comments (3)
  1. Abstract: The statement 'our results confirm that TRACE accurately evaluates...' is presented without any numerical summary (e.g., accuracy, correlation, or inter-annotator agreement figures). Adding one or two key quantitative results would make the abstract self-contained.
  2. §4 (Meta-evaluation dataset): The process by which the multi-faceted performance labels were assigned (human annotators, external knowledge sources, or LLM-assisted) is not described in sufficient detail to allow readers to assess potential label bias relative to TRACE's reference-free operation.
  3. Figure 2 / Table 1: Axis labels and legend entries use abbreviations (e.g., 'Eff', 'Hall') without an explicit key in the caption; this reduces immediate readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comment on the evidence bank construction and error propagation below, and have revised the manuscript to incorporate additional analysis and qualifications as described.

read point-by-point responses
  1. Referee: §3 (Framework): The evidence bank is constructed solely from the agent's own preceding steps without external verification. For hallucination assessment this creates a direct risk that an erroneous fact introduced at step t is retained as valid evidence for steps t+1 onward, causing the evaluator to rate the overall trajectory as coherent rather than penalizing the initial error. The meta-evaluation dataset contains labeled flawed trajectories, yet the paper does not report a targeted ablation or error-propagation analysis that isolates cases where early hallucinations affect later evidence-bank entries. Because the central claim is that TRACE 'accurately evaluates complex trajectories' in a reference-free manner, this gap is load-bearing and requires either additional experiments or explicit qualification of the method's robustness limits.

    Authors: We appreciate the referee highlighting this important consideration for our reference-free approach. We agree that constructing the evidence bank exclusively from preceding agent steps introduces a plausible risk of propagating early hallucinations, which could affect coherence assessments if not properly mitigated. At the same time, the meta-evaluation dataset was explicitly designed to include diverse flawed trajectories (with expert multi-faceted labels covering hallucination cases), and the strong alignment between TRACE scores and these labels (Section 4) provides empirical support that the framework does not simply overlook such errors. To directly respond to the concern, the revised manuscript now includes a targeted error-propagation ablation. We isolated trajectories with labeled early hallucinations, systematically varied the evidence bank contents, and measured effects on hallucination, efficiency, and adaptivity scores. Results show that while limited propagation occurs, the holistic prompting of the evaluator LLM combined with cross-dimensional scoring enables penalization of initial errors. We have added this analysis as a new subsection in Section 4.3, updated relevant tables/figures, and added explicit qualifications regarding robustness limits in the discussion and conclusion. These changes strengthen rather than undermine the central claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in TRACE derivation chain

full rationale

The paper introduces TRACE as a new reference-free framework that accumulates an evidence bank from an agent's preceding steps to score efficiency, hallucination, and adaptivity, then validates the approach on a separately developed meta-evaluation dataset containing independently labeled flawed trajectories. No equations, fitted parameters, or performance metrics are shown to reduce by construction to quantities defined from the framework's own outputs or from self-citation chains. The central claim of accurate evaluation with small open-source LLMs rests on comparison against the external labels rather than tautological reuse of the evidence bank itself, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5701 in / 1068 out tokens · 35770 ms · 2026-05-18T10:57:58.058810+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Counterfactual Trace Auditing of LLM Agent Skills

    cs.AI 2026-05 unverdicted novelty 7.0

    CTA framework detects 522 skill influence patterns in LLM agent traces across 49 tasks where average pass rate shifts only +0.3%, exposing evaluation gaps in behavioral effects like template copying and excess planning.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of- thought for reasoning large language models.arXiv preprint arXiv:2503.09567,

  3. [3]

    GPTScore: Evaluate as You Desire

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire.arXiv preprint arXiv:2302.04166,

  4. [4]

    Multi-modal agent tuning: Building a vlm-driven agent for efficient tool usage.arXiv preprint arXiv:2412.15606,

    Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song- Chun Zhu, and Qing Li. Multi-modal agent tuning: Building a vlm-driven agent for efficient tool usage.arXiv preprint arXiv:2412.15606,

  5. [5]

    ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving.arXiv preprint arXiv:2309.17452,

  6. [6]

    Metatool benchmark for large language models: Deciding whether to use tools and which to use

    Yue Huang, Jiawen Shi, Yuan Li, Chenrui Fan, Siyuan Wu, Qihui Zhang, Yixin Liu, Pan Zhou, Yao Wan, Neil Zhenqiang Gong, et al. Metatool benchmark for large language models: Deciding whether to use tools and which to use.arXiv preprint arXiv:2310.03128,

  7. [7]

    Is safety standard same for everyone? user-specific safety evaluation of large language models.arXiv preprint arXiv:2502.15086,

    Yeonjun In, Wonjoong Kim, Kanghoon Yoon, Sungchul Kim, Mehrab Tanjim, Kibum Kim, and Chanyoung Park. Is safety standard same for everyone? user-specific safety evaluation of large language models.arXiv preprint arXiv:2502.15086,

  8. [8]

    Under Review

    11 Preprint. Under Review. Takyoung Kim, Janvijay Singh, Shuhaib Mehri, Emre Can Acikgoz, Sagnik Mukherjee, Nimet Beyza Bozdag, Sumuk Shashidhar, Gokhan Tur, and Dilek Hakkani-T ¨ur. Pipa: A unified evaluation protocol for diagnosing interactive planning agents.arXiv preprint arXiv:2505.01592,

  9. [9]

    No free labels: Limitations of llm-as-a-judge without human grounding

    Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, and Chris Tanner. No free labels: Limitations of llm-as-a-judge without human grounding.arXiv preprint arXiv:2503.05061,

  10. [10]

    Mmedagent: Learning to use medical tools with multi-modal agent.arXiv preprint arXiv:2407.02483, 2024a

    Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, et al. Mmedagent: Learning to use medical tools with multi-modal agent.arXiv preprint arXiv:2407.02483, 2024a. Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. Salad-bench: A hierarchical and comp...

  11. [11]

    Needle in the haystack for memory based large language models.arXiv preprint arXiv:2407.01437,

    Elliot Nelson, Georgios Kollias, Payel Das, Subhajit Chaudhury, and Soham Dan. Needle in the haystack for memory based large language models.arXiv preprint arXiv:2407.01437,

  12. [12]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789,

  13. [13]

    From exploration to mastery: Enabling llms to master tools via self-driven interac- tions.arXiv preprint arXiv:2410.08197,

    Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. From exploration to mastery: Enabling llms to master tools via self-driven interac- tions.arXiv preprint arXiv:2410.08197,

  14. [14]

    arXiv preprint arXiv:2401.06201

    Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Ren Kan, Dongsheng Li, and Deqing Yang. Easytool: Enhancing llm-based agents with concise tool instruction.arXiv preprint arXiv:2401.06201,

  15. [15]

    Which agent causes task failures and when? on auto- mated failure attribution of llm multi-agent systems.arXiv preprint arXiv:2505.00212,

    Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al. Which agent causes task failures and when? on auto- mated failure attribution of llm multi-agent systems.arXiv preprint arXiv:2505.00212,

  16. [16]

    A multimodal foundation agent for financial trading: Tool- augmented, diversified, and generalist

    Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, et al. A multimodal foundation agent for financial trading: Tool- augmented, diversified, and generalist. InProceedings of the 30th acm sigkdd conference on knowledge discovery and data mining, pp. 4314–4325, 2024a. Yuxiang Zhang, Jing ...

  17. [17]

    is it hallucinating?

    Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. Agent- as-a-judge: Evaluate agents with agents.arXiv preprint arXiv:2410.10934,

  18. [18]

    Under Review

    13 Preprint. Under Review. A COMPLETERELATEDWORK A.1 TOOL-AUGMENTEDLLM AGENT The capabilities of LLMs have been significantly extended by integrating external tools, giving rise to tool-augmented agents (Zhao et al., 2023). Foundational to this paradigm is the ability of LLMs to generate intermediate reasoning steps. Early work on Chain-of-Thought (CoT) p...

  19. [19]

    demonstrated the ability to facilitate LLMs in using over 16,000 real-world APIs, showcasing remarkable generalization in tool use. Others have focused on creating specialized agents for specific domains that demand high precision, such as mathe- matical problem-solving with TORA (Gou et al., 2023), medical task assistance with MMedAgent (Li et al., 2024a...

  20. [20]

    As complexity has grown, efforts have also been made to improve the efficiency of tool interaction itself, through methods like providing concise tool instructions (Yuan et al.,

    and other vision-language model-driven agents that can interpret and act upon visual information (Gao et al., 2024). As complexity has grown, efforts have also been made to improve the efficiency of tool interaction itself, through methods like providing concise tool instructions (Yuan et al.,

  21. [21]

    A.2 EVALUATION OFTOOL-AUGMENTEDAGENTS The rapid development of complex, multi-step agents necessitates robust and comprehensive evalu- ation benchmarks

    or enabling models to self-improve tool documentation (Qu et al., 2024). A.2 EVALUATION OFTOOL-AUGMENTEDAGENTS The rapid development of complex, multi-step agents necessitates robust and comprehensive evalu- ation benchmarks. A number of benchmarks have been proposed to assess agent capabilities across different tasks. For example, GAIA (Mialon et al.,

  22. [22]

    and m&m’s (Ma et al., 2024), for instance, are constrained by primarily validating an agent’s trajectory against a single, pre-defined ground-truth sequence. This approach not only penalizes agents that discover alternative, valid solution paths but also scales poorly as tool complexity increases, as manually enumerating all possible correct paths is comp...

  23. [23]

    To offer a clear and comprehensive context for our experimental re- sults, we present a summary of their key statistics in Table

    Model Name Used Version OpenAI API GPT-5-mini gpt-5-mini-2025-08-07 GPT-4.1 gpt-4.1-2025-04-14 o3-mini o3-mini-2025-01-31 GPT-4o gpt-4o-2024-11-20 Claude API Claude-sonnet-4 claude-sonnet-4-20250514 Gemini API Gemini-2.5-pro gemini-2.5-pro-preview-06-05 TogetherAI API Llama-3.1-8B-Instructmeta-llama/Meta-Llama-3.1-8B-Instruct-Turbo Llama-3.3-70B-Instructm...

  24. [24]

    Additionally, the maximum number of action turns per query was limited to 10; any query exceeding this limit was automatically treated as a failure

    C.1 HYPERPARAMETERSETTING To ensure experimental reproducibility, we set the temperature to 0 and fixed the max tokens at 4096 for all trials. Additionally, the maximum number of action turns per query was limited to 10; any query exceeding this limit was automatically treated as a failure. C.2 GENERATIONDETAILS Building upon a prior study (Wang et al., 2...