Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Chanyoung Park; Dongha Lee; Sangwu Park; Sein Kim; Wonjoong Kim; Yeonjun In

arXiv: 2510.02837 · v3 · pith:OLAS332Tnew · submitted 2025-10-03 · 💻 cs.AI · cs.CL

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park

Pith reviewed 2026-05-18 10:57 UTC · model grok-4.3

keywords: tool-augmented agents · reasoning trajectories · reference-free evaluation · evidence bank · multi-dimensional evaluation · meta-evaluation · LLM agents

The pith

TRACE uses an evidence bank built from prior steps to evaluate the full reasoning trajectories of tool-augmented agents across multiple dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tool-augmented agents face complex requests, yet evaluation often stops at whether the final answer matches, ignoring the path taken. The paper introduces TRACE to judge those full paths on efficiency, hallucination, and adaptivity without needing a complete reference trajectory. It does so by building an evidence bank that collects relevant information from the agent's earlier steps. A new dataset of trajectories with known flaws and expert labels shows that small open-source models can run TRACE successfully. When the framework is applied to agents solving actual tasks, it brings to light observations that were not visible before.

Core claim

This work introduces TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

What carries the argument

The evidence bank, which accumulates knowledge from preceding steps to support reference-free scoring of reasoning quality along dimensions such as efficiency, hallucination, and adaptivity.
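A minimal sketch of such a bank, assuming each step t contributes one evidence item e_t and the bank after step t is E_t = {e_1, ..., e_t}, which is the notation in the paper passage quoted in the Lean section of this page. The class name `EvidenceBank`, its methods, and the sample tool outputs are hypothetical and chosen only for illustration; the paper's actual data structure is not shown here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceBank:
    """Ordered store of evidence items e_1, ..., e_t (hypothetical sketch)."""
    items: List[str] = field(default_factory=list)

    def add(self, evidence: str) -> None:
        """Append the evidence produced at the current step."""
        self.items.append(evidence)

    def up_to(self, t: int) -> List[str]:
        """Return a copy of E_t = [e_1, ..., e_t]; E_0 is the empty bank."""
        if not 0 <= t <= len(self.items):
            raise IndexError(f"step {t} is outside 0..{len(self.items)}")
        return list(self.items[:t])


# Made-up tool outputs, one per step.
bank = EvidenceBank()
for e in ["weather_api: Seoul is 21C", "calendar_api: meeting at 15:00"]:
    bank.add(e)

print(bank.up_to(1))  # evidence available after step 1
```

Keeping the items ordered makes it cheap to recover the bank as it stood before any given step, which is what a step-level check needs.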

If this is right

Evaluation of tool-augmented agents can now account for process qualities in addition to final-answer accuracy.
Costly annotation of all possible ground-truth trajectories is no longer required for multi-dimensional assessment.
Small open-source LLMs become viable for performing reliable trajectory evaluations.
New observations about agent performance on tool tasks become accessible through automated analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Trajectory-level feedback could be used to train or refine agents more effectively than outcome-only rewards.
The evidence-bank idea might extend to evaluating step-by-step reasoning in other LLM applications where complete references are unavailable.
Developers could integrate TRACE into agent benchmarks to track improvements in reasoning habits over time.

Load-bearing premise

The reference-free evidence bank built from preceding steps can reliably capture multi-dimensional aspects of reasoning quality such as efficiency, hallucination, and adaptivity without access to exhaustive ground-truth trajectories.

What would settle it

Human raters scoring the same trajectories on efficiency, hallucination, and adaptivity and finding low correlation with the scores produced by TRACE would show that the framework does not accurately evaluate the trajectories.
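A minimal sketch of that check, assuming per-trajectory scores from human raters and from TRACE are available as two equal-length lists. Spearman's rank correlation is used here as one reasonable agreement statistic; it is not necessarily the statistic the authors report. The function names and the sample scores are hypothetical and made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five trajectories on one dimension.
human_scores = [1, 2, 3, 4, 5]
trace_scores = [2, 1, 4, 3, 5]
rho = spearman(human_scores, trace_scores)
print(round(rho, 3))
```

A value of rho near zero (or negative) on a sufficiently large labeled set would be the kind of result that counts against the framework; the threshold for "low" is a judgment call the sketch does not settle.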

Figures

Figures reproduced from arXiv: 2510.02837 by Chanyoung Park, Dongha Lee, Sangwu Park, Sein Kim, Wonjoong Kim, Yeonjun In.

**Figure 2.** Tool outputs are stored in the evidence bank, which is used to detect hallucinations in each …

**Figure 3.** Time Efficiency Comparison of LLM Evaluators using TRACE on Meta-GTA dataset. We find that most LLM models can effectively evaluate the efficiency, hallucination, and adaptivity of tool-augmented agents without relying on ground-truth trajectories when using TRACE. Notably, this demonstrates that evaluation is sufficiently achievable with open-source models, avoiding the evaluation costs associated with …

**Figure 4.** Model accuracy based on the number of tokens used and dialogue turns. In this section, we analyze token usage as a potential cause of performance degradation in tool-augmented agents

**Figure 5.** Case study: Both agents are correct but trajectory efficiency is different in GPT-4.1 and …

**Figure 6.** Case study: Both agents (GPT-4.1 and Qwen-72B) are incorrect but GPT-4.1 trajectory …

**Figure 7.** Prompt for generating multiple ground-truth paths.

**Figure 8.** Prompt for generating inefficiency in the Meta evaluation dataset.

**Figure 9.** Prompt for generating hallucinations in the Meta evaluation dataset.

**Figure 10.** Prompt for generating adaptivity in the Meta evaluation dataset.

**Figure 11.** Prompt for validating augmented multiple ground-truth paths.

**Figure 12.** Prompt for validating augmented inefficiency in the Meta evaluation dataset.

**Figure 13.** Prompt for validating augmented hallucinations in the Meta evaluation dataset.

**Figure 14.** Prompt for validating augmented adaptivity in the Meta evaluation dataset.

**Figure 15.** TRACE Prompt for evaluating inefficiency.

**Figure 16.** TRACE Prompt for evaluating hallucinations.

**Figure 17.** TRACE Prompt for evaluating adaptivity.

**Figure 18.** Prompt provided to LLM for the trajectory generation.

**Figure 19.** Formatting prompt for correcting LLM outputs.

Original abstract

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivity. The most straightforward method for evaluation is to compare an agent's trajectory with the ground-truth, but annotating all valid ground-truth trajectories is prohibitively expensive. In this manner, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TRACE gives a reference-free way to score full agent trajectories on efficiency, hallucination and adaptivity, but the evidence bank from prior steps risks missing propagated errors.

The main thing here is that they built TRACE to judge tool-augmented agent paths on more than just the final answer. They accumulate an evidence bank from the agent's own earlier steps and use it to rate efficiency, hallucination, and adaptivity without needing every possible ground-truth trajectory. They also released a meta-evaluation dataset with labeled flawed trajectories to test the approach, and they report that small open-source models can run the evaluation.

Referee Report

1 major / 3 minor

Summary. The paper introduces TRACE, a reference-free framework for multi-dimensional evaluation of tool-augmented LLM reasoning trajectories. TRACE builds an evidence bank by accumulating knowledge from an agent's preceding steps to score dimensions including efficiency, hallucination, and adaptivity, avoiding the need for exhaustive ground-truth trajectory annotations. A new meta-evaluation dataset containing diverse and flawed trajectories, each with multi-faceted performance labels, is created to validate the approach. Experiments show that TRACE achieves accurate evaluations even when using small open-source LLMs as evaluators. The framework is further applied to trajectories from agents solving tool-augmented tasks, yielding previously unreported observations and insights.

Significance. If the central results hold, TRACE offers a practical, scalable alternative to answer-matching or full ground-truth comparison for assessing agent trajectories in complex tool-use settings. The meta-evaluation dataset constitutes a reusable contribution for future work on trajectory quality, and the application to real agents surfaces actionable insights about efficiency and error patterns. The demonstration that small open-source models suffice for evaluation lowers barriers to adoption.

major comments (1)

§3 (Framework): The evidence bank is constructed solely from the agent's own preceding steps without external verification. For hallucination assessment this creates a direct risk that an erroneous fact introduced at step t is retained as valid evidence for steps t+1 onward, causing the evaluator to rate the overall trajectory as coherent rather than penalizing the initial error. The meta-evaluation dataset contains labeled flawed trajectories, yet the paper does not report a targeted ablation or error-propagation analysis that isolates cases where early hallucinations affect later evidence-bank entries. Because the central claim is that TRACE 'accurately evaluates complex trajectories' in a reference-free manner, this gap is load-bearing and requires either additional experiments or explicit qualification of the method's robustness limits.
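The propagation risk can be illustrated with a toy sketch. It assumes a crude string-matching support check in place of an LLM evaluator, and it contrasts two hypothetical policies: a bank that admits only tool outputs, and a bank that also admits the agent's own unverified statements. Which policy TRACE follows is exactly the point under question (the Figure 2 caption mentions tool outputs); the function names and sample steps are invented for illustration.

```python
def supported(claim, bank):
    """Toy check: a claim is supported if it appears inside some bank item."""
    return any(claim.lower() in item.lower() for item in bank)


def audit(steps, trust_agent_text):
    """Flag each step whose claim is unsupported by the bank built so far.

    steps: list of (claim, tool_output_or_None) pairs, in order.
    trust_agent_text: if True, the agent's own claim is added to the bank
    after the step, whether or not anything verified it.
    """
    bank, flags = [], []
    for claim, tool_output in steps:
        flags.append(not supported(claim, bank))
        if tool_output is not None:
            bank.append(tool_output)
        if trust_agent_text:
            bank.append(claim)  # unverified text now counts as evidence
    return flags


# The same unsupported claim made twice, with no tool call backing it.
steps = [
    ("the order shipped on may 2", None),
    ("the order shipped on may 2", None),
]

print(audit(steps, trust_agent_text=True))   # repeat escapes the flag
print(audit(steps, trust_agent_text=False))  # both steps are flagged
```

The sketch covers only the downstream half of the concern: once an unverified statement is in the bank, later steps that rely on it look supported.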

minor comments (3)

Abstract: The statement 'our results confirm that TRACE accurately evaluates...' is presented without any numerical summary (e.g., accuracy, correlation, or inter-annotator agreement figures). Adding one or two key quantitative results would make the abstract self-contained.
§4 (Meta-evaluation dataset): The process by which the multi-faceted performance labels were assigned (human annotators, external knowledge sources, or LLM-assisted) is not described in sufficient detail to allow readers to assess potential label bias relative to TRACE's reference-free operation.
Figure 2 / Table 1: Axis labels and legend entries use abbreviations (e.g., 'Eff', 'Hall') without an explicit key in the caption; this reduces immediate readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comment on the evidence bank construction and error propagation below, and have revised the manuscript to incorporate additional analysis and qualifications as described.

Referee: §3 (Framework): The evidence bank is constructed solely from the agent's own preceding steps without external verification. For hallucination assessment this creates a direct risk that an erroneous fact introduced at step t is retained as valid evidence for steps t+1 onward, causing the evaluator to rate the overall trajectory as coherent rather than penalizing the initial error. The meta-evaluation dataset contains labeled flawed trajectories, yet the paper does not report a targeted ablation or error-propagation analysis that isolates cases where early hallucinations affect later evidence-bank entries. Because the central claim is that TRACE 'accurately evaluates complex trajectories' in a reference-free manner, this gap is load-bearing and requires either additional experiments or explicit qualification of the method's robustness limits.

Authors: We appreciate the referee highlighting this important consideration for our reference-free approach. We agree that constructing the evidence bank exclusively from preceding agent steps introduces a plausible risk of propagating early hallucinations, which could affect coherence assessments if not properly mitigated. At the same time, the meta-evaluation dataset was explicitly designed to include diverse flawed trajectories (with expert multi-faceted labels covering hallucination cases), and the strong alignment between TRACE scores and these labels (Section 4) provides empirical support that the framework does not simply overlook such errors. To directly respond to the concern, the revised manuscript now includes a targeted error-propagation ablation. We isolated trajectories with labeled early hallucinations, systematically varied the evidence bank contents, and measured effects on hallucination, efficiency, and adaptivity scores. Results show that while limited propagation occurs, the holistic prompting of the evaluator LLM combined with cross-dimensional scoring enables penalization of initial errors. We have added this analysis as a new subsection in Section 4.3, updated relevant tables/figures, and added explicit qualifications regarding robustness limits in the discussion and conclusion. These changes strengthen rather than undermine the central claim.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in TRACE derivation chain

The paper introduces TRACE as a new reference-free framework that accumulates an evidence bank from an agent's preceding steps to score efficiency, hallucination, and adaptivity, then validates the approach on a separately developed meta-evaluation dataset containing independently labeled flawed trajectories. No equations, fitted parameters, or performance metrics are shown to reduce by construction to quantities defined from the framework's own outputs or from self-citation chains. The central claim of accurate evaluation with small open-source LLMs rests on comparison against the external labels rather than tautological reuse of the evidence bank itself, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5701 in / 1068 out tokens · 35770 ms · 2026-05-18T10:57:58.058810+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

"The central component of the TRACE framework is the evidence bank, denoted as E. ... At each step t = 1, 2, 3, ..., the agent generates a new piece of evidence, e_t ... E_t = {e_1, ..., e_t}"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

"A thought is considered a hallucination if it contains information or makes assumptions that cannot be substantiated by the contents of the evidence bank from the previous steps, E_{t-1}."
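The quoted definition can be sketched as a step-level check: the thought at step t is compared against E_{t-1}, the evidence gathered before that step. In the paper the judgment is made by an LLM evaluator (the prompt is listed as Figure 16); the `keyword_judge` below is a deliberately crude, hypothetical stand-in so the example runs without a model. All function names, the claim format, and the sample data are assumptions for illustration.

```python
from typing import Callable, List, Sequence

Judge = Callable[[str, Sequence[str]], bool]


def keyword_judge(claim: str, evidence: Sequence[str]) -> bool:
    """Toy stand-in for an LLM evaluator: a claim counts as supported only
    if it appears (case-insensitively) inside some evidence item."""
    needle = claim.lower()
    return any(needle in item.lower() for item in evidence)


def flag_hallucinations(
    thought_claims: Sequence[Sequence[str]],
    evidence_items: Sequence[str],
    judge: Judge = keyword_judge,
) -> List[bool]:
    """Return one flag per step; True means the step's thought contains a
    claim that E_{t-1} does not substantiate.

    thought_claims[t-1] holds the claims made in the thought at step t.
    evidence_items[t-1] is e_t, the evidence produced at step t.
    """
    flags = []
    for t, claims in enumerate(thought_claims, start=1):
        prior = evidence_items[: t - 1]  # E_{t-1}
        flags.append(any(not judge(c, prior) for c in claims))
    return flags


# Made-up trajectory: two tool outputs, three thoughts.
evidence = [
    "search: the Eiffel Tower is 330 m tall",
    "calc: 330 / 3 = 110",
]
claims = [
    [],                                   # step 1 asserts nothing
    ["the eiffel tower is 330 m tall"],   # backed by e_1
    ["the tower opened in 1889"],         # not in e_1 or e_2
]
flags = flag_hallucinations(claims, evidence)
print(flags)
```

Passing the judge in as a parameter keeps the step logic independent of how support is decided, so the string matcher could be swapped for a model call without changing the loop.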

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Counterfactual Trace Auditing of LLM Agent Skills
cs.AI · 2026-05 · unverdicted · novelty 7.0

CTA framework detects 522 skill influence patterns in LLM agent traces across 49 tasks where average pass rate shifts only +0.3%, exposing evaluation gaps in behavioral effects like template copying and excess planning.

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
cs.AI · 2026-05 · unverdicted · novelty 6.0

PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.