
arxiv: 2606.23112 · v1 · pith:5ICYWXVGnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI · cs.CL

Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning

Pith reviewed 2026-06-26 09:00 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords tool-calling agents · multi-turn agents · direct preference optimization · divergence point · graph-based planning · self-improvement · tau2-bench

The pith

ToolGraph structures multi-turn tool agents with schema topology and rollout weights, then refines them via 161 divergence-point preference pairs trained with DPO under matching context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that multi-turn tool-calling agents suffer from unstructured tool selection and prompt mismatch between training and deployment. ToolGraph addresses this by building a directed graph from tool schemas, estimating transition probabilities from successful trajectories, and adding controls for write prerequisites and search loops. Preference pairs are then located at divergence points using state matching and prefix alignment, filtered for action correctness, and used to run DPO inside the same ToolGraph setup. On 375 tau2-bench tasks this lifts weighted average reward from 0.304 to 0.355, with most of the DPO gain appearing in airline and retail domains.

Core claim

ToolGraph raises weighted reward from 0.304 to 0.338; adding DPO on 161 state-matched preference pairs raises it further to 0.355, with the additional gain concentrated in airline and retail while roughly half of telecom trajectories still exhaust the step budget.
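
The relative and absolute gains quoted on this page follow from the three reported rewards alone. The short Python check below reproduces that arithmetic; the variable and function names are illustrative and do not come from the paper.

```python
# Weighted average rewards reported for the 375 tau2-bench tasks.
baseline = 0.304
toolgraph = 0.338
toolgraph_dpo = 0.355


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


graph_gain = relative_gain(toolgraph, baseline)      # ToolGraph vs. baseline
total_gain = relative_gain(toolgraph_dpo, baseline)  # ToolGraph+DPO vs. baseline
dpo_abs = toolgraph_dpo - toolgraph                  # absolute gain added by DPO

print(f"{graph_gain:.1f}% {total_gain:.1f}% {dpo_abs:.3f}")
# prints: 11.2% 16.8% 0.017
```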

What carries the argument

ToolGraph, a schema-derived directed graph whose nodes are tool calls, whose edges carry transition weights estimated from successful rollouts, and whose controls enforce write prerequisites and repeated-search prevention.
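
A minimal sketch of how such a graph might be assembled is given below. It is not the paper's implementation: the function names, the `allowed_edges` and `write_prereqs` structures, the `max_repeats` cap, and the tool names in the example are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def estimate_transition_weights(rollouts, allowed_edges):
    """Estimate P(next tool | current tool) from successful rollouts,
    counting only edges permitted by the (assumed) schema-derived topology."""
    counts = defaultdict(Counter)
    for tools in rollouts:
        for src, dst in zip(tools, tools[1:]):
            if (src, dst) in allowed_edges:
                counts[src][dst] += 1
    return {
        src: {dst: n / sum(c.values()) for dst, n in c.items()}
        for src, c in counts.items()
    }


def candidate_tools(current, history, weights, write_prereqs, max_repeats=2):
    """Rank next tools by transition weight, then apply history-aware controls."""
    ranked = sorted(weights.get(current, {}).items(), key=lambda kv: -kv[1])
    allowed = []
    for tool, _ in ranked:
        # Write prerequisite: skip a tool until all its required calls were made.
        if not write_prereqs.get(tool, set()) <= set(history):
            continue
        # Repeated-search prevention: cap how often one tool may recur.
        if history.count(tool) >= max_repeats:
            continue
        allowed.append(tool)
    return allowed


# Hypothetical tool names and rollouts, for illustration only.
edges = {("get_user", "get_order"), ("get_user", "search"),
         ("search", "get_order"), ("get_order", "cancel_order")}
rollouts = [
    ["get_user", "get_order", "cancel_order"],
    ["get_user", "search", "get_order", "cancel_order"],
    ["get_user", "get_order", "cancel_order"],
]
weights = estimate_transition_weights(rollouts, edges)
prereqs = {"cancel_order": {"get_order"}}
print(candidate_tools("get_user", ["get_user"], weights, prereqs))
```

In this sketch the topology decides which transitions may be counted at all, while the rollouts only set their relative weights, which mirrors the split between schema-derived structure and rollout statistics described above.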

If this is right

  • ToolGraph alone (topology, transition weights, and controls) accounts for an 11.2 percent relative reward increase before any preference tuning.
  • DPO gains concentrate in airline and retail domains, implying domain-specific divergence patterns matter.
  • Chosen reward positivity is the strongest checkpoint signal among the 16 DPO configurations tested.
  • Telecom tasks are limited by step budget exhaustion before action execution in about half the trajectories.
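
To make the "chosen reward positivity" signal concrete, the sketch below computes the standard DPO loss together with the implicit reward of the chosen response, beta * (log pi(y_w|x) - log pi_ref(y_w|x)). Reading "positivity" as the fraction of pairs whose chosen reward is above zero is an assumption; this page does not give the paper's exact definition. The log-probabilities in the example are made-up placeholders.

```python
import math


def dpo_stats(pairs, beta=0.1):
    """Summarize DPO statistics for a batch of preference pairs.

    Each pair is (policy_logp_chosen, ref_logp_chosen,
                  policy_logp_rejected, ref_logp_rejected),
    where every entry is a summed sequence log-probability.
    """
    losses, chosen_rewards = [], []
    for pc, rc, pr, rr in pairs:
        r_chosen = beta * (pc - rc)      # implicit reward of chosen response
        r_rejected = beta * (pr - rr)    # implicit reward of rejected response
        margin = r_chosen - r_rejected
        # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
        losses.append(math.log1p(math.exp(-margin)))
        chosen_rewards.append(r_chosen)
    n = len(pairs)
    return {
        "loss": sum(losses) / n,
        "mean_chosen_reward": sum(chosen_rewards) / n,
        # Assumed reading of "chosen reward positivity".
        "chosen_positive_frac": sum(r > 0 for r in chosen_rewards) / n,
    }


# Placeholder log-probabilities, not values from the paper.
stats = dpo_stats([(-10.0, -12.0, -15.0, -13.0),
                   (-20.0, -18.0, -25.0, -19.0)])
print(stats)
```

Both example pairs have the same positive margin, so they contribute the same loss, yet only the first has a positive chosen reward. That is one reason the sign of the chosen reward can carry information the loss alone does not.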

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the graph can be built from schemas without rollout data, the method could apply to new tool sets with no prior trajectories.
  • The state-matching technique for locating divergence points may generalize to other multi-turn agent settings where full trajectories are expensive to label.
  • Because DPO is run under the identical ToolGraph context used at test time, the approach reduces one common source of distribution shift in preference tuning.

Load-bearing premise

The 161 preference pairs extracted by state matching and prefix alignment remain high-quality and free of train-deployment mismatch when DPO is applied inside the ToolGraph context.
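
One way to picture the extraction step is sketched below, under stated assumptions: each trajectory is a list of steps carrying a state summary, an action, and a correctness label; the divergence point is the first index where the actions of a successful and a failed trajectory differ; and a pair is kept only if the states match there and the labels agree with the outcome. The `Step` and `Pair` types and all field names are hypothetical, and the paper's actual matching rules may differ.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    state: str     # hypothetical summary of dialogue/database state
    action: str    # serialized tool call or message
    correct: bool  # hypothetical action-correctness annotation


class Pair(NamedTuple):
    prefix: tuple  # shared action prefix used as DPO context
    chosen: str
    rejected: str


def divergence_pair(success: list, failure: list) -> Optional[Pair]:
    """Return a preference pair at the first divergence point, or None."""
    for i, (good, bad) in enumerate(zip(success, failure)):
        if good.action == bad.action:
            continue  # still on the shared prefix
        # State matching: both actions must be taken from the same state.
        if good.state != bad.state:
            return None
        # Correctness filter: chosen must be correct, rejected incorrect.
        if not good.correct or bad.correct:
            return None
        prefix = tuple(step.action for step in success[:i])
        return Pair(prefix, good.action, bad.action)
    return None  # no divergence within the overlapping steps


# Hypothetical trajectories for one task.
ok = [Step("s0", "get_user", True), Step("s1", "get_order", True)]
ko = [Step("s0", "get_user", True), Step("s1", "cancel_order", False)]
pair = divergence_pair(ok, ko)
print(pair)
```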

What would settle it

Run the same DPO step on the 161 pairs but replace the ToolGraph context at inference time with the original prompt that lacks the graph; if the 0.017 absolute gain disappears, the assumption fails.

Figures

Figures reproduced from arXiv: 2606.23112 by Jiaqiang Tang.

Figure 1: System pipeline overview. The agent runs on vLLM-served Qwen 3.5 9B (execution layer). ToolGraph and the DPO ...
Figure 2: Main tau2-bench results across four domains. Parenthetical values indicate task counts. ToolGraph improves the ...
Original abstract

Multi-turn tool-using agents must coordinate long-horizon tool sequences while tracking dialogue state and policy constraints. Existing approaches often separate inference-time orchestration from parameter-level learning, leaving tool selection weakly structured and preference updates vulnerable to train-deployment prompt mismatch. For within-benchmark self-improvement, ToolGraph combines schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls for write prerequisites and repeated-search loops. We then construct 161 preference pairs by locating divergence points via state-based matching and prefix-based alignment, filtered through action-correctness annotations, and train DPO under the same ToolGraph context used at inference. Across 375 tau2-bench tasks, ToolGraph raises the weighted average reward from 0.304 to 0.338 (+11.2% relative), while ToolGraph+DPO reaches 0.355 (+16.8% over the baseline), with the DPO gain concentrated in airline and retail. Fine-grained diagnostics further show that roughly half of telecom trajectories exhaust the step budget before action execution and that chosen reward positivity is the most useful checkpoint signal across our 16 evaluated DPO configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ToolGraph, a method for multi-turn tool-calling agents that integrates schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls to manage write prerequisites and repeated-search loops. It then constructs 161 preference pairs by locating divergence points via state-based matching and prefix-based alignment, filtered by action-correctness annotations, and applies DPO training under the same ToolGraph context. On 375 tau2-bench tasks, ToolGraph improves weighted average reward from 0.304 to 0.338 (+11.2%), and ToolGraph+DPO reaches 0.355 (+16.8% over baseline), with the DPO gain concentrated in the airline and retail domains. Diagnostics note step-budget exhaustion in telecom trajectories and the utility of chosen reward positivity across 16 DPO configurations.

Significance. If the reported DPO gains are shown to be causal and free of prompt mismatch or label noise, the work would provide a concrete pipeline for within-benchmark self-evolution of tool-calling agents that reuses rollout statistics for both graph construction and preference data. The concentration of gains in specific domains and the diagnostic findings on step budgets offer potentially actionable insights for long-horizon tool use.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (pair construction): the +0.017 absolute gain attributed to DPO on the 161 pairs is load-bearing for the self-evolution claim, yet no ablation isolates the contribution of divergence-point selection versus the ToolGraph wrapper itself, nor verifies that the state-based matching and action-correctness filtering produce pairs whose quality holds under the exact ToolGraph prompt template used at inference.
  2. [Abstract and transition-weight estimation section] Abstract and transition-weight estimation section: transition weights are derived from successful rollouts on the tau2-bench distribution and then used to generate the DPO pairs on the same distribution; this creates a circularity risk where the reported improvement partly reflects re-use of benchmark-derived statistics rather than an independent preference-learning signal.
  3. [Abstract] Abstract: the manuscript states the pairs are “filtered through action-correctness annotations” and trained “under the same ToolGraph context,” but supplies no quantitative check (prompt-token overlap, inter-annotator agreement on correctness labels, or ablation training DPO on the pairs without the ToolGraph wrapper) that would confirm absence of train-deployment mismatch.
minor comments (2)
  1. [Abstract] No error bars or statistical significance tests are reported for the weighted-average reward figures (0.304, 0.338, 0.355).
  2. [Abstract] The abstract mentions “16 evaluated DPO configurations” but does not specify the hyperparameter ranges or which checkpoint signal was most useful beyond the qualitative statement.
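
As an illustration of the kind of uncertainty estimate the first minor comment asks for, the sketch below computes a paired percentile-bootstrap confidence interval for the mean per-task reward difference between two systems. This is one common choice, not a procedure taken from the paper; the function name, its defaults, and the reward vectors are assumptions, and the data shown are synthetic placeholders.

```python
import random


def paired_bootstrap_ci(base, new, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-task reward difference.

    `base` and `new` hold rewards for the same tasks in the same order.
    Returns (observed mean difference, lower bound, upper bound).
    """
    if len(base) != len(new) or not base:
        raise ValueError("need paired, non-empty reward lists")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(base, new)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Synthetic placeholder rewards, not results from the paper.
base_rewards = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
new_rewards = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
mean_diff, lower, upper = paired_bootstrap_ci(base_rewards, new_rewards)
print(mean_diff, lower, upper)
```

Resampling whole tasks rather than individual runs keeps the pairing between systems intact, which matters when both are evaluated on the same task set.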

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for stronger evidence on the causality of DPO gains and controls for train-deployment consistency. We address each major comment below with clarifications and commit to targeted revisions.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (pair construction): the +0.017 absolute gain attributed to DPO on the 161 pairs is load-bearing for the self-evolution claim, yet no ablation isolates the contribution of divergence-point selection versus the ToolGraph wrapper itself, nor verifies that the state-based matching and action-correctness filtering produce pairs whose quality holds under the exact ToolGraph prompt template used at inference.

    Authors: We acknowledge the absence of an explicit ablation isolating divergence-point selection from the ToolGraph wrapper. All reported DPO runs used the identical ToolGraph prompt template at both training and inference to maintain consistency. We will add an ablation that trains DPO on the same 161 pairs but without the ToolGraph context (i.e., standard prompt) in the revised manuscript to quantify the wrapper's contribution. revision: yes

  2. Referee: [Abstract and transition-weight estimation section] Abstract and transition-weight estimation section: transition weights are derived from successful rollouts on the tau2-bench distribution and then used to generate the DPO pairs on the same distribution; this creates a circularity risk where the reported improvement partly reflects re-use of benchmark-derived statistics rather than an independent preference-learning signal.

    Authors: The reuse of successful rollouts for transition weights is intentional in the self-evolution setting, as the graph encodes benchmark-specific topology to guide both inference and pair construction. Divergence-point pairs are formed only where trajectories differ in outcome despite shared prefixes, supplying a preference signal beyond the weights alone. We will expand the transition-weight section with an explicit discussion of this design choice and its implications for within-benchmark improvement. revision: partial

  3. Referee: [Abstract] Abstract: the manuscript states the pairs are “filtered through action-correctness annotations” and trained “under the same ToolGraph context,” but supplies no quantitative check (prompt-token overlap, inter-annotator agreement on correctness labels, or ablation training DPO on the pairs without the ToolGraph wrapper) that would confirm absence of train-deployment mismatch.

    Authors: We agree that quantitative checks on the filtering and context consistency would strengthen the claim. The action-correctness annotations were performed by the authors using the ToolGraph template. In the revision we will report (i) token-overlap statistics between training and inference prompts and (ii) inter-annotator agreement on the correctness labels. revision: yes
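
The token-overlap statistic promised in the last response could take many forms. A minimal sketch, assuming whitespace tokenization and Jaccard similarity, is shown below; a real check would more plausibly use the model's own tokenizer. The function name and the example prompts are hypothetical.

```python
def token_jaccard(train_prompt: str, infer_prompt: str) -> float:
    """Jaccard overlap between the token sets of two prompts.

    Whitespace tokenization is an assumption made for simplicity.
    """
    a, b = set(train_prompt.split()), set(infer_prompt.split())
    if not a and not b:
        return 1.0  # two empty prompts are trivially identical
    return len(a & b) / len(a | b)


# Hypothetical prompts: identical templates give an overlap of 1.0.
template = "You may call: get_user get_order cancel_order"
print(token_jaccard(template, template))
print(token_jaccard("a b c", "b c d"))
```

A value well below 1.0 between the training and inference templates would be a direct sign of the prompt mismatch the referee is concerned about.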

Circularity Check

1 step flagged

Transition weights estimated from benchmark rollouts and DPO pairs derived from them create partial reuse of evaluation data.

specific steps
  1. fitted input called prediction [Abstract]
    "ToolGraph combines schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls for write prerequisites and repeated-search loops. We then construct 161 preference pairs by locating divergence points via state-based matching and prefix-based alignment, filtered through action-correctness annotations, and train DPO under the same ToolGraph context used at inference."

    Transition weights are fitted directly from successful rollouts on the evaluation benchmark distribution; the 161 pairs are then located and filtered from trajectories that use those same weights, so the DPO gain on the identical benchmark partly recycles the fitted statistics rather than deriving an independent improvement.

full rationale

The paper's central improvement chain (ToolGraph topology + rollout-derived transition weights → 161 divergence-point pairs → DPO under ToolGraph context) re-uses successful rollouts from the same tau2-bench distribution both to fit the weights and to construct the preference data. This matches the fitted-input-called-prediction pattern at the level of the reported +0.017 absolute gain, but the schema-derived topology and action-correctness filtering retain independent content, so the circularity is partial rather than total. No load-bearing self-citations or self-definitional reductions are present in the supplied text.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only review; the listed items are the only quantities the abstract explicitly ties to the central performance claim.

free parameters (1)
  • transition weights
    Estimated from successful rollouts on the benchmark; used to weight edges in ToolGraph.


