pith. machine review for the scientific record.

arxiv: 2603.24709 · v2 · submitted 2026-03-25 · 💻 cs.LG · cs.CL

Recognition: no theorem link

Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

Bing Yin, Chao Zhang, Cheng Jiayang, Haoyang Wen, Priyanka Nigam, Qingyu Yin, Shiyang Li, Xin Liu, Yangqiu Song, Zhihan Zhang, Zixuan Zhang

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:11 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM tool use · multi-step orchestration · reinforcement learning · graduated rewards · constrained data synthesis · ComplexFuncBench · BFCL

The pith

A reinforcement learning setup using cached real API responses and graduated rewards lets LLMs execute multi-step tool sequences more accurately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard LLMs often fail on full sequences of tool calls because parameter mistakes receive only binary feedback that gives no guidance on partial progress. The paper builds a deterministic environment from a large cache of real API responses so that valid multi-step traces can be synthesized at controlled levels of complexity. Inside this environment it trains with a reward that separately scores atomic validity of each call at several granularities and overall orchestration consistency that respects call dependencies. The resulting models raise turn accuracy on ComplexFuncBench and the gains transfer to the unrelated API ecosystem of BFCL v4 while single-step performance stays stable.

Core claim

Training LLMs for multi-step tool orchestration inside a deterministic environment backed by cached real API responses, with a graduated reward that decomposes correctness into atomic validity and orchestration consistency, substantially improves turn accuracy on ComplexFuncBench; ablations show both reward components are required, and the learned skills transfer to BFCL v4 with consistent gains while single-step performance is preserved.

What carries the argument

The graduated reward that scores atomic validity of individual calls at increasing levels of granularity together with orchestration consistency that enforces correct sequencing and dependency respect.
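
As a hedged illustration of this decomposition, a minimal sketch in Python: the three granularity levels (function name, parameter names, parameter values), the equal level weights, and the 50/50 component weighting are assumptions for exposition, not the paper's reported choices.

    # Illustrative sketch of a graduated reward of the kind described above.
    # Granularity levels, weights, and the call representation are assumptions,
    # not the paper's exact definitions.
    from dataclasses import dataclass, field

    @dataclass
    class ToolCall:
        name: str                                     # function the model called
        params: dict                                  # arguments it supplied
        depends_on: set = field(default_factory=set)  # indices of prerequisite calls

    def atomic_validity(call: ToolCall, ref: ToolCall) -> float:
        """Score one call at increasing granularity: name, parameter names, values."""
        score = 0.0
        if call.name == ref.name:
            score += 1 / 3                            # level 1: right tool
            if set(call.params) == set(ref.params):
                score += 1 / 3                        # level 2: right parameter names
                if call.params == ref.params:
                    score += 1 / 3                    # level 3: right parameter values
        return score

    def orchestration_consistency(calls: list) -> float:
        """Fraction of calls whose declared dependencies appear earlier in the trace."""
        if not calls:
            return 0.0
        ok = sum(1 for i, c in enumerate(calls) if all(d < i for d in c.depends_on))
        return ok / len(calls)

    def graduated_reward(calls, refs, w_atomic=0.5, w_orch=0.5) -> float:
        atomic = sum(atomic_validity(c, r) for c, r in zip(calls, refs)) / max(len(refs), 1)
        return w_atomic * atomic + w_orch * orchestration_consistency(calls)

Unlike a binary pass/fail signal, a call that names the right tool with the right parameter slots but a wrong value still earns partial credit, which is exactly the denser feedback the pith attributes to the method.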

Load-bearing premise

The large-scale cache of real API responses is sufficient to create valid multi-step traces that capture the dependency structure and error patterns of live, dynamic APIs.
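
A minimal sketch of what a cache-backed deterministic environment could look like, assuming responses are keyed by function name plus a canonical JSON encoding of the arguments; the keying scheme and the error behavior are assumptions, not taken from the paper.

    import json

    class CachedToolEnv:
        """Deterministic tool environment: every call is answered from a fixed
        cache of previously recorded real API responses."""

        def __init__(self, cache: dict):
            # cache maps (function_name, canonical_args_json) -> recorded response
            self._cache = cache

        @staticmethod
        def _key(name: str, params: dict) -> tuple:
            return (name, json.dumps(params, sort_keys=True))

        def call(self, name: str, params: dict) -> dict:
            key = self._key(name, params)
            if key not in self._cache:
                # A static cache cannot answer novel queries; it collapses them
                # to one deterministic error instead of a live API's varied failures.
                return {"error": f"no cached response for {name}"}
            return self._cache[key]

    env = CachedToolEnv({
        ("search_hotels", json.dumps({"city": "Paris"}, sort_keys=True)):
            {"hotels": ["Hotel A", "Hotel B"]},
    })
    print(env.call("search_hotels", {"city": "Paris"}))  # cache hit
    print(env.call("search_hotels", {"city": "Lyon"}))   # deterministic miss

The deterministic-miss branch is also where this premise can fail: whatever live APIs do outside the cached support, the surrogate reduces to a single canned error.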

What would settle it

If models trained this way show no accuracy gain or lose stability when evaluated on a fresh set of APIs whose response distributions are not covered by the cache, the claim that the cached environment produces transferable orchestration skills would be falsified.

Figures

Figures reproduced from arXiv: 2603.24709 by Bing Yin, Chao Zhang, Cheng Jiayang, Haoyang Wen, Priyanka Nigam, Qingyu Yin, Shiyang Li, Xin Liu, Yangqiu Song, Zhihan Zhang, Zixuan Zhang.

Figure 1: Framework overview using hotel booking as a running example.
Figure 2: Training dynamics under different reward configurations.
Figure 3: Turn accuracy (%) stratified by (a) dependency depth and (b) dependency pattern.
Figure 4: Breakdown of R_atomic into AST validation (static) and semantic validation (execution). Training on R_orch alone maintains R_AST while R_sem collapses, indicating syntactically valid but semantically broken calls.
Figure 5: Tool-calling success metrics during training.
Figure 6: Training vs. validation metrics for the Combined model. All three metrics (…)
Original abstract

Multi-step tool orchestration remains challenging for LLMs, as state-of-the-art models frequently fail on full sequence execution due to parameter errors. Training for these workflows faces two obstacles: the lack of environments supporting complex real-world API dependencies, and sparse binary rewards that provide no signal for partial correctness. We propose a reinforcement learning framework addressing both challenges. First, we construct a deterministic environment backed by a large-scale cache of real API responses, enabling constrained synthesis of valid multi-step traces with controllable complexity. Second, we introduce a graduated reward that decomposes correctness into atomic validity (call-level correctness at increasing granularity) and orchestration consistency (correct sequencing with dependency respect). On ComplexFuncBench, our approach substantially improves turn accuracy, with ablations confirming both reward components are essential. Cross-benchmark evaluation on BFCL v4 shows that the learned orchestration skills transfer to entirely different API ecosystems (e.g., agentic web search and memory management), yielding consistent gains while maintaining stable single-step performance. Code is available at https://github.com/horizon-rl/ToolOrchestrationReward

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a reinforcement learning framework for training LLMs on multi-step tool orchestration. It constructs a deterministic environment backed by a large-scale cache of real API responses to enable constrained synthesis of valid multi-step traces with controllable complexity, and introduces a graduated reward decomposing correctness into atomic validity (call-level at increasing granularity) and orchestration consistency (sequencing with dependency respect). On ComplexFuncBench the method yields substantial turn accuracy gains with ablations confirming both reward components are essential; cross-benchmark evaluation on BFCL v4 shows transfer of orchestration skills to different API ecosystems while preserving single-step performance. Code is released.

Significance. If the results hold, the work offers a concrete path to denser reward signals and scalable environment construction for training reliable multi-step tool-use agents, addressing two persistent obstacles in LLM agent research. The explicit decomposition of the reward and the demonstration of cross-ecosystem transfer are potentially valuable contributions; the code release further strengthens the paper by enabling direct reproduction and extension.

major comments (1)
  1. [Environment Construction] Environment Construction section: the central transfer claim on BFCL v4 rests on the assumption that the static cache reproduces live API error patterns, transient failures, rate-limit responses, and state changes closely enough for the graduated reward to produce generalizable policies. No quantitative coverage metrics (fraction of observed live error types reproduced, distributional similarity between cached and live traces) are reported, leaving open the possibility that the deterministic surrogate is easier than real dynamic APIs and that the reported gains are artifacts of this mismatch.
minor comments (2)
  1. [Abstract] Abstract: reports 'substantial improvements' and 'consistent gains' without numerical values, baseline comparisons, or error bars, which makes the magnitude of the claimed advances hard to gauge on a first reading.
  2. [Graduated Reward] Reward definition: the exact weighting scheme balancing atomic validity and orchestration consistency, as well as the precise granularity levels used in the atomic component, should be stated explicitly (ideally with pseudocode or equations) to support replication.
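
For the weighting question in minor comment 2, one plausible explicit form, written here as a sketch: λ, the number of levels L, and the level weights w_ℓ are placeholders, not values the paper reports.

    % Hypothetical form of the graduated reward; \lambda, L, and w_\ell
    % are illustrative placeholders, not the paper's reported values.
    R(\tau) = \lambda\, R_{\mathrm{atomic}}(\tau) + (1 - \lambda)\, R_{\mathrm{orch}}(\tau),
    \qquad
    R_{\mathrm{atomic}}(\tau) = \frac{1}{|\tau|} \sum_{c \in \tau} \sum_{\ell=1}^{L}
        w_\ell \, \mathbf{1}\big[\text{call } c \text{ is valid at granularity level } \ell\big]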

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the presentation of our environment construction and transfer claims. We respond to the major comment below.

point-by-point responses
  1. Referee: [Environment Construction] Environment Construction section: the central transfer claim on BFCL v4 rests on the assumption that the static cache reproduces live API error patterns, transient failures, rate-limit responses, and state changes closely enough for the graduated reward to produce generalizable policies. No quantitative coverage metrics (fraction of observed live error types reproduced, distributional similarity between cached and live traces) are reported, leaving open the possibility that the deterministic surrogate is easier than real dynamic APIs and that the reported gains are artifacts of this mismatch.

    Authors: We agree that quantitative coverage metrics would strengthen the manuscript. The cache was populated via systematic real API executions across diverse parameter spaces and conditions to capture authentic error patterns, rate-limit behaviors, and state-dependent responses. In the revision we will expand the Environment Construction section with explicit metrics, including the fraction of live error types reproduced in the cache and distributional similarity measures (such as KL divergence or embedding-based comparisons) between cached and live traces, computed from additional validation sampling. These additions will directly support the cross-benchmark transfer results on BFCL v4. We maintain that the deterministic surrogate does not artificially inflate performance, because the graduated reward is computed against the exact cached real responses and the ablations isolate the contribution of the reward components. revision: yes
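
A sketch of the two coverage metrics the response promises, under the assumption that live and cached behavior can be reduced to labeled error types and categorical response-type distributions; both are simplifications for illustration.

    import math
    from collections import Counter

    def error_type_coverage(live_errors: set, cached_errors: set) -> float:
        """Fraction of error types observed against live APIs that the cache reproduces."""
        if not live_errors:
            return 1.0
        return len(live_errors & cached_errors) / len(live_errors)

    def kl_divergence(live_types: list, cached_types: list, eps: float = 1e-9) -> float:
        """KL(live || cached) over categorical response types, with smoothing so
        that types absent from the cache do not divide by zero."""
        p, q = Counter(live_types), Counter(cached_types)
        n_p, n_q = len(live_types), len(cached_types)
        return sum((p[t] / n_p) * math.log((p[t] / n_p) / max(q[t] / n_q, eps))
                   for t in p)

A coverage fraction near 1 and a small divergence would support the rebuttal; a large divergence would favor the referee's surrogate-mismatch reading.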

Circularity Check

0 steps flagged

No circularity: empirical gains on held-out benchmarks with independent evaluation

full rationale

The paper's central claims are measured improvements in turn accuracy on ComplexFuncBench and transfer gains on BFCL v4, supported by ablations on reward components. The method uses a cached deterministic environment for trace synthesis and a decomposed graduated reward, but no equations or derivations reduce the reported performance metrics to quantities defined by the method's own fitted parameters or self-referential definitions. Results are externally benchmarked and falsifiable, with no load-bearing self-citations or ansatz smuggling identified in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the domain assumption that a static cache of API responses can stand in for live execution while preserving dependency structure; no new entities are postulated, and the reward components are defined directly from the problem rather than as fitted constants.

free parameters (1)
  • weights balancing atomic validity and orchestration consistency
    The graduated reward is described as a decomposition; any scaling between the two components is a tunable hyperparameter not specified in the abstract.
axioms (1)
  • domain assumption: Cached API responses form a deterministic environment that supports controllable multi-step traces with realistic dependencies
    Invoked to enable constrained synthesis without live API calls.

pith-pipeline@v0.9.0 · 5525 in / 1306 out tokens · 45913 ms · 2026-05-15T00:11:40.261353+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1]

    NESTFUL: A benchmark for evaluating LLMs on nested sequences of API calls

    Kinjal Basu, Ibrahim Abdelaziz, Kiran Kate, Mayank Agarwal, Maxwell Crouse, Yara Rizk, Kelsey Bradford, Asim Munawar, Sadhana Kumaravel, Saurabh Goyal, et al. NESTFUL: A benchmark for evaluating LLMs on nested sequences of API calls. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 33526–33535.

  2. [2]

    PARL-MT: Learning to call functions in multi-turn conversation with progress awareness

    Huacan Chai, Zijie Cao, Maolin Ran, Yingxuan Yang, Jianghao Lin, Xin Peng, Hairui Wang, Renjie Ding, Ziyu Wan, Muning Wen, et al. PARL-MT: Learning to call functions in multi-turn conversation with progress awareness. arXiv preprint arXiv:2509.23206.

  3. [3]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. ReTool: Reinforcement learning for strategic tool use in LLMs. arXiv preprint arXiv:2504.11536.

  4. [4]

    StableToolBench: Towards stable large-scale benchmarking on tool learning of large language models

    Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. StableToolBench: Towards stable large-scale benchmarking on tool learning of large language models. In Findings of the Association for Computational Linguistics ACL 2024, pp. 11143–11156.

  5. [5]

    APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay

    Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, et al. APIGen-MT: Agentic pipeline for multi-turn data generation via simulated agent-human interplay. arXiv preprint arXiv:2504.03601.

  6. [6]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. ToolRL: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958.

  7. [7]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  8. [8]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.

  9. [9]

    ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases

    Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301.

  10. [10]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073.

  11. [11]

    Seal-Tools: Self-instruct tool learning dataset for agent tuning and detailed benchmark

    Mengsong Wu, Tong Zhu, Han Han, Chuanyuan Tan, Xiang Zhang, and Wenliang Chen. Seal-Tools: Self-instruct tool learning dataset for agent tuning and detailed benchmark. arXiv preprint arXiv:2405.08355.

  12. [12]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  13. [13]

    Qwen2.5 Technical Report

    Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin,...

  14. [14]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045. URL: https://api.semanticscholar.org/CorpusID:274859421.

  15. [15]

    ComplexFuncBench: Exploring multi-step and constrained function calling under long-context scenario

    Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, and Jie Tang. ComplexFuncBench: Exploring multi-step and constrained function calling under long-context scenario. arXiv preprint arXiv:2501.10132.

  16. [16]

    Generate user query

    A. Data Synthesis Algorithm — Algorithm 1: Constrained Data Synthesis Pipeline. Input: workflow templates T, cache C, generator LLM M. Output: synthetic dataset D_syn. Build inverted index I: (f, param, val) → {cache_ids}. For each template T = (f_1, …, f_n) ∈ T: initialize an empty trace τ = []; for step t = 1 to n: if step t has dependencies on previous outputs, then extract required v…
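
Read as pseudocode, the fragment above outlines the constrained synthesis pipeline. A minimal runnable rendering follows; the cache entry format, the inverted-index use, and the dependency-resolution step (which the extraction truncates at "extract required v…") are reconstructed assumptions, not the authors' exact pipeline.

    from collections import defaultdict

    def build_inverted_index(cache: dict) -> dict:
        """I: (function, param, value) -> set of cache entry ids."""
        index = defaultdict(set)
        for cid, entry in cache.items():
            for param, val in entry["params"].items():
                index[(entry["fn"], param, val)].add(cid)
        return index

    def synthesize_traces(templates: list, cache: dict) -> list:
        """For each template (an ordered sequence of function names), assemble a
        multi-step trace from cached calls. Reusing earlier outputs as parameter
        values stands in for the truncated dependency-extraction step; the
        algorithm's 'Generate user query' step via the generator LLM M is omitted."""
        index = build_inverted_index(cache)
        dataset = []
        for template in templates:
            trace = []
            for t, fn in enumerate(template):
                prior = {v for step in trace for v in step["output"].values()}
                candidates = sorted(
                    cid
                    for (f, p, v), ids in index.items()
                    if f == fn and (t == 0 or v in prior)
                    for cid in ids
                )
                if not candidates:
                    break                      # template not satisfiable from cache
                entry = cache[candidates[0]]   # deterministic choice keeps the env fixed
                trace.append({"fn": fn, "params": entry["params"],
                              "output": entry["response"]})
            else:
                dataset.append(trace)
        return dataset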