Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

· 2026 · cs.LG · arXiv 2603.24709

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Multi-step tool orchestration remains challenging for LLMs, as state-of-the-art models frequently fail on full sequence execution due to parameter errors. Training for these workflows faces two obstacles: the lack of environments supporting complex real-world API dependencies, and sparse binary rewards that provide no signal for partial correctness. We propose a reinforcement learning framework addressing both challenges. First, we construct a deterministic environment backed by a large-scale cache of real API responses, enabling constrained synthesis of valid multi-step traces with controllable complexity. Second, we introduce a graduated reward that decomposes correctness into atomic validity (call-level correctness at increasing granularity) and orchestration consistency (correct sequencing with dependency respect). On ComplexFuncBench, our approach substantially improves turn accuracy, with ablations confirming both reward components are essential. Cross-benchmark evaluation on BFCL v4 shows that the learned orchestration skills transfer to entirely different API ecosystems (e.g., agentic web search and memory management), yielding consistent gains while maintaining stable single-step performance. Code is available at https://github.com/horizon-rl/ToolOrchestrationReward

representative citing papers

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

PROVE trains LLMs on multi-step tool calls using 20 live MCP servers with 343 tools, state-grounded synthesis, and adaptive efficiency rewards, delivering gains of up to 10.2 points on BFCL Multi-Turn and similar on other benchmarks.

citing papers explorer

Showing 1 of 1 citing paper.

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments cs.CL · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
PROVE trains LLMs on multi-step tool calls using 20 live MCP servers with 343 tools, state-grounded synthesis, and adaptive efficiency rewards, delivering gains of up to 10.2 points on BFCL Multi-Turn and similar on other benchmarks.

Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

fields

years

verdicts

representative citing papers

citing papers explorer