pith. machine review for the scientific record.

arxiv: 2504.13958 · v1 · submitted 2025-04-16 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Dilek Hakkani-Tür, Emre Can Acikgoz, Gokhan Tur, Heng Ji, Hongru Wang, Qi He, Xiusi Chen

Pith reviewed 2026-05-14 00:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords tool learning · reward design · reinforcement learning · large language models · tool use · generalization · GRPO

The pith

A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines reward design inside reinforcement learning for teaching large language models to select and apply tools. Supervised fine-tuning often fails to generalize when tools are new or tasks grow complex, while coarse answer-matching rewards give too little guidance for learning. The authors test many reward types, scales, granularities, and timings, then introduce a tailored reward that supplies fine-grained signals on tool choice and parameter use. They train models with Group Relative Policy Optimization and report a 17 percent gain over base models and a 15 percent gain over SFT models across multiple benchmarks. A reader would care because clearer rewards could turn LLMs into more reliable agents that handle real tool sequences without constant human correction.

Core claim

Through exhaustive exploration of reward strategies for tool selection and application tasks, the authors derive a principled reward design and apply it within Group Relative Policy Optimization training. The resulting models exhibit robust and stable learning that improves tool-use performance, delivering a 17 percent improvement over base models and a 15 percent improvement over supervised fine-tuning models across diverse benchmarks.

What carries the argument

Group Relative Policy Optimization driven by a reward signal that supplies fine-grained feedback on tool selection, parameter correctness, and task outcome.
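
To make the mechanism concrete, the following is a minimal sketch of the group-relative advantage step that gives GRPO its name: several responses are sampled for one prompt, each is scored by the reward, and every reward is normalized against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, the `eps` stabilizer, and the example rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation plus a small eps to
    avoid division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to the same prompt.
group = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(group)

for reward, advantage in zip(group, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so a reward that separates partially correct tool calls from fully wrong ones yields a more informative training signal than a reward that scores them identically.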

If this is right

  • Reward design determines whether RL training succeeds or fails at tool-use generalization.
  • Fine-grained rewards on tool choice and parameters enable stable scaling beyond what SFT achieves.
  • RL training with the proposed rewards produces more reliable handling of unfamiliar tool sequences.
  • Insights on reward granularity and timing transfer directly to other multi-tool agent settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward principles could be tested on longer-horizon tasks such as multi-step web navigation or code repository editing.
  • Models trained this way might integrate new third-party APIs with less additional data than SFT requires.
  • Production systems could shift from SFT to RL pipelines once reward templates for common tool categories are standardized.

Load-bearing premise

The reward strategies and design principles identified in the study will continue to work for tool-use scenarios that differ from the specific benchmarks and tool sets tested.

What would settle it

Retrain the same models on a new benchmark containing previously unseen tools and tool APIs; if the performance advantage over SFT models shrinks to zero or reverses, the central claim fails.

Original abstract

Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All the codes are released to facilitate future research.
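
The contrast the abstract draws between coarse answer matching and fine-grained feedback can be illustrated with a small sketch. This is not the paper's reward: the functional form, the 0.5/0.5 split between tool-name and parameter credit, the position-wise alignment of calls, and the `get_weather` example are all hypothetical choices made only for illustration.

```python
def coarse_reward(predicted_calls, gold_calls):
    """All-or-nothing signal: 1.0 only when every call matches exactly."""
    return 1.0 if predicted_calls == gold_calls else 0.0


def fine_grained_reward(predicted_calls, gold_calls):
    """Partial credit per call (hypothetical weighting).

    Each position-aligned call earns 0.5 for the correct tool name plus
    up to 0.5 for the fraction of gold parameters reproduced exactly.
    Missing or extra calls lower the score through the normalizer.
    """
    n = max(len(predicted_calls), len(gold_calls))
    if n == 0:
        return 1.0  # nothing expected, nothing called
    total = 0.0
    for pred, gold in zip(predicted_calls, gold_calls):
        if pred["name"] != gold["name"]:
            continue  # wrong tool: no credit for this call
        total += 0.5
        gold_params = gold["parameters"]
        if gold_params:
            hits = sum(
                1 for key, value in gold_params.items()
                if pred["parameters"].get(key) == value
            )
            total += 0.5 * hits / len(gold_params)
        elif not pred["parameters"]:
            total += 0.5  # no parameters expected and none supplied
    return total / n


# Hypothetical example: right tool, one of two parameters wrong.
gold = [{"name": "get_weather",
         "parameters": {"city": "Paris", "unit": "celsius"}}]
pred = [{"name": "get_weather",
         "parameters": {"city": "Paris", "unit": "fahrenheit"}}]

print(coarse_reward(pred, gold))        # 0.0
print(fine_grained_reward(pred, gold))  # 0.75
```

Under the coarse signal the near-miss is indistinguishable from a call to the wrong tool, while the graded signal still credits the correct tool choice and the one correct parameter.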

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ToolRL, a reinforcement learning approach using Group Relative Policy Optimization (GRPO) to train LLMs for tool selection and application. It systematically explores reward strategies differing in type, scale, granularity, and temporal dynamics, then proposes a principled reward design claimed to address the limitations of coarse-grained signals such as answer matching. Empirical results on diverse benchmarks are reported to show a 17% improvement over base models and a 15% gain over SFT models, with code released for reproducibility.

Significance. If the reward formulation can be reproduced and the gains are shown to stem from the proposed design rather than training dynamics or benchmark tuning, the work would provide a useful empirical contribution to RL-based tool learning. The emphasis on fine-grained feedback for multi-tool calls and the code release are strengths that could support follow-up studies.

major comments (2)
  1. [Abstract] Abstract and Methods: No explicit equations, weighting scheme, or functional form is provided for the proposed principled reward (e.g., how tool-selection accuracy, parameter correctness, and answer match are combined or scaled). Without this, the attribution of the 17% and 15% gains to the reward design rather than GRPO or other factors cannot be verified or ablated.
  2. [Experiments] Experimental section: The manuscript does not report whether baselines were matched for compute budget, whether tool sets and invocation formats were held constant across comparisons, or whether the reported percentage gains survive multiple-testing correction. These details are load-bearing for the central empirical claim.
minor comments (1)
  1. Notation for reward components and temporal dynamics should be defined consistently with standard RL terminology to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the paper to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Methods: No explicit equations, weighting scheme, or functional form is provided for the proposed principled reward (e.g., how tool-selection accuracy, parameter correctness, and answer match are combined or scaled). Without this, the attribution of the 17% and 15% gains to the reward design rather than GRPO or other factors cannot be verified or ablated.

    Authors: We agree that the explicit functional form was not provided. In the revised manuscript we add the full equations for the principled reward, including the precise weighting scheme and scaling that combine tool-selection accuracy, parameter correctness, and answer match. This enables direct verification and ablation of the reported gains. revision: yes

  2. Referee: [Experiments] Experimental section: The manuscript does not report whether baselines were matched for compute budget, whether tool sets and invocation formats were held constant across comparisons, or whether the reported percentage gains survive multiple-testing correction. These details are load-bearing for the central empirical claim.

    Authors: We acknowledge these details were omitted. The revised experimental section now explicitly states that all baselines used matched compute budgets, identical tool sets and invocation formats, and that the percentage gains remain significant after Bonferroni correction for multiple testing. revision: yes
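
For readers unfamiliar with the multiple-testing correction mentioned in this exchange, here is a minimal sketch of a Bonferroni check: with m comparisons at family-wise level alpha, each individual p-value is compared against alpha / m. The p-values below are invented for illustration and are not results from the paper or the rebuttal.

```python
def bonferroni(p_values, alpha=0.05):
    """Return, per test, whether it stays significant after Bonferroni.

    Each p-value is compared against alpha divided by the number of tests.
    """
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values for three benchmark comparisons.
p_values = [0.001, 0.02, 0.04]
print(bonferroni(p_values))  # [True, False, False]
```

All three hypothetical comparisons would pass an uncorrected 0.05 threshold, but only the first survives the corrected threshold of about 0.0167, which is why the referee treats the correction as load-bearing for a claim spanning several benchmarks.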

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out benchmarks

Full rationale

The paper reports measured performance improvements (17% over base, 15% over SFT) from applying a proposed reward design via GRPO on diverse benchmarks. No derivation chain, equations, or self-citation reduces these results to the reward definition by construction; the outcomes are external empirical observations rather than tautological. The reward formulation gap noted by the skeptic is a reproducibility issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions (Markov decision process, policy gradient validity) plus the empirical observation that the chosen reward formulation produces stable training; no new physical or mathematical entities are postulated.

axioms (1)
  • Domain assumption: Group Relative Policy Optimization produces stable policy updates for the tool-use MDP.
    Invoked when the authors apply GRPO to the tool-calling environment without additional proof.

pith-pipeline@v0.9.0 · 5552 in / 1258 out tokens · 26291 ms · 2026-05-14T00:21:51.946869+00:00 · methodology



Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

  2. Tools as Continuous Flow for Evolving Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    FlowAgent models tool chaining as continuous latent trajectory generation with conditional flow matching to deliver global planning, formal utility bounds, and better robustness on long-horizon tasks, plus a new plan-...

  3. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

  4. Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

    cs.CL 2026-04 unverdicted novelty 7.0

    DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.

  5. R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling

    cs.LG 2026-04 unverdicted novelty 7.0

    R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.

  6. Evaluating Plan Compliance in Autonomous Programming Agents

    cs.SE 2026-04 unverdicted novelty 7.0

    Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...

  7. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  8. PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

    cs.AI 2026-04 unverdicted novelty 7.0

    PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

  9. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  10. PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.

  11. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

  12. SOD: Step-wise On-policy Distillation for Small Language Model Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

  13. RVPO: Risk-Sensitive Alignment via Variance Regularization

    cs.LG 2026-05 unverdicted novelty 6.0

    RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.

  14. To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs often misalign their self-perceived need for tools with true need and utility, but lightweight estimators trained on hidden states can improve tool-calling decisions and task performance across multiple models and tasks.

  15. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.

  16. See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

    cs.CV 2026-04 unverdicted novelty 6.0

    ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

  17. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  18. AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    AutoSearch applies RL with a self-answering reward to adaptively determine minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.

  19. The Cartesian Cut in Agentic AI

    cs.AI 2026-04 unverdicted novelty 5.0

    LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.

  20. Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents

    cs.MA 2026-05 unverdicted novelty 4.0

    The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.

  21. Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA

    cs.IR 2026-04 unverdicted novelty 3.0

    A pipeline of dataset construction from prior work, AugFC parameter augmentation, and two-step LLM training improves function calling for financial APIs and is running in production.
