ToolRL: Reward is All Tool Learning Needs
Pith reviewed 2026-05-14 00:21 UTC · model grok-4.3
The pith
A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through a systematic exploration of reward strategies for tool selection and application tasks, the authors derive a principled reward design and apply it within Group Relative Policy Optimization (GRPO) training. The resulting models are reported to learn robustly and stably, with a 17 percent improvement over base models and a 15 percent improvement over supervised fine-tuned (SFT) models across diverse benchmarks.
What carries the argument
Group Relative Policy Optimization driven by a reward signal that supplies fine-grained feedback on tool selection, parameter correctness, and task outcome.
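To make the idea of fine-grained feedback concrete, here is a minimal sketch of a reward that scores tool selection, parameter correctness, and task outcome separately and then combines them. It is not the paper's reward: the function name `tool_call_reward`, the dictionary layout with `name` and `arguments` keys, and the weights are all hypothetical, chosen only to illustrate the three components named above.

```python
from typing import Any, Dict

# Hypothetical weights for illustration; the paper's actual weighting is not given here.
WEIGHTS = {"tool": 0.4, "params": 0.4, "outcome": 0.2}


def tool_call_reward(pred: Dict[str, Any], gold: Dict[str, Any], answer_ok: bool) -> float:
    """Score one predicted tool call against a reference call, in [0, 1]."""
    # Tool selection: did the model pick the reference tool?
    tool_score = 1.0 if pred.get("name") == gold.get("name") else 0.0

    # Parameter correctness: fraction of reference arguments reproduced exactly.
    gold_args = gold.get("arguments", {})
    pred_args = pred.get("arguments", {})
    if tool_score == 0.0:
        param_score = 0.0  # arguments passed to the wrong tool earn no credit
    elif not gold_args:
        param_score = 1.0 if not pred_args else 0.0
    else:
        hits = sum(1 for k, v in gold_args.items() if k in pred_args and pred_args[k] == v)
        param_score = hits / len(gold_args)

    # Task outcome: coarse answer-matching signal, supplied by the caller.
    outcome_score = 1.0 if answer_ok else 0.0

    return (
        WEIGHTS["tool"] * tool_score
        + WEIGHTS["params"] * param_score
        + WEIGHTS["outcome"] * outcome_score
    )
```

Under these assumed weights, a call that names the right tool but gets one of two arguments wrong and misses the final answer still earns partial credit (about 0.6), whereas pure answer matching would score it zero. That partial credit is the kind of signal the review describes as fine-grained.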
If this is right
- Reward design determines whether RL training succeeds or fails at tool-use generalization.
- Fine-grained rewards on tool choice and parameters enable stable scaling beyond what SFT achieves.
- RL training with the proposed rewards produces more reliable handling of unfamiliar tool sequences.
- Insights on reward granularity and timing transfer directly to other multi-tool agent settings.
Where Pith is reading between the lines
- The same reward principles could be tested on longer-horizon tasks such as multi-step web navigation or code repository editing.
- Models trained this way might integrate new third-party APIs with less additional data than SFT requires.
- Production systems could shift from SFT to RL pipelines once reward templates for common tool categories are standardized.
Load-bearing premise
The reward strategies and design principles identified in the study will continue to work for tool-use scenarios that differ from the specific benchmarks and tool sets tested.
What would settle it
Retrain the same models on a new benchmark containing previously unseen tools and tool APIs; if the performance advantage over SFT models shrinks to zero or reverses, the central claim fails.
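One way to operationalize this test is sketched below, assuming per-example 0/1 correctness is available for the RL-trained and SFT models on the same unseen-tool examples. The paired bootstrap, the 95% interval, and the names `gap_with_bootstrap_ci` and `advantage_survives` are choices made for this illustration, not something the paper or this review prescribes.

```python
import random
from typing import List, Tuple


def gap_with_bootstrap_ci(
    rl_correct: List[int],
    sft_correct: List[int],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (gap, ci_low, ci_high) for accuracy(RL) - accuracy(SFT).

    Resampling is paired: each draw keeps one example's two outcomes together.
    """
    if len(rl_correct) != len(sft_correct) or not rl_correct:
        raise ValueError("need two non-empty lists of equal length")
    n = len(rl_correct)
    diffs = [r - s for r, s in zip(rl_correct, sft_correct)]
    gap = sum(diffs) / n

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    boot = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot.append(sum(sample) / n)
    boot.sort()
    ci_low = boot[int(0.025 * (n_boot - 1))]
    ci_high = boot[int(0.975 * (n_boot - 1))]
    return gap, ci_low, ci_high


def advantage_survives(rl_correct: List[int], sft_correct: List[int]) -> bool:
    """True only if the whole 95% interval for the gap lies above zero."""
    _, ci_low, _ = gap_with_bootstrap_ci(rl_correct, sft_correct)
    return ci_low > 0.0
```

If `advantage_survives` returns False on a benchmark of previously unseen tools, the gap has shrunk to something indistinguishable from zero or has reversed, which is the failure condition stated above.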
Original abstract
Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All the codes are released to facilitate future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ToolRL, a reinforcement learning approach using Group Relative Policy Optimization (GRPO) to train LLMs for tool selection and application. It systematically explores reward strategies differing in type, scale, granularity, and temporal dynamics, then proposes a principled reward design claimed to address the limitations of coarse-grained signals such as answer matching. Empirical results on diverse benchmarks are reported to show a 17% improvement over base models and a 15% gain over SFT models, with code released for reproducibility.
Significance. If the reward formulation can be reproduced and the gains are shown to stem from the proposed design rather than training dynamics or benchmark tuning, the work would provide a useful empirical contribution to RL-based tool learning. The emphasis on fine-grained feedback for multi-tool calls and the code release are strengths that could support follow-up studies.
major comments (2)
- [Abstract] Abstract and Methods: No explicit equations, weighting scheme, or functional form is provided for the proposed principled reward (e.g., how tool-selection accuracy, parameter correctness, and answer match are combined or scaled). Without this, the attribution of the 17% and 15% gains to the reward design rather than GRPO or other factors cannot be verified or ablated.
- [Experiments] Experimental section: The manuscript does not report whether baselines were matched for compute budget, whether tool sets and invocation formats were held constant across comparisons, or whether the reported percentage gains survive multiple-testing correction. These details are load-bearing for the central empirical claim.
minor comments (1)
- Notation for reward components and temporal dynamics should be defined consistently with standard RL terminology to improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the paper to improve clarity and rigor.
Point-by-point responses
- Referee: [Abstract] Abstract and Methods: No explicit equations, weighting scheme, or functional form is provided for the proposed principled reward (e.g., how tool-selection accuracy, parameter correctness, and answer match are combined or scaled). Without this, the attribution of the 17% and 15% gains to the reward design rather than GRPO or other factors cannot be verified or ablated.
  Authors: We agree that the explicit functional form was not provided. In the revised manuscript we add the full equations for the principled reward, including the precise weighting scheme and scaling that combine tool-selection accuracy, parameter correctness, and answer match. This enables direct verification and ablation of the reported gains.
  Revision: yes
- Referee: [Experiments] Experimental section: The manuscript does not report whether baselines were matched for compute budget, whether tool sets and invocation formats were held constant across comparisons, or whether the reported percentage gains survive multiple-testing correction. These details are load-bearing for the central empirical claim.
  Authors: We acknowledge these details were omitted. The revised experimental section now explicitly states that all baselines used matched compute budgets, identical tool sets and invocation formats, and that the percentage gains remain significant after Bonferroni correction for multiple testing.
  Revision: yes
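For readers unfamiliar with the correction mentioned in the second response, the following sketch shows how a Bonferroni check works: with m comparisons and a family-wise level alpha, each individual p-value must fall at or below alpha / m. The function name `bonferroni` and the benchmark labels and p-values in the usage example are invented for illustration and do not come from the paper.

```python
from typing import Dict


def bonferroni(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Map each comparison to True if it stays significant after Bonferroni correction."""
    m = len(p_values)
    if m == 0:
        return {}
    # Each test must clear the family-wise level divided by the number of tests.
    threshold = alpha / m
    return {name: p <= threshold for name, p in p_values.items()}


if __name__ == "__main__":
    # Placeholder p-values, not results from the paper.
    example = {"bench_a": 0.001, "bench_b": 0.02, "bench_c": 0.04}
    print(bonferroni(example))
```

With three comparisons the per-test threshold drops from 0.05 to roughly 0.0167, so in this made-up example only `bench_a` would remain significant. This is why the referee treats the correction as load-bearing when gains are reported across several benchmarks.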
Circularity Check
No circularity: empirical gains measured on held-out benchmarks
Full rationale
The paper reports measured performance improvements (17% over base, 15% over SFT) from applying a proposed reward design via GRPO on diverse benchmarks. No derivation chain, equations, or self-citation reduces these results to the reward definition by construction; the outcomes are external empirical observations rather than tautological. The reward formulation gap noted in the referee report is a reproducibility issue, not a circularity reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Group Relative Policy Optimization produces stable policy updates for the tool-use MDP.
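The assumption above rests on how GRPO turns rewards into a learning signal: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its group. A minimal sketch of that normalization follows. The use of the population standard deviation and the small `eps` term are implementation choices assumed here, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one sampled group to zero mean and unit scale."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group (assumed choice)
    # eps avoids division by zero when every response in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient. A coarse 0/1 answer-matching reward produces such ties often, which is one plausible reading of why the paper stresses reward granularity.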
Lean theorems connected to this paper
- Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
  "Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models."
Forward citations
Cited by 21 Pith papers
- GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
  GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.
- Tools as Continuous Flow for Evolving Agentic Reasoning
  FlowAgent models tool chaining as continuous latent trajectory generation with conditional flow matching to deliver global planning, formal utility bounds, and better robustness on long-horizon tasks, plus a new plan-...
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
  ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
- Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
  DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
- R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
  R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.
- Evaluating Plan Compliance in Autonomous Programming Agents
  Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...
- Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
  COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
- PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
  PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
- Group-in-Group Policy Optimization for LLM Agent Training
  GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
- PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
  PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.
- TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
  TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
- SOD: Step-wise On-policy Distillation for Small Language Model Agents
  SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
- RVPO: Risk-Sensitive Alignment via Variance Regularization
  RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.
- To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
  LLMs often misalign their self-perceived need for tools with true need and utility, but lightweight estimators trained on hidden states can improve tool-calling decisions and task performance across multiple models and tasks.
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
  ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.
- See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
  ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
  Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
- AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning
  AutoSearch applies RL with a self-answering reward to adaptively determine minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.
- The Cartesian Cut in Agentic AI
  LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
- Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents
  The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.
- Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA
  A pipeline of dataset construction from prior work, AugFC parameter augmentation, and two-step LLM training improves function calling for financial APIs and is running in production.
Appendix
A User Prompt Details. The system instruction is shown in Figure ...
B Experiment Details. Training Data Details. We empirically use 4K data points for training, as each dataset consists of samples drawn from the same distribution. Adding more data of similar nature does not increase task diversity. Moreover, we observe that increasing the dataset size beyond 4K does not yield noticeable improvements in the training converg...