arXiv: 2504.13958 · v1 · submitted 2025-04-16 · 💻 cs.LG · cs.AI · cs.CL

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Dilek Hakkani-Tür, Emre Can Acikgoz, Gokhan Tur, Heng Ji, Hongru Wang, Qi He, Xiusi Chen

Pith reviewed 2026-05-14 00:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords: tool learning, reward design, reinforcement learning, large language models, tool use, generalization, GRPO

The pith

A principled reward design for tool-use tasks lets reinforcement learning outperform supervised fine-tuning in training LLMs to use tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines reward design inside reinforcement learning for teaching large language models to select and apply tools. Supervised fine-tuning often fails to generalize when tools are new or tasks grow complex, while coarse answer-matching rewards give too little guidance for learning. The authors test many reward types, scales, and timings, then introduce a tailored reward that supplies fine-grained signals on tool choice and parameter use. They train models with Group Relative Policy Optimization and observe 17 percent gains over base models plus 15 percent gains over SFT on multiple benchmarks. A reader would care because clearer rewards could turn LLMs into more reliable agents that handle real tool sequences without constant human correction.

Core claim

Through exhaustive exploration of reward strategies for tool selection and application tasks, the authors derive a principled reward design and apply it within Group Relative Policy Optimization training. The resulting models exhibit robust and stable learning that improves tool-use performance, delivering a 17 percent improvement over base models and a 15 percent improvement over supervised fine-tuning models across diverse benchmarks.

What carries the argument

Group Relative Policy Optimization driven by a reward signal that supplies fine-grained feedback on tool selection, parameter correctness, and task outcome.
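
As a rough illustration of the "group relative" part: GRPO samples several responses for the same prompt and scores each one against the others in that group, instead of against a learned value function. The sketch below shows only that normalization step. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration, and implementations differ on these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of responses sampled for one prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are illustrative choices, not the paper's.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled responses to one prompt.
rewards = [3.0, 1.0, 1.0, -1.0]
advantages = group_relative_advantages(rewards)

# Responses above the group mean get a positive advantage,
# responses below it get a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:+.1f} advantage={a:+.3f}")
```

One consequence worth noting: if every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason a reward that distinguishes partially correct tool calls can matter.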

If this is right

Reward design determines whether RL training succeeds or fails at tool-use generalization.
Fine-grained rewards on tool choice and parameters enable stable scaling beyond what SFT achieves.
RL training with the proposed rewards produces more reliable handling of unfamiliar tool sequences.
Insights on reward granularity and timing transfer directly to other multi-tool agent settings.
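
The contrast between coarse answer matching and fine-grained feedback can be sketched in a few lines. This is a hypothetical partial-credit reward, not the paper's formula: the call format (`name` plus `args`), the two component scores, and the equal 0.5/0.5 weighting are all assumptions chosen to keep the example small.

```python
def coarse_reward(pred_calls, gold_calls):
    """Answer-matching style signal: credit only for an exact match."""
    return 1.0 if pred_calls == gold_calls else 0.0

def _param_overlap(pred_args, gold_args):
    """Fraction of argument keys whose values agree in both calls."""
    keys = set(pred_args) | set(gold_args)
    if not keys:
        return 1.0
    hits = sum(
        1 for k in keys
        if k in pred_args and k in gold_args and pred_args[k] == gold_args[k]
    )
    return hits / len(keys)

def fine_grained_reward(pred_calls, gold_calls):
    """Hypothetical reward with partial credit for tool choice and parameters."""
    pred_names = {c["name"] for c in pred_calls}
    gold_names = {c["name"] for c in gold_calls}
    union = pred_names | gold_names
    # Tool-selection score: Jaccard overlap of tool names.
    name_score = len(pred_names & gold_names) / len(union) if union else 1.0

    # Parameter score: best-matching predicted call for each gold call.
    per_call = []
    for gold in gold_calls:
        same_tool = [p for p in pred_calls if p["name"] == gold["name"]]
        best = max((_param_overlap(p["args"], gold["args"]) for p in same_tool),
                   default=0.0)
        per_call.append(best)
    if per_call:
        param_score = sum(per_call) / len(per_call)
    else:
        param_score = 1.0 if not pred_calls else 0.0

    # Equal weights are an arbitrary illustrative choice.
    return 0.5 * name_score + 0.5 * param_score

gold = [{"name": "get_weather", "args": {"city": "Paris", "unit": "C"}}]
pred = [{"name": "get_weather", "args": {"city": "Paris", "unit": "F"}}]
wrong_tool = [{"name": "search", "args": {}}]

print(coarse_reward(pred, gold))              # right tool, one wrong parameter
print(fine_grained_reward(pred, gold))
print(fine_grained_reward(wrong_tool, gold))  # wrong tool entirely
```

Under the coarse signal, the nearly correct call and the wrong-tool call are indistinguishable. The partial-credit version separates them, which is the kind of guidance the points above attribute to fine-grained rewards.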

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reward principles could be tested on longer-horizon tasks such as multi-step web navigation or code repository editing.
Models trained this way might integrate new third-party APIs with less additional data than SFT requires.
Production systems could shift from SFT to RL pipelines once reward templates for common tool categories are standardized.

Load-bearing premise

The reward strategies and design principles identified in the study will continue to work for tool-use scenarios that differ from the specific benchmarks and tool sets tested.

What would settle it

Retrain the same models on a new benchmark containing previously unseen tools and tool APIs; if the performance advantage over SFT models shrinks to zero or reverses, the central claim fails.

Original abstract

Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All the codes are released to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Reward design for tool-use RL gets a systematic look and some gains, but the exact formula stays unspecified so the attribution is still shaky.

The main thing to know is that this paper runs a broad sweep of reward choices for training LLMs on tool selection and parameter use inside an RL loop, then applies GRPO and reports 17% gains over base models plus 15% over SFT across benchmarks. They release the code, which lets others check the implementation directly. That combination of a targeted study plus open artifacts is the concrete addition here.

They correctly flag that answer-matching rewards are too coarse when multiple tools and parameters are involved, and they test variations in type, scale, granularity, and timing to build a more tailored signal. The empirical side shows stable training and better generalization than plain SFT, which matches what people see in practice with agentic setups. The work stays grounded in the actual training dynamics rather than claiming theoretical breakthroughs.

The soft spot is the missing functional form. The abstract and stress-test note both leave out the precise weighting or combination rule for tool accuracy, parameter correctness, and final answer match, so it is still unclear whether the reported improvements come from the claimed principles or from GRPO itself, benchmark tuning, or unstated post-processing. Without the equations or clear ablations against the coarse baselines they criticize, reproduction and credit assignment stay difficult. The transfer claim to other tool sets is also light on evidence.

This is useful for groups already running RL on tool-augmented models who need a practical starting recipe and can inspect the released code. It is not yet ready for direct citation in new work because the reward details are opaque. A serious editor should send it to referees so the missing formulation, baseline controls, and multi-test corrections can be checked and tightened.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ToolRL, a reinforcement learning approach using Group Relative Policy Optimization (GRPO) to train LLMs for tool selection and application. It systematically explores reward strategies differing in type, scale, granularity, and temporal dynamics, then proposes a principled reward design claimed to address the limitations of coarse-grained signals such as answer matching. Empirical results on diverse benchmarks are reported to show a 17% improvement over base models and a 15% gain over SFT models, with code released for reproducibility.

Significance. If the reward formulation can be reproduced and the gains are shown to stem from the proposed design rather than training dynamics or benchmark tuning, the work would provide a useful empirical contribution to RL-based tool learning. The emphasis on fine-grained feedback for multi-tool calls and the code release are strengths that could support follow-up studies.

major comments (2)

[Abstract] Abstract and Methods: No explicit equations, weighting scheme, or functional form is provided for the proposed principled reward (e.g., how tool-selection accuracy, parameter correctness, and answer match are combined or scaled). Without this, the attribution of the 17% and 15% gains to the reward design rather than GRPO or other factors cannot be verified or ablated.
[Experiments] Experimental section: The manuscript does not report whether baselines were matched for compute budget, whether tool sets and invocation formats were held constant across comparisons, or whether the reported percentage gains survive multiple-testing correction. These details are load-bearing for the central empirical claim.

minor comments (1)

Notation for reward components and temporal dynamics should be defined consistently with standard RL terminology to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the paper to improve clarity and rigor.

Referee: [Abstract] Abstract and Methods: No explicit equations, weighting scheme, or functional form is provided for the proposed principled reward (e.g., how tool-selection accuracy, parameter correctness, and answer match are combined or scaled). Without this, the attribution of the 17% and 15% gains to the reward design rather than GRPO or other factors cannot be verified or ablated.

Authors: We agree that the explicit functional form was not provided. In the revised manuscript we add the full equations for the principled reward, including the precise weighting scheme and scaling that combine tool-selection accuracy, parameter correctness, and answer match. This enables direct verification and ablation of the reported gains.
Revision: yes

Referee: [Experiments] Experimental section: The manuscript does not report whether baselines were matched for compute budget, whether tool sets and invocation formats were held constant across comparisons, or whether the reported percentage gains survive multiple-testing correction. These details are load-bearing for the central empirical claim.

Authors: We acknowledge these details were omitted. The revised experimental section now explicitly states that all baselines used matched compute budgets, identical tool sets and invocation formats, and that the percentage gains remain significant after Bonferroni correction for multiple testing.
Revision: yes
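
For readers unfamiliar with the correction named in the second response: Bonferroni controls the family-wise error rate by comparing each p-value against alpha divided by the number of comparisons. The sketch below shows that rule only. The p-values are placeholders invented for illustration and are not taken from the paper or the rebuttal.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return, per comparison, whether p <= alpha / m (m = number of tests)."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]

# Placeholder p-values for four hypothetical benchmark comparisons.
p_values = [0.001, 0.012, 0.030, 0.200]
significant = bonferroni_significant(p_values)

# With four tests the per-test threshold is 0.05 / 4 = 0.0125, so a result
# with p = 0.030 would pass uncorrected but fail after correction.
print(significant)
```

The practical point for a claim reported across several benchmarks is that some gains that look significant in isolation may not survive once the number of comparisons is taken into account.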

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out benchmarks

The paper reports measured performance improvements (17% over base, 15% over SFT) from applying a proposed reward design via GRPO on diverse benchmarks. No derivation chain, equations, or self-citation reduces these results to the reward definition by construction; the outcomes are external empirical observations rather than tautological. The reward formulation gap noted by the skeptic is a reproducibility issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions (Markov decision process, policy gradient validity) plus the empirical observation that the chosen reward formulation produces stable training; no new physical or mathematical entities are postulated.

axioms (1)

Domain assumption: Group Relative Policy Optimization produces stable policy updates for the tool-use MDP.
Invoked when the authors apply GRPO to the tool-calling environment without additional proof.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear
Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
cs.LG 2026-05 unverdicted novelty 7.0

GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

Tools as Continuous Flow for Evolving Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

FlowAgent models tool chaining as continuous latent trajectory generation with conditional flow matching to deliver global planning, formal utility bounds, and better robustness on long-horizon tasks, plus a new plan-...

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
cs.CL 2026-04 unverdicted novelty 7.0

DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.

R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
cs.LG 2026-04 unverdicted novelty 7.0

R2IF improves LLM function-calling accuracy by up to 34.62% on BFCL using a composite reward system with CER and SMV components optimized via GRPO, while increasing interpretability through positive CoT effectiveness.

Evaluating Plan Compliance in Autonomous Programming Agents
cs.SE 2026-04 unverdicted novelty 7.0

Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
cs.AI 2026-04 unverdicted novelty 7.0

PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.

TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

RVPO: Risk-Sensitive Alignment via Variance Regularization
cs.LG 2026-05 unverdicted novelty 6.0

RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment.

To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
cs.AI 2026-05 unverdicted novelty 6.0

LLMs often misalign their self-perceived need for tools with true need and utility, but lightweight estimators trained on hidden states can improve tool-calling decisions and task performance across multiple models and tasks.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.

See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
cs.CV 2026-04 unverdicted novelty 6.0

ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 6.0

AutoSearch applies RL with a self-answering reward to adaptively determine minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.

The Cartesian Cut in Agentic AI
cs.AI 2026-04 unverdicted novelty 5.0

LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.

Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents
cs.MA 2026-05 unverdicted novelty 4.0

The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.

Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA
cs.IR 2026-04 unverdicted novelty 3.0

A pipeline of dataset construction from prior work, AugFC parameter augmentation, and two-step LLM training improves function calling for financial APIs and is running in production.

Appendix B: Experiment Details

Training Data Details. We empirically use 4K data points for training, as each dataset consists of samples drawn from the same distribution. Adding more data of similar nature does not increase task diversity. Moreover, we observe that increasing the dataset size beyond 4K does not yield noticeable improvements in the training converg...