Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

Dianxing Tang; Dingshuo Chen; Liang Wang; Liuji Chen; Qiang Liu; Shu Wu; Xing Shi

arxiv: 2606.02132 · v2 · pith:IZORAA2Tnew · submitted 2026-06-01 · 💻 cs.AI

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

Liuji Chen , Dianxing Tang , Xing Shi , Dingshuo Chen , Qiang Liu , Shu Wu , Liang Wang This is my paper

Pith reviewed 2026-06-28 14:57 UTC · model grok-4.3

classification 💻 cs.AI

keywords agentic reinforcement learningtool abusereward shapingpolicy optimizationreasoning benchmarksaccuracy-efficiency trade-offselective tool use

0 comments

The pith

EAPO lets agents learn selective tool use by penalizing redundant calls mainly on easier queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EAPO to fix tool abuse in agentic reinforcement learning, where models overuse external tools even when internal reasoning suffices. It adds tool-free trajectories to each rollout group, applies difficulty-aware reward shaping, and uses confidence-aware token reweighting. These changes produce better accuracy with fewer tool calls than GRPO on nine math and knowledge benchmarks across three model families. A reader would care because the approach shows agents can be trained to decide when not to act without blanket limits that might block useful exploration.

Core claim

EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO improves average performance by 10.45%, 7.27%, and 9.69% on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B while reducing average tool calls by 18.33%, 18.33%, and 24.59% respectively compared with GRPO.

What carries the argument

Difficulty-aware reward shaping on tool-free trajectories within rollout groups, which selectively discourages unnecessary tool calls.

If this is right

Agents reach higher accuracy on reasoning tasks while making fewer tool calls overall.
The gains hold across model sizes from 3B to 8B parameters.
Tool use stays available for queries that genuinely need it due to the difficulty-aware design.
Policy updates improve because low-confidence tokens receive adjusted weights.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selective-shaping idea could extend to code or planning agents where tool overuse also occurs.
Replacing static difficulty labels with online estimation would make the method more general.
Uniform tool penalties appear less efficient than query-specific shaping for the accuracy-efficiency trade-off.

Load-bearing premise

Penalizing tool use mainly on easier queries will not suppress beneficial tool-assisted exploration on harder queries where tools remain necessary.

What would settle it

If EAPO agents show lower accuracy than GRPO on hard queries because they use tools less often, that observation would falsify the claim that the shaping preserves necessary tool use.

Figures

Figures reproduced from arXiv: 2606.02132 by Dianxing Tang, Dingshuo Chen, Liang Wang, Liuji Chen, Qiang Liu, Shu Wu, Xing Shi.

**Figure 2.** Figure 2: Proportion of tool-free trajectories in rollouts [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: The overview of EAPO algorithm [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Experimental results under different hyperparameter settings. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Efficiency analysis of EAPO in training and inference. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EAPO adds difficulty-aware shaping and confidence reweighting to GRPO to cut redundant tool calls, but the gains rest on aggregate numbers without ablations or variance.

read the letter

The main thing here is EAPO tries to fix tool overuse in agentic RL by injecting tool-free trajectories, applying penalties mostly to easier queries via difficulty-aware shaping, and reweighting tokens by model confidence. This trio is meant to avoid the blunt effects of uniform penalties that can block useful tool use on hard problems.

The approach makes sense on paper. Blanket limits hurt exploration where tools actually help, so targeting the penalty at cases the model can probably solve internally is a direct response to that. The abstract reports consistent gains across three model sizes on nine benchmarks, with fewer tool calls and sometimes higher accuracy. That kind of efficiency improvement would matter for anyone running these agents at scale.

The evidence is the weak part. We get average deltas like 7-10% better performance and 18-25% fewer calls, but nothing on run-to-run variance, statistical tests, or ablations that isolate each component. The difficulty metric itself is not described, so the stress-test worry about noisy classification on hard queries that still need tools cannot be checked. If that estimator correlates with tool necessity, the method could suppress exactly the exploration it claims to protect.

No circularity in the results and the citation pattern looks standard for this area. The work is internally coherent and engages the practical issue rather than just adding another regularizer.

This is for people working on production reasoning agents who need lower latency and cost. It is incremental rather than foundational, but the problem is real enough that the paper deserves a serious referee to check the missing controls and experimental protocol.

Referee Report

2 major / 1 minor

Summary. The paper proposes EAPO, an Efficient Agentic Policy Optimization method for agentic RL that mitigates tool abuse via tool-free trajectories in rollouts, difficulty-aware reward shaping to penalize redundant tool calls primarily on easier queries, and confidence-aware token reweighting. It claims consistent improvements to the accuracy-efficiency trade-off over GRPO across nine mathematical and knowledge-intensive benchmarks on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B, with average accuracy gains of 10.45%, 7.27%, and 9.69% alongside tool-call reductions of 18.33%, 18.33%, and 24.59%.

Significance. If the empirical claims are substantiated with variance, statistical tests, and targeted ablations, the selective penalty mechanism could meaningfully advance efficient tool use in agentic systems by showing that difficulty-aware shaping can reduce overuse without harming hard-query performance. The multi-model, multi-benchmark scope is a positive aspect of the evaluation design.

major comments (2)

[Abstract] Abstract: the central performance claims report average gains without variance, statistical tests, ablation results, or experimental protocol details (e.g., rollout counts, difficulty estimator definition, or seed reporting), leaving the quantitative improvements only weakly supported.
[Method (difficulty-aware reward shaping)] Method section on difficulty-aware reward shaping: the load-bearing assumption that the (unspecified) difficulty metric cleanly separates redundant tool use on easy queries from necessary tool use on hard queries is not validated; aggregate benchmark results do not address potential misclassification on the harder tail, which could suppress beneficial exploration as highlighted by the stress-test concern.

minor comments (1)

[Abstract] The abstract would be clearer if it briefly defined or referenced the difficulty estimator and confidence reweighting mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. Below we address each major comment point-by-point, indicating planned revisions where the manuscript can be strengthened.

read point-by-point responses

Referee: [Abstract] Abstract: the central performance claims report average gains without variance, statistical tests, ablation results, or experimental protocol details (e.g., rollout counts, difficulty estimator definition, or seed reporting), leaving the quantitative improvements only weakly supported.

Authors: We agree that the abstract's quantitative claims would be better supported by variance, statistical details, and protocol information. In the revision we will add standard deviations across seeds to the reported averages, explicitly state the number of rollouts and seeds used, define the difficulty estimator in the method section (with a brief reference in the abstract if space allows), and include a short note on the experimental protocol. These additions will be placed in the main text or appendix as appropriate while keeping the abstract concise. revision: yes
Referee: [Method (difficulty-aware reward shaping)] Method section on difficulty-aware reward shaping: the load-bearing assumption that the (unspecified) difficulty metric cleanly separates redundant tool use on easy queries from necessary tool use on hard queries is not validated; aggregate benchmark results do not address potential misclassification on the harder tail, which could suppress beneficial exploration as highlighted by the stress-test concern.

Authors: The difficulty metric is introduced in the method section but we acknowledge it is not fully specified or validated against the harder tail in the current draft. We will revise the method section to provide an explicit definition and add a targeted analysis (new figure or table) that breaks down tool-call frequency and accuracy on the most difficult queries to demonstrate that performance on hard cases is preserved. This directly addresses the risk of misclassification suppressing exploration. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical gains on held-out benchmarks are independent of internal definitions

full rationale

The paper proposes EAPO as a policy optimization method with difficulty-aware reward shaping and reports direct accuracy and tool-call reductions versus GRPO on nine external benchmarks. No equations are presented that define a 'prediction' as a fitted parameter or that reduce the reported deltas to quantities constructed from the method's own inputs. The central claims rest on held-out evaluation rather than self-referential derivation or self-citation chains. This is the expected non-finding for an empirical RL methods paper.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard RL assumptions (Markovian state transitions, reward signals that correctly reflect task success) plus the untested premise that difficulty can be reliably estimated to apply selective penalties. No new physical or mathematical entities are introduced.

free parameters (1)

difficulty threshold for reward shaping
The method requires a way to classify queries as easier or harder to apply differential penalties; this threshold or classifier is not specified in the abstract.

axioms (1)

domain assumption Standard assumptions of policy gradient methods hold for the agentic tool-use MDP.
Any RL policy optimization method implicitly relies on this.

pith-pipeline@v0.9.1-grok · 5742 in / 1242 out tokens · 30444 ms · 2026-06-28T14:57:49.555530+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Einstein World Models
cs.AI 2026-06 unverdicted novelty 5.0

Einstein World Models integrate visual rollouts from a callable world-module into LLM reasoning traces to support complex thought beyond language.

Reference graph

Works this paper leans on

34 extracted references · 6 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Agentic Reinforced Policy Optimization

Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji. Agentic Reinforced Policy Optimization , journal =. 2025 , url =. doi:10.48550/ARXIV.2507.19849 , eprinttype =. 2507.19849 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.19849 2025
[2]

Preprint, arXiv:2510.14545

Guanting Dong and Licheng Bao and Zhongyuan Wang and Kangzhi Zhao and Xiaoxi Li and Jiajie Jin and Jinghan Yang and Hangyu Mao and Fuzheng Zhang and Kun Gai and Guorui Zhou and Yutao Zhu and Ji. Agentic Entropy-Balanced Policy Optimization , journal =. 2025 , url =. doi:10.48550/ARXIV.2510.14545 , eprinttype =. 2510.14545 , timestamp =

work page doi:10.48550/arxiv.2510.14545 2025
[3]

arXiv preprint arXiv:2509.02479 , year=

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning , author=. arXiv preprint arXiv:2509.02479 , year=

arXiv
[4]

2025 , eprint=

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning , author=. 2025 , eprint=

2025
[5]

2025 , eprint=

First Return, Entropy-Eliciting Explore , author=. 2025 , eprint=

2025
[6]

2023 , eprint=

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection , author=. 2023 , eprint=

2023
[7]

2025 , eprint=

Acting Less is Reasoning More! Teaching Model to Act Efficiently , author=. 2025 , eprint=

2025
[8]

2025 , eprint=

R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning , author=. 2025 , eprint=

2025
[9]

2025 , eprint=

ToRL: Scaling Tool-Integrated RL , author=. 2025 , eprint=

2025
[10]

2025 , eprint=

Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards , author=. 2025 , eprint=

2025
[11]

2025 , eprint=

Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning , author=. 2025 , eprint=

2025
[12]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

2021
[13]

NeurIPS , year=

Measuring Mathematical Problem Solving With the MATH Dataset , author=. NeurIPS , year=
[14]

2023 , eprint=

Let's Verify Step by Step , author=. 2023 , eprint=

2023
[15]

and Salakhutdinov, Ruslan and Manning, Christopher D

Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D. , booktitle=
[16]

2020 , eprint=

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps , author=. 2020 , eprint=

2020
[17]

2022 , publisher=

Trivedi, Harsh and Balasubramanian, Niranjan and Khot, Tushar and Sabharwal, Ashish , journal=. 2022 , publisher=

2022
[18]

Measuring and Narrowing the Compositionality Gap in Language Models

Press, Ofir and Zhang, Muru and Min, Sewon and Schmidt, Ludwig and Smith, Noah and Lewis, Mike. Measuring and Narrowing the Compositionality Gap in Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.378

work page doi:10.18653/v1/2023.findings-emnlp.378 2023
[19]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024
[20]

arXiv preprint arXiv:2501.03262 , year=

Reinforce++: A simple and efficient approach for aligning large language models , author=. arXiv preprint arXiv:2501.03262 , year=

Pith/arXiv arXiv
[21]

Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning , journal =

Guanting Dong and Yifei Chen and Xiaoxi Li and Jiajie Jin and Hongjin Qian and Yutao Zhu and Hangyu Mao and Guorui Zhou and Zhicheng Dou and Ji. Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning , journal =. 2025 , url =. doi:10.48550/ARXIV.2505.16410 , eprinttype =. 2505.16410 , timestamp =

work page doi:10.48550/arxiv.2505.16410 2025
[22]

The Landscape of Agentic Reinforcement Learning for

Guibin Zhang and Hejia Geng and Xiaohang Yu and Zhenfei Yin and Zaibin Zhang and Zelin Tan and Heng Zhou and Zhong-Zhi Li and Xiangyuan Xue and Yijiang Li and Yifan Zhou and Yang Chen and Chen Zhang and Yutao Fan and Zihu Wang and Songtao Huang and Francisco Piedrahita Velez and Yue Liao and Hongru WANG and Mengyue Yang and Heng Ji and Jun Wang and Shuich...

2026
[23]

2024 , eprint=

Retrieval-Augmented Generation for Large Language Models: A Survey , author=. 2024 , eprint=

2024
[24]

arXiv preprint arXiv:2412.15115 , year=

Qwen2.5 Technical Report , author=. arXiv preprint arXiv:2412.15115 , year=

Pith/arXiv arXiv
[25]

arXiv preprint arXiv:2210.03629 , year =

ReAct: Synergizing Reasoning and Acting in Language Models , author =. arXiv preprint arXiv:2210.03629 , year =

Pith/arXiv arXiv
[26]

Bowen Jin and Hansi Zeng and Zhenrui Yue and Jinsung Yoon and Sercan O. Ar. Search-R1: Training. arXiv preprint arXiv:2503.09516 , year =. 2503.09516 , archivePrefix=

Pith/arXiv arXiv
[27]

arXiv preprint arXiv:2504.09696 , year =

GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models , author =. arXiv preprint arXiv:2504.09696 , year =. 2504.09696 , archivePrefix=

arXiv
[28]

arXiv preprint arXiv:2505.18086 , year =

Stable Reinforcement Learning for Efficient Reasoning , author =. arXiv preprint arXiv:2505.18086 , year =. 2505.18086 , archivePrefix=

arXiv
[29]

2024 , eprint=

Text Embeddings by Weakly-Supervised Contrastive Pre-training , author=. 2024 , eprint=

2024
[30]

Dense Passage Retrieval for Open-Domain Question Answering

Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.550

work page doi:10.18653/v1/2020.emnlp-main.550 2020
[31]

arXiv preprint arXiv:2507.21836 , year=

Autotir: Autonomous tools integrated reasoning via reinforcement learning , author=. arXiv preprint arXiv:2507.21836 , year=

arXiv
[32]

SMART : Self-Aware Agent for Tool Overuse Mitigation

Qian, Cheng and Acikgoz, Emre Can and Wang, Hongru and Chen, Xiusi and Sil, Avirup and Hakkani-T. SMART : Self-Aware Agent for Tool Overuse Mitigation. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.239

work page doi:10.18653/v1/2025.findings-acl.239 2025
[33]

2026 , eprint=

Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning , author=. 2026 , eprint=

2026
[34]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

2024

[1] [1]

Agentic Reinforced Policy Optimization

Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji. Agentic Reinforced Policy Optimization , journal =. 2025 , url =. doi:10.48550/ARXIV.2507.19849 , eprinttype =. 2507.19849 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.19849 2025

[2] [2]

Preprint, arXiv:2510.14545

Guanting Dong and Licheng Bao and Zhongyuan Wang and Kangzhi Zhao and Xiaoxi Li and Jiajie Jin and Jinghan Yang and Hangyu Mao and Fuzheng Zhang and Kun Gai and Guorui Zhou and Yutao Zhu and Ji. Agentic Entropy-Balanced Policy Optimization , journal =. 2025 , url =. doi:10.48550/ARXIV.2510.14545 , eprinttype =. 2510.14545 , timestamp =

work page doi:10.48550/arxiv.2510.14545 2025

[3] [3]

arXiv preprint arXiv:2509.02479 , year=

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning , author=. arXiv preprint arXiv:2509.02479 , year=

arXiv

[4] [4]

2025 , eprint=

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning , author=. 2025 , eprint=

2025

[5] [5]

2025 , eprint=

First Return, Entropy-Eliciting Explore , author=. 2025 , eprint=

2025

[6] [6]

2023 , eprint=

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection , author=. 2023 , eprint=

2023

[7] [7]

2025 , eprint=

Acting Less is Reasoning More! Teaching Model to Act Efficiently , author=. 2025 , eprint=

2025

[8] [8]

2025 , eprint=

R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning , author=. 2025 , eprint=

2025

[9] [9]

2025 , eprint=

ToRL: Scaling Tool-Integrated RL , author=. 2025 , eprint=

2025

[10] [10]

2025 , eprint=

Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards , author=. 2025 , eprint=

2025

[11] [11]

2025 , eprint=

Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning , author=. 2025 , eprint=

2025

[12] [12]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

2021

[13] [13]

NeurIPS , year=

Measuring Mathematical Problem Solving With the MATH Dataset , author=. NeurIPS , year=

[14] [14]

2023 , eprint=

Let's Verify Step by Step , author=. 2023 , eprint=

2023

[15] [15]

and Salakhutdinov, Ruslan and Manning, Christopher D

Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D. , booktitle=

[16] [16]

2020 , eprint=

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps , author=. 2020 , eprint=

2020

[17] [17]

2022 , publisher=

Trivedi, Harsh and Balasubramanian, Niranjan and Khot, Tushar and Sabharwal, Ashish , journal=. 2022 , publisher=

2022

[18] [18]

Measuring and Narrowing the Compositionality Gap in Language Models

Press, Ofir and Zhang, Muru and Min, Sewon and Schmidt, Ludwig and Smith, Noah and Lewis, Mike. Measuring and Narrowing the Compositionality Gap in Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.378

work page doi:10.18653/v1/2023.findings-emnlp.378 2023

[19] [19]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024

[20] [20]

arXiv preprint arXiv:2501.03262 , year=

Reinforce++: A simple and efficient approach for aligning large language models , author=. arXiv preprint arXiv:2501.03262 , year=

Pith/arXiv arXiv

[21] [21]

Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning , journal =

Guanting Dong and Yifei Chen and Xiaoxi Li and Jiajie Jin and Hongjin Qian and Yutao Zhu and Hangyu Mao and Guorui Zhou and Zhicheng Dou and Ji. Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning , journal =. 2025 , url =. doi:10.48550/ARXIV.2505.16410 , eprinttype =. 2505.16410 , timestamp =

work page doi:10.48550/arxiv.2505.16410 2025

[22] [22]

The Landscape of Agentic Reinforcement Learning for

Guibin Zhang and Hejia Geng and Xiaohang Yu and Zhenfei Yin and Zaibin Zhang and Zelin Tan and Heng Zhou and Zhong-Zhi Li and Xiangyuan Xue and Yijiang Li and Yifan Zhou and Yang Chen and Chen Zhang and Yutao Fan and Zihu Wang and Songtao Huang and Francisco Piedrahita Velez and Yue Liao and Hongru WANG and Mengyue Yang and Heng Ji and Jun Wang and Shuich...

2026

[23] [23]

2024 , eprint=

Retrieval-Augmented Generation for Large Language Models: A Survey , author=. 2024 , eprint=

2024

[24] [24]

arXiv preprint arXiv:2412.15115 , year=

Qwen2.5 Technical Report , author=. arXiv preprint arXiv:2412.15115 , year=

Pith/arXiv arXiv

[25] [25]

arXiv preprint arXiv:2210.03629 , year =

ReAct: Synergizing Reasoning and Acting in Language Models , author =. arXiv preprint arXiv:2210.03629 , year =

Pith/arXiv arXiv

[26] [26]

Bowen Jin and Hansi Zeng and Zhenrui Yue and Jinsung Yoon and Sercan O. Ar. Search-R1: Training. arXiv preprint arXiv:2503.09516 , year =. 2503.09516 , archivePrefix=

Pith/arXiv arXiv

[27] [27]

arXiv preprint arXiv:2504.09696 , year =

GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models , author =. arXiv preprint arXiv:2504.09696 , year =. 2504.09696 , archivePrefix=

arXiv

[28] [28]

arXiv preprint arXiv:2505.18086 , year =

Stable Reinforcement Learning for Efficient Reasoning , author =. arXiv preprint arXiv:2505.18086 , year =. 2505.18086 , archivePrefix=

arXiv

[29] [29]

2024 , eprint=

Text Embeddings by Weakly-Supervised Contrastive Pre-training , author=. 2024 , eprint=

2024

[30] [30]

Dense Passage Retrieval for Open-Domain Question Answering

Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau. Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.550

work page doi:10.18653/v1/2020.emnlp-main.550 2020

[31] [31]

arXiv preprint arXiv:2507.21836 , year=

Autotir: Autonomous tools integrated reasoning via reinforcement learning , author=. arXiv preprint arXiv:2507.21836 , year=

arXiv

[32] [32]

SMART : Self-Aware Agent for Tool Overuse Mitigation

Qian, Cheng and Acikgoz, Emre Can and Wang, Hongru and Chen, Xiusi and Sil, Avirup and Hakkani-T. SMART : Self-Aware Agent for Tool Overuse Mitigation. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.239

work page doi:10.18653/v1/2025.findings-acl.239 2025

[33] [33]

2026 , eprint=

Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning , author=. 2026 , eprint=

2026

[34] [34]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

2024