
arxiv: 2604.27859 · v3 · pith:DCK5DTHPnew · submitted 2026-04-30 · 💻 cs.AI · cs.ET

Rethinking Agentic Reinforcement Learning In Large Language Models

Pith reviewed 2026-05-19 16:54 UTC · model grok-4.3

classification 💻 cs.AI cs.ET
keywords agentic reinforcement learning · large language models · meta-reasoning · self-reflection · autonomous agents · open-ended tasks · dynamic planning

The pith

Large language models shift reinforcement learning from fixed rewards to autonomous agents that reason and plan in uncertain settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines the move from classic reinforcement learning, which optimizes static rewards in closed environments, toward agentic versions that use large language models to create agents able to set goals, make long-term plans, and adapt strategies on the fly. The key addition is folding cognitive-style processes such as meta-reasoning and self-reflection straight into the training loop, replacing reliance on short episodes and unchanging objectives. The authors lay out the underlying ideas, new design methods, and practical implementations while flagging open problems and sketching directions for further work.

Core claim

LLM-based Agentic RL extends traditional RL by placing cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly inside the learning loop, allowing agents to handle goal-setting, long-term planning, and dynamic adaptation in open, uncertain environments instead of depending on predefined rewards and episodic interactions.

What carries the argument

LLM-based Agentic RL, the framework that embeds meta-reasoning, self-reflection, and interactive reasoning into the RL training process to support autonomous behavior.

If this is right

  • Agents can operate without hand-crafted reward functions in real-world, open-ended tasks.
  • Learning becomes more robust through ongoing internal reflection rather than external feedback alone.
  • Multi-step reasoning enables agents to manage longer time horizons and changing conditions.
  • Design patterns for these agents can be reused across robotics, simulation, and interactive systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach may lower the amount of human-specified objectives needed to train capable agents.
  • It suggests a route toward AI systems that improve their own planning rules during interaction.
  • Combining the method with external tools or memory structures could further stabilize long-horizon behavior.

Load-bearing premise

Large language models can reliably sustain meta-reasoning and self-reflection inside the reinforcement learning loop without generating instability or hallucinations that break long-term planning.

What would settle it

A controlled test in which LLM agents repeatedly produce inconsistent self-reflections or hallucinations that cause them to fail at sustained planning in changing, uncertain environments.

Figures

Figures reproduced from arXiv: 2604.27859 by Cheng Fang, Fangming Cui, Jiahong Li, Ruixiao Zhu, Sunan Li.

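Figure 1. Agent.
Text captured with the caption (truncated in the source): $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \cdots$

Figure 2. Demonstration of DPO, PPO and GRPO.
Text captured with the caption: "[...] latest Qwen3 models. The GSPO loss is defined as:
$$\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( w_i(\theta)\, \hat{A}_i,\ \operatorname{clip}\big(w_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat{A}_i \right) \right], \tag{20}$$
where
$$\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}\big(\{r(x, y_i)\}_{i=1}^{G}\big)}{\operatorname{std}\big(\{r(x, y_i)\}_{i=1}^{G}\big)}, \tag{21}$$"

Figure 3. Evolution diagram of RL algorithm technology.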
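
The GSPO objective quoted with Figure 2 (Eqs. 20 and 21) can be made concrete with a small numerical sketch. This is an illustration only, not the authors' implementation. The excerpt does not define the ratios w_i(θ), so the hypothetical functions below take them as precomputed inputs; the population standard deviation and the small `eps` stabilizer in the denominator are assumptions that do not appear in Eq. 21, and `epsilon=0.2` is an arbitrary default.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Eq. (21): standardize rewards within one group of G responses.

    Assumptions: population std is used, and `eps` is added to the
    denominator to avoid division by zero when all rewards are equal.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clip(value, low, high):
    """Clamp `value` to the closed interval [low, high]."""
    return max(low, min(high, value))


def clipped_group_objective(ratios, rewards, epsilon=0.2):
    """Eq. (20) for a single prompt x.

    `ratios` holds the importance ratios w_i(theta), one per response;
    how they are computed is not stated in the excerpt, so they are
    treated as given. Returns the group mean of
    min(w_i * A_i, clip(w_i, 1 - epsilon, 1 + epsilon) * A_i).
    """
    advantages = group_advantages(rewards)
    terms = [
        min(w * a, clip(w, 1.0 - epsilon, 1.0 + epsilon) * a)
        for w, a in zip(ratios, advantages)
    ]
    return fmean(terms)


if __name__ == "__main__":
    # Four sampled responses to one prompt, with binary rewards.
    rewards = [1.0, 0.0, 1.0, 0.0]
    ratios = [2.0, 2.0, 1.0, 1.0]
    print(clipped_group_objective(ratios, rewards))
```

In this example the first response has a positive advantage and a ratio above 1 + ε, so its term is clipped; the second has a negative advantage, so the `min` keeps the unclipped (more pessimistic) term. That asymmetry is the usual purpose of a PPO-style clipped objective.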

Original abstract

Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we provide a deep insight for looking the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript surveys the emerging area of LLM-based Agentic Reinforcement Learning. It contrasts traditional RL (static objectives, episodic interactions, specialized agents) with an agentic paradigm in which LLMs enable autonomous agents that perform goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning. The central descriptive claim is that cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making are incorporated directly into the learning loop. The paper reviews conceptual foundations, methodological innovations, and effective designs, then identifies challenges and future directions.

Significance. If the synthesis is accurate, the survey could serve as a useful entry point for researchers by organizing trends at the intersection of LLMs and RL and by cataloging open challenges. Its value is primarily in structured overview and problem identification rather than new theorems, derivations, or controlled experiments. No machine-checked proofs, reproducible code, or falsifiable quantitative predictions are presented.

major comments (1)
  1. [Abstract] The assertion that LLM-based Agentic RL 'incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop' is presented as the key distinction from traditional RL, yet the manuscript supplies no concrete mechanisms, update rules, or controlled comparisons showing how these capabilities are realized without instability or hallucination. This framing is load-bearing for the claimed paradigm shift.
minor comments (2)
  1. The manuscript would benefit from explicit citations to representative papers or systems for each methodological innovation and design pattern discussed, to allow readers to trace the claims to primary sources.
  2. Terminology such as 'agentic paradigms' and 'cognitive-like capabilities' should be defined more precisely on first use, with clear distinctions from related concepts in existing RL and LLM literature.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive feedback on our survey manuscript. We address the single major comment below and outline the revisions we will make to improve clarity and precision.

Point-by-point responses
  1. Referee: [Abstract] The assertion that LLM-based Agentic RL 'incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop' is presented as the key distinction from traditional RL, yet the manuscript supplies no concrete mechanisms, update rules, or controlled comparisons showing how these capabilities are realized without instability or hallucination. This framing is load-bearing for the claimed paradigm shift.

    Authors: We agree that the abstract presents this distinction at a high level and that, as a survey, the manuscript does not introduce new mechanisms, update rules, or original controlled experiments. Instead, it synthesizes and organizes existing literature on how these cognitive-like capabilities are being integrated into agentic RL loops through reviewed methodological innovations (detailed in the sections on conceptual foundations and effective designs). The paper also explicitly catalogs instability and hallucination as open challenges rather than claiming they have been resolved. To address the concern, we will revise the abstract to more explicitly frame the claim as a synthesis of trends from the surveyed works, with a brief illustrative reference to representative mechanisms (e.g., reflection-augmented policy updates) drawn from the literature. This revision will preserve the abstract's length while strengthening the evidential grounding for the paradigm-shift framing.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

Full rationale

The manuscript is a survey paper that reviews conceptual foundations, methodological trends, and open challenges in LLM-based agentic RL. It advances no new formal derivation, theorem, fitted parameters, or quantitative predictions. All claims are framed descriptively as observations of an emerging paradigm rather than results obtained by reducing equations or self-citations to the paper's own inputs. The derivation chain is therefore empty by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard domain assumptions from RL and LLM research (e.g., that models can perform multi-step reasoning when prompted appropriately) without introducing new free parameters, axioms specific to this work, or invented entities.

axioms (1)
  • Domain assumption: Large language models can incorporate meta-reasoning and self-reflection into decision-making loops when used as agents.
    Invoked in the abstract when describing how LLM-based Agentic RL adds cognitive-like capabilities directly into the learning loop.

pith-pipeline@v0.9.0 · 5688 in / 1130 out tokens · 37734 ms · 2026-05-19T16:54:19.083818+00:00 · methodology


