Rethinking Agentic Reinforcement Learning In Large Language Models
Pith reviewed 2026-05-19 16:54 UTC · model grok-4.3
The pith
Large language models shift reinforcement learning from fixed rewards to autonomous agents that reason and plan in uncertain settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-based Agentic RL extends traditional RL by placing cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly inside the learning loop, allowing agents to handle goal-setting, long-term planning, and dynamic adaptation in open, uncertain environments instead of depending on predefined rewards and episodic interactions.
What carries the argument
LLM-based Agentic RL, the framework that embeds meta-reasoning, self-reflection, and interactive reasoning into the RL training process to support autonomous behavior.
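To make "reflection inside the learning loop" concrete, here is a minimal toy sketch in Python. It is not the paper's method: the `reflect` function is a rule-based stand-in for an LLM self-reflection call, and the two-action environment, the action names `plan_a` and `plan_b`, the penalty of -0.5, and the learning rate are all assumptions chosen for illustration. The critique is added to the external reward before a simple value update, which is one possible reading of a reflection-augmented update.

```python
def reflect(action, reward, history):
    """Rule-based stand-in for an LLM self-reflection call (hypothetical).

    Returns a critique: -0.5 when the agent repeats an action that
    already failed on the previous step, otherwise 0.0.
    """
    repeated_failure = (
        reward == 0.0 and bool(history) and history[-1] == (action, 0.0)
    )
    return -0.5 if repeated_failure else 0.0


def train(steps=20, lr=0.1, use_reflection=True):
    # Hypothetical deterministic environment:
    # plan_a always fails (reward 0), plan_b always succeeds (reward 1).
    payoff = {"plan_a": 0.0, "plan_b": 1.0}
    values = {action: 0.0 for action in payoff}
    history = []
    for _ in range(steps):
        # Act greedily; ties resolve to the first key, i.e. plan_a.
        action = max(values, key=values.get)
        reward = payoff[action]
        # Reflect inside the loop, before the update.
        critique = reflect(action, reward, history) if use_reflection else 0.0
        # Update: move the value estimate toward reward plus critique.
        values[action] += lr * (reward + critique - values[action])
        history.append((action, reward))
    return values, history


if __name__ == "__main__":
    with_reflection, _ = train(use_reflection=True)
    without_reflection, _ = train(use_reflection=False)
    print("with reflection:", with_reflection)
    print("without reflection:", without_reflection)
```

In this toy setting the external reward alone never moves the agent off the failing action, because a reward of 0 leaves its value tied with the untried action; the reflection penalty is what breaks the tie. Real LLM reflections are noisy rather than rule-based, which is the instability the load-bearing premise on this page is concerned with.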
If this is right
- Agents can operate without hand-crafted reward functions in real-world, open-ended tasks.
- Learning becomes more robust through ongoing internal reflection rather than external feedback alone.
- Multi-step reasoning enables agents to manage longer time horizons and changing conditions.
- Design patterns for these agents can be reused across robotics, simulation, and interactive systems.
Where Pith is reading between the lines
- This approach may lower the amount of human-specified objectives needed to train capable agents.
- It suggests a route toward AI systems that improve their own planning rules during interaction.
- Combining the method with external tools or memory structures could further stabilize long-horizon behavior.
Load-bearing premise
Large language models can reliably sustain meta-reasoning and self-reflection inside the reinforcement learning loop without generating instability or hallucinations that break long-term planning.
What would settle it
A controlled test in which LLM agents repeatedly produce inconsistent self-reflections or hallucinations that cause them to fail at sustained planning in changing, uncertain environments.
Abstract
Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we examine the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript surveys the emerging area of LLM-based Agentic Reinforcement Learning. It contrasts traditional RL (static objectives, episodic interactions, specialized agents) with an agentic paradigm in which LLMs enable autonomous agents that perform goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning. The central descriptive claim is that cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making are incorporated directly into the learning loop. The paper reviews conceptual foundations, methodological innovations, and effective designs, then identifies challenges and future directions.
Significance. If the synthesis is accurate, the survey could serve as a useful entry point for researchers by organizing trends at the intersection of LLMs and RL and by cataloging open challenges. Its value is primarily in structured overview and problem identification rather than new theorems, derivations, or controlled experiments. No machine-checked proofs, reproducible code, or falsifiable quantitative predictions are presented.
Major comments (1)
- [Abstract] The assertion that LLM-based Agentic RL 'incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop' is presented as the key distinction from traditional RL, yet the manuscript supplies no concrete mechanisms, update rules, or controlled comparisons showing how these capabilities are realized without instability or hallucination. This framing is load-bearing for the claimed paradigm shift.
Minor comments (2)
- The manuscript would benefit from explicit citations to representative papers or systems for each methodological innovation and design pattern discussed, to allow readers to trace the claims to primary sources.
- Terminology such as 'agentic paradigms' and 'cognitive-like capabilities' should be defined more precisely on first use, with clear distinctions from related concepts in existing RL and LLM literature.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback on our survey manuscript. We address the single major comment below and outline the revisions we will make to improve clarity and precision.
Point-by-point responses
- Referee: [Abstract] The assertion that LLM-based Agentic RL 'incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop' is presented as the key distinction from traditional RL, yet the manuscript supplies no concrete mechanisms, update rules, or controlled comparisons showing how these capabilities are realized without instability or hallucination. This framing is load-bearing for the claimed paradigm shift.
  Authors: We agree that the abstract presents this distinction at a high level and that, as a survey, the manuscript does not introduce new mechanisms, update rules, or original controlled experiments. Instead, it synthesizes and organizes existing literature on how these cognitive-like capabilities are being integrated into agentic RL loops through reviewed methodological innovations (detailed in the sections on conceptual foundations and effective designs). The paper also explicitly catalogs instability and hallucination as open challenges rather than claiming they have been resolved. To address the concern, we will revise the abstract to more explicitly frame the claim as a synthesis of trends from the surveyed works, with a brief illustrative reference to representative mechanisms (e.g., reflection-augmented policy updates) drawn from the literature. This revision will preserve the abstract's length while strengthening the evidential grounding for the paradigm-shift framing.
  Revision: yes
Circularity Check
No significant circularity identified
The manuscript is a survey paper that reviews conceptual foundations, methodological trends, and open challenges in LLM-based agentic RL. It advances no new formal derivation, theorem, fitted parameters, or quantitative predictions. All claims are framed descriptively as observations of an emerging paradigm rather than results obtained by reducing equations or self-citations to the paper's own inputs. The derivation chain is therefore empty by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Large language models can incorporate meta-reasoning and self-reflection into decision-making loops when used as agents.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: PPO loss ... GRPO loss ... DAPO loss
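The passage quoted above names PPO and GRPO losses without showing them. As a reference point, the sketch below writes out the per-sample PPO clipped surrogate (Schulman et al., 2017) and a GRPO-style group-relative advantage (Shao et al., 2024) in plain Python. The function names, the default `clip_eps=0.2`, the use of the population standard deviation, and the small `eps` stabilizer are assumptions made here; the surveyed paper's exact notation is not reproduced on this page, and DAPO is left out.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group.

    Assumption: population standard deviation; implementations may use
    the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample (a quantity to maximize).

    ratio is pi_new(a|s) / pi_old(a|s); the clip keeps a single update
    from moving the policy too far from the one that collected the data.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored 1 (pass) or 0 (fail).
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    for adv in advantages:
        print(adv, clipped_surrogate(ratio=1.5, advantage=adv))
```

The group-relative form needs no learned value function, since the baseline is the mean reward of the sampled group; that is the main practical difference from the PPO setup, which typically pairs the clipped surrogate with a critic.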
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.