pith. sign in

arxiv: 2606.21943 · v1 · pith:G4F5OO4Ynew · submitted 2026-06-20 · 💻 cs.LG · cs.AI

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

Pith reviewed 2026-06-26 12:13 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learninglarge language modelsMDP creationexplorationcredit assignmentpolicy gradientsvalue-based methodssurvey
0
0 comments X

The pith

A three-stage taxonomy of RL for LLMs shows research effort clustered in critic-free policy gradients and Monte Carlo credit assignment while value-based methods and off-policy actor-critic remain largely unexplored.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper organizes every RL algorithm for LLM post-training into three construction stages: how the MDP is defined, how exploration is performed, and how learning updates occur along the classic axes of model-free versus model-based, value-based versus policy-based, on-policy versus off-policy, and Monte Carlo versus bootstrapping credit assignment. Mapping the existing literature onto this structure produces a clear picture of uneven coverage, with dense activity only in a few cells and empty or sparse cells in others despite mature counterparts in classical RL. A reader cares because the gaps are presented as direct, actionable opportunities to import established techniques rather than as abstract absences.

Core claim

By framing LLM reinforcement learning around MDP creation (reward function, state and action spaces, termination, discount), exploration (temperature sampling, entropy regularization, intrinsic motivation, tree search, curriculum), and learning (model-free/model-based, value/policy/actor-critic, on/off-policy, Monte Carlo versus bootstrapping credit assignment), the survey demonstrates that current work concentrates overwhelmingly in critic-free policy gradients and Monte Carlo methods while value-based approaches, off-policy actor-critic training, and bootstrapping-based credit assignment are almost absent, even though each has well-established use in classical RL.

What carries the argument

The three-stage taxonomy (MDP creation, exploration, learning) that decomposes every RL design decision and serves as the coordinate system for mapping the LLM literature distribution.

If this is right

  • Value-based methods can be directly tested as alternatives to current policy-gradient pipelines.
  • Off-policy actor-critic algorithms become candidate replacements for on-policy methods like PPO.
  • Bootstrapping credit assignment offers a route to lower-variance updates than pure Monte Carlo returns.
  • The taxonomy supplies a common vocabulary that lets RL researchers and LLM practitioners identify transfer opportunities.
  • Filling the identified gaps constitutes concrete next steps rather than open-ended exploration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same taxonomy could be applied to RL in other large-model domains to check whether the same clustering pattern appears.
  • Computational cost or stability concerns specific to large models may explain some of the observed gaps and could be tested by scaling classical methods.
  • Hybrid algorithms that combine the dense areas with the sparse ones become natural targets for empirical work.
  • The framework makes it possible to quantify future progress by tracking how the distribution of papers changes over time.

Load-bearing premise

The three-stage taxonomy fully captures the design decisions that matter in RL algorithms applied to LLMs and the literature mapping has no large systematic omissions.

What would settle it

A follow-up survey that locates substantial published work on value-based methods or off-policy actor-critic training for LLMs, or a controlled experiment in which bootstrapping credit assignment fails to produce stable updates on standard LLM post-training tasks.

Figures

Figures reproduced from arXiv: 2606.21943 by Annie Wong, Aske Plaat, Chao Gao, Filip Ilievski, Hengyuan Zhang, Jacob E. Kooi, Jiayang Shi, Kevin Qiu, Lincen Yang, Mark Hoogendoorn, Ngai Wong, Qi Huang, Shiping Yang, Shujian Yu, Ting-Chih Chen, Vincent Fran\c{c}ois-Lavet, Xinrui Zu, Yuxuan Jiang, Zhaochun Ren, Zhao Yang, Zhong Li.

Figure 1
Figure 1. Figure 1: Taxonomy of this survey, organized around four core components of applying RL to LLM training: [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: A brief illustration of five exploration strategies in RL-based LLM training, where the shaded region represents the exploration [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An illustration of the four key design dimensions in RL-based LLM training. Data representation (Section [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Categorization of RL methods for LLM training by solution representation. [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗
read the original abstract

Reinforcement learning (RL) has become central to LLM post-training, yet the methods that dominate current pipelines, PPO and GRPO, represent only a narrow slice of what RL offers. Understanding why these methods prevail, and what alternatives exist, requires a principled examination of the design decisions that underlie any RL algorithm. This survey organizes that examination around three stages of algorithm construction. We begin with MDP creation: how the reward function, state space, action space, termination condition, and discount factor are, or could be, defined for LLM training. We then turn to exploration, covering temperature sampling, entropy regularization, intrinsic motivation, tree search, and curriculum learning. Finally, we address learning along four classical RL dimensions: model-free versus model-based, value-based versus policy-based versus actor-critic, on-policy versus off-policy, and credit assignment, including both Monte Carlo methods, which rely on full return estimates, and bootstrapping methods, which update estimates using other learned predictions. Mapping the LLM literature onto this taxonomy reveals a strikingly non-uniform distribution of research effort. Critic-free policy gradients and Monte Carlo credit assignment are densely populated, while value-based methods, off-policy actor-critic training, and bootstrapping-based credit assignment remain largely unexplored despite well-established counterparts in classical RL. These gaps represent concrete opportunities for transferring proven RL techniques to LLM training. By making these gaps explicit alongside the methods that have proven effective, this survey offers researchers in both RL and LLMs a shared framework for understanding current practice and identifying promising directions for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper surveys RL methods for LLM post-training, organizing algorithm design into three stages: MDP creation (defining reward, state/action spaces, termination, discount), exploration (temperature sampling, entropy, intrinsic motivation, tree search, curriculum), and learning (model-free vs model-based; value-based vs policy-based vs actor-critic; on-policy vs off-policy; Monte Carlo vs bootstrapping credit assignment). It maps LLM literature onto this taxonomy and asserts a non-uniform distribution, with dense coverage of critic-free policy gradients and Monte Carlo methods but large gaps in value-based methods, off-policy actor-critic, and bootstrapping despite classical RL precedents.

Significance. If the literature mapping is systematic and complete, the taxonomy supplies a shared framework that could help RL and LLM researchers identify transferable techniques and prioritize under-explored directions such as value-based or bootstrapped methods in LLM training.

major comments (2)
  1. [Abstract] Abstract: The central claim of a 'strikingly non-uniform distribution of research effort' with specific gaps ('value-based methods, off-policy actor-critic training, and bootstrapping-based credit assignment remain largely unexplored') is load-bearing for the paper's contribution, yet no search protocol, databases, keywords, date range, inclusion criteria, or quantitative paper counts per taxonomy cell are provided. This prevents verification that the asserted gaps are not due to selection bias or omissions.
  2. [Abstract] Abstract and taxonomy description: The three-stage organization is presented as comprehensively capturing 'the design decisions that underlie any RL algorithm,' but the manuscript does not address how hybrid or LLM-specific adaptations (e.g., reward models that blur MDP creation and learning) are classified, which could affect the completeness of the gap analysis.
minor comments (1)
  1. [Abstract] The acronym GRPO is used without expansion in the abstract; it should be defined on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and commit to revisions that strengthen the transparency and completeness of the taxonomy without altering the core contribution.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of a 'strikingly non-uniform distribution of research effort' with specific gaps ('value-based methods, off-policy actor-critic training, and bootstrapping-based credit assignment remain largely unexplored') is load-bearing for the paper's contribution, yet no search protocol, databases, keywords, date range, inclusion criteria, or quantitative paper counts per taxonomy cell are provided. This prevents verification that the asserted gaps are not due to selection bias or omissions.

    Authors: We agree that explicit documentation of the literature scope would improve verifiability. The survey is primarily a conceptual taxonomy organized around classical RL design decisions rather than a quantitative meta-analysis; the observed non-uniformity is illustrated through prominent, representative works rather than exhaustive counts. In revision we will add a short 'Scope and Literature Selection' subsection that states the primary sources (arXiv cs.LG and cs.CL sections, NeurIPS/ICLR/ICML/ACL proceedings 2022–2024), core keyword combinations used, and the decision rule that a paper is mapped if it introduces or applies an RL algorithm to LLM post-training. We will also qualify the gap claims as 'under-represented relative to classical RL literature and to the density of critic-free on-policy Monte Carlo methods' rather than asserting absolute absence. revision: yes

  2. Referee: [Abstract] Abstract and taxonomy description: The three-stage organization is presented as comprehensively capturing 'the design decisions that underlie any RL algorithm,' but the manuscript does not address how hybrid or LLM-specific adaptations (e.g., reward models that blur MDP creation and learning) are classified, which could affect the completeness of the gap analysis.

    Authors: We accept the observation. Reward-model training in LLM pipelines indeed straddles MDP creation (reward definition) and learning (optimization of the reward model itself). In the revised manuscript we will insert a clarifying paragraph in the taxonomy overview that states the classification rule: a component is placed under the stage it primarily modifies (reward models under MDP creation when they define the scalar reward signal; under learning when the focus is the optimization procedure). Overlaps will be explicitly noted with cross-references, and an LLM-specific example (e.g., process reward models) will be added to illustrate the handling of hybrids. This addition preserves the three-stage structure while addressing potential boundary cases. revision: yes

Circularity Check

0 steps flagged

No circularity: survey classifies literature without derivations or self-referential predictions

full rationale

This is a survey paper that proposes a three-stage taxonomy (MDP creation, exploration, learning) and maps existing LLM RL literature onto it. No equations, fitted parameters, predictions, or derivation chains exist that could reduce to inputs by construction. The central claim of non-uniform research effort is an empirical observation about prior work rather than a self-defined or fitted result. Self-citations, if present, are not load-bearing for any mathematical claim. The work is self-contained as a classification exercise with no circular steps matching the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the three-stage taxonomy is a useful and complete lens for RL algorithm design, plus the implicit assumption that the surveyed LLM literature is representative.

axioms (1)
  • domain assumption The three stages (MDP creation, exploration, learning) and four classical dimensions cover the key design decisions in any RL algorithm.
    Invoked in the abstract when organizing the examination around these stages.

pith-pipeline@v0.9.1-grok · 5896 in / 1130 out tokens · 29572 ms · 2026-06-26T12:13:22.843907+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

287 extracted references · 1 canonical work pages · 1 internal anchor

  1. [1]

    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. 2025. The unreasonable effectiveness of entropy minimization in llm reasoning.arXiv preprint arXiv:2505.15134(2025)

  2. [2]

    Pranjal Aggarwal and Sean Welleck. 2025. L1: Controlling how long a reasoning model thinks with reinforcement learning.arXiv preprint arXiv:2503.04697(2025)

  3. [3]

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.arXiv preprint arXiv:2402.14740(2024)

  4. [4]

    Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. 2019. Understanding the impact of entropy on policy optimization. InInternational conference on machine learning. PMLR, 151–160

  5. [5]

    Susan Amin, Maziar Gomrokchi, Harsh Satija, Herke Van Hoof, and Doina Precup. 2021. A survey of exploration methods in reinforcement learning.arXiv preprint arXiv:2109.00157(2021)

  6. [6]

    Ron Amit, Ron Meir, and Kamil Ciosek. 2020. Discount factor as a regularizer in reinforcement learning. InInternational conference on machine learning. PMLR, 269–278

  7. [7]

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. 2017. Hindsight experience replay.Advances in neural information processing systems30 (2017)

  8. [8]

    Yonatan Ashlag, Uri Koren, Mirco Mutti, Esther Derman, Pierre-Luc Bacon, and Shie Mannor. 2025. State Entropy Regularization for Robust Reinforcement Learning.arXiv preprint arXiv:2506.07085(2025)

  9. [9]

    Arthur Aubret, Laetitia Matignon, and Salima Hassas. 2019. A survey on intrinsic motivation in reinforcement learning.arXiv preprint arXiv:1908.06976(2019)

  10. [10]

    Alex Ayoub, Kavosh Asadi, Dale Schuurmans, Csaba Szepesvári, and Karim Bouyarmane. 2025. Learning to Reason Efficiently with Discounted Reinforcement Learning.arXiv preprint arXiv:2510.23486(2025)

  11. [11]

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862(2022)

  12. [12]

    Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. 2025. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926(2025)

  13. [13]

    Toygun Basaklar, Suat Gumussoy, and Umit Y Ogras. 2022. Pd-morl: Preference-driven multi-objective reinforcement learning algorithm.arXiv preprint arXiv:2208.07914(2022)

  14. [14]

    Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, and Shimon Whiteson. 2023. A survey of meta-reinforcement learning.arXiv e-prints(2023), arXiv–2301

  15. [15]

    Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150(2020)

  16. [16]

    Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. 2002. The complexity of decentralized control of Markov decision processes.Mathematics of operations research27, 4 (2002), 819–840

  17. [17]

    Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika39, 3/4 (1952), 324–345

  18. [18]

    Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. 2018. Exploration by random network distillation.arXiv preprint arXiv:1810.12894 (2018)

  19. [19]

    Xin-Qiang Cai, Pushi Zhang, Li Zhao, Jiang Bian, Masashi Sugiyama, and Ashley Llorens. 2023. Distributional pareto-optimal multi-objective reinforcement learning.Advances in Neural Information Processing Systems36 (2023), 15593–15613. Manuscript submitted to ACM 32 Zhao Yang et al

  20. [20]

    Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, and Lei Meng. 2024. Beyond sparse rewards: Enhancing reinforcement learning with language model critique in text generation.arXiv preprint arXiv:2401.07382(2024)

  21. [21]

    Souradip Chakraborty, Soumya Suvra Ghosal, Ming Yin, Dinesh Manocha, Mengdi Wang, Amrit Singh Bedi, and Furong Huang. 2024. Transfer q-star: Principled decoding for llm alignment.Advances in Neural Information Processing Systems37 (2024), 101725–101761

  22. [22]

    Elliot Chane-Sane, Cordelia Schmid, and Ivan Laptev. 2021. Goal-conditioned reinforcement learning with imagined subgoals. InInternational conference on machine learning. PMLR, 1430–1440

  23. [23]

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. 2024. Alphamath almost zero: process supervision without process.Advances in Neural Information Processing Systems37 (2024), 27689–27724

  24. [24]

    Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schuetze, and Kam-Fai Wong. 2026. EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget. InThe Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id= ObF4WIMkY6

  25. [25]

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595(2023)

  26. [26]

    Yuyang Chen, Kaiyan Zhao, Yiming Wang, Ming Yang, Jian Zhang, and Xiaoguang Niu. 2024. Enhancing LLM Agents for Code Generation with Possibility and Pass-rate Prioritized Experience Replay.arXiv preprint arXiv:2410.12236(2024)

  27. [27]

    Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, and Ji-Rong Wen. 2024. Improving large language models via fine-grained reinforcement learning with minimum editing constraint.arXiv preprint arXiv:2401.06081(2024)

  28. [28]

    Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. 2025. Reasoning with exploration: An entropy perspective.arXiv preprint arXiv:2506.14758(2025)

  29. [29]

    Jie Cheng, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Gang Xiong, Yisheng Lv, and Fei-Yue Wang. 2025. Stop summation: Min-form credit assignment is all process reward model needs for reasoning.arXiv preprint arXiv:2504.15275(2025)

  30. [30]

    Nuttapong Chentanez, Andrew Barto, and Satinder Singh. 2004. Intrinsically motivated reinforcement learning.Advances in neural information processing systems17 (2004)

  31. [31]

    Petros Christodoulou. 2019. Soft actor-critic for discrete action settings.arXiv preprint arXiv:1910.07207(2019)

  32. [32]

    Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. 2025. Gpg: A simple and strong reinforcement learning baseline for model reasoning.arXiv preprint arXiv:2504.02546(2025)

  33. [33]

    Pierre Clavier, Nathan Grinsztajn, Raphael Avalos, Yannis Flet-Berliac, Irem Ergun, Omar D Domingues, Eugene Tarassov, Olivier Pietquin, Pierre H Richemond, Florian Strub, et al. 2025. ShiQ: Bringing back Bellman to LLMs.arXiv preprint arXiv:2505.11081(2025)

  34. [34]

    Antonia Creswell, Murray Shanahan, and Irina Higgins. 2022. Selection-inference: Exploiting large language models for interpretable logical reasoning.arXiv preprint arXiv:2205.09712(2022)

  35. [35]

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. 2025. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617(2025)

  36. [36]

    Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, et al. 2025. Cde: Curiosity-driven exploration for efficient reinforcement learning in large language models.arXiv preprint arXiv:2509.09675(2025)

  37. [37]

    2026.DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

    DeepSeek-AI. 2026.DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. Technical Report. DeepSeek-AI. https://huggingface. co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf Technical report

  38. [38]

    Yihe Deng, I Hsu, Jun Yan, Zifeng Wang, Rujun Han, Gufeng Zhang, Yanfei Chen, Wei Wang, Tomas Pfister, Chen-Yu Lee, et al. 2025. Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning.arXiv preprint arXiv:2510.25992(2025)

  39. [39]

    Yixin Dong, Charlie F Ruan, Yaxing Cai, Ziyi Xu, Yilong Zhao, Ruihang Lai, and Tianqi Chen. 2025. Xgrammar: Flexible and efficient structured generation engine for large language models.Proceedings of Machine Learning and Systems7 (2025)

  40. [40]

    Shihan Dou, Muling Wu, Jingwen Xu, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. 2025. Improving rl exploration for llm reasoning through retrospective replay.arXiv preprint arXiv:2504.14363(2025)

  41. [41]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

  42. [42]

    Ayoub Echchahed and Pablo Samuel Castro. 2025. A survey of state representation learning for deep reinforcement learning.arXiv preprint arXiv:2506.17518(2025)

  43. [43]

    Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. 2021. First return, then explore.Nature590, 7847 (2021), 580–586

  44. [44]

    Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu, Yufeng Yuan, et al. 2025. Truncated Proximal Policy Optimization.arXiv preprint arXiv:2506.15050(2025)

  45. [45]

    Meng Fang, Tianyi Zhou, Yali Du, Lei Han, and Zhengyou Zhang. 2019. Curriculum-guided hindsight experience replay.Advances in neural information processing systems32 (2019)

  46. [46]

    William Fedus, Carles Gelada, Yoshua Bengio, Marc G Bellemare, and Hugo Larochelle. 2019. Hyperbolic discounting and learning over multiple horizons.arXiv preprint arXiv:1902.06865(2019)

  47. [47]

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. 2025. ReTool: Reinforcement Learning for Strategic Tool Use in LLMs.arXiv preprint arXiv:2504.11536(2025). https://doi.org/10.48550/arXiv.2504.11536 Manuscript submitted to ACM Modularized Reinforcement Learning on LLMs: From MDP Crea...

  48. [48]

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. 2025. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978 (2025)

  49. [49]

    Vincent François-Lavet, Raphael Fonteneau, and Damien Ernst. 2015. How to discount deep reinforcement learning: Towards new dynamic strategies.arXiv preprint arXiv:1512.02011(2015)

  50. [50]

    Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G Bellemare, Joelle Pineau, et al . 2018. An introduction to deep reinforcement learning.Foundations and Trends®in Machine Learning11, 3-4 (2018), 219–354

  51. [51]

    Scott Fujimoto, Herke Hoof, and David Meger. 2018. Addressing function approximation error in actor-critic methods. InInternational conference on machine learning. PMLR, 1587–1596

  52. [52]

    Scott Fujimoto, David Meger, and Doina Precup. 2020. An equivalence between loss functions and non-uniform sampling in experience replay. Advances in neural information processing systems33 (2020), 14219–14230

  53. [53]

    Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, and Mario Martin. 2024. Simplifying deep temporal difference learning.arXiv preprint arXiv:2407.04811(2024)

  54. [54]

    Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. 2025. Soft adaptive policy optimization.arXiv preprint arXiv:2511.20347(2025)

  55. [55]

    Jingyue Gao, Runji Lin, Keming Lu, Bowen Yu, Junyang Lin, and Jianyu Chen. [n. d.]. MARGE: Improving Math Reasoning with Guided Exploration. InForty-second International Conference on Machine Learning

  56. [56]

    Jingtong Gao, Ling Pan, Yejing Wang, Rui Zhong, Chi Lu, Qingpeng Cai, Peng Jiang, and Xiangyu Zhao. 2025. Navigate the unknown: Enhancing llm reasoning with intrinsic motivation guided exploration.arXiv preprint arXiv:2505.17621(2025)

  57. [57]

    Google DeepMind. 2026. Gemma 4 12B: A Unified, Encoder-Free Multimodal Model. Google AI Blog. https://blog.google/innovation-and- ai/technology/developers-tools/introducing-gemma-4-12b/ Accessed: 2026-06-05

  58. [58]

    Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. 2025. RStar-math: Small LLMs can master math reasoning with self-evolved deep thinking.arXiv preprint arXiv:2501.04519(2025)

  59. [59]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al . 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948(2025)

  60. [60]

    Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Reinforcement learning with deep energy-based policies. InInternational conference on machine learning. PMLR, 1352–1361

  61. [61]

    Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. 2018. Soft actor-critic algorithms and applications.arXiv preprint arXiv:1812.05905(2018)

  62. [62]

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. 2025. Mastering diverse control tasks through world models.Nature(2025), 1–7

  63. [63]

    Nicklas Hansen, Hao Su, and Xiaolong Wang. 2023. Td-mpc2: Scalable, robust world models for continuous control.arXiv preprint arXiv:2310.16828 (2023)

  64. [64]

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. 2024. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769(2024)

  65. [65]

    Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, and Furu Wei. 2025. On-Policy RL with Optimal Reward Baseline.arXiv preprint arXiv:2505.23585(2025)

  66. [66]

    Conor F Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M Zintgraf, Richard Dazeley, Fredrik Heintz, et al. 2022. A practical guide to multi-objective reinforcement learning and planning: CF Hayes et al. Autonomous Agents and Multi-Agent Systems36, 1 (2022), 26

  67. [67]

    Haoran He, Yuxiao Ye, Qingpeng Cai, Chen Hu, Binxing Jiao, Daxin Jiang, and Ling Pan. 2025. Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards. arXiv:2509.24981 [cs.LG] https://arxiv.org/abs/2509.24981

  68. [68]

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al . 2025. Skywork open reasoner 1 technical report.arXiv preprint arXiv:2505.22312(2025)

  69. [69]

    Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. 2018. Rainbow: Combining improvements in deep reinforcement learning. InProceedings of the AAAI conference on artificial intelligence, Vol. 32

  70. [70]

    Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang, Weiming Lu, Yongliang Shen, and Jun Xiao. 2025. Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models.arXiv preprint arXiv:2508.05613(2025)

  71. [71]

    Joey Hong, Anca Dragan, and Sergey Levine. 2024. Q-sft: Q-learning for language models via supervised fine-tuning.arXiv preprint arXiv:2411.05193 (2024)

  72. [72]

    Joey Hong, Anca Dragan, and Sergey Levine. 2025. Planning without search: Refining frontier llms with offline goal-conditioned rl.arXiv preprint arXiv:2505.18098(2025)

  73. [73]

    Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. 2025. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning.arXiv preprint arXiv:2504.01296(2025)

  74. [74]

    Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. 2025. TreeRL: LLM Reinforcement Learning with On-Policy Tree Search. arXiv preprint arXiv:2506.11902(2025). Manuscript submitted to ACM 34 Zhao Yang et al

  75. [75]

    Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. 2016. Vime: Variational information maximizing exploration. Advances in neural information processing systems29 (2016)

  76. [76]

    Bokai Hu, Sai Ashish Somayajula, Xin Pan, and Pengtao Xie. 2024. Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning.arXiv preprint arXiv:2410.11020(2024)

  77. [77]

    Jian Hu. 2025. Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262(2025)

  78. [78]

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. 2025. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290(2025)

  79. [79]

    Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, and Lewis Tunstall. 2024. The n+ implementation details of rlhf with ppo: A case study on tl; dr summarization.arXiv preprint arXiv:2403.17031(2024)

  80. [80]

    Dom Huh and Prasant Mohapatra. 2023. Multi-agent reinforcement learning: A comprehensive survey.arXiv preprint arXiv:2312.10256(2023)

Showing first 80 references.