pith. machine review for the scientific record.

arxiv: 2509.08827 · v3 · pith:LWH4ADZ4new · submitted 2025-09-10 · 💻 cs.CL · cs.AI · cs.LG

A Survey of Reinforcement Learning for Large Reasoning Models

Pith reviewed 2026-05-17 23:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords reinforcement learning, large language models, large reasoning models, survey, reasoning, mathematics, coding, scalability

The pith

Reinforcement learning has become the main approach for turning large language models into strong reasoners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys recent work on using reinforcement learning to boost reasoning in large language models, especially for math and coding tasks. It argues that RL is now a core technique for building large reasoning models and that progress since DeepSeek-R1 has revealed serious scaling obstacles. The authors review foundational components, core problems, available training resources, and real-world applications to chart a path forward. Their goal is to help the field move toward more scalable systems that could support artificial superintelligence.

Core claim

The paper claims that RL has emerged as a foundational methodology for transforming LLMs into LRMs because of its success on complex logical tasks, and that the field's rapid growth since DeepSeek-R1 now requires a dedicated survey to confront challenges in computational resources, algorithm design, training data, and infrastructure on the way to ASI.

What carries the argument

A structured survey that organizes RL-for-reasoning research into foundational components, core problems, training resources, and downstream applications.

Load-bearing premise

The papers chosen for review accurately represent the main advances and obstacles in the field since DeepSeek-R1.

What would settle it

A high-performing reasoning system that matches or exceeds current LRMs while using little or no reinforcement learning would undermine the claim that RL is foundational.

Original abstract

In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript surveys recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs), with particular emphasis on post-DeepSeek-R1 literature. It claims that RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs), reviews foundational components, core problems, training resources, and downstream applications, and identifies future directions for scaling RL toward Artificial SuperIntelligence (ASI). An accompanying GitHub awesome list is provided to support the survey.

Significance. If the coverage is representative, the survey provides a timely consolidation of a fast-moving area, highlighting scaling bottlenecks in compute, algorithms, data, and infrastructure. The curated GitHub list is a concrete strength that offers reproducible access to the referenced literature and could accelerate follow-on work on reasoning models.

minor comments (2)
  1. Abstract: the claim that RL is now 'foundational' is presented as a synthesis of the reviewed literature; a short explicit subsection or table summarizing the key performance gains cited from math and coding benchmarks would make the grounding of this claim more transparent to readers.
  2. The manuscript references a large number of recent works; adding a brief statement on the search strategy, inclusion criteria, or cutoff date used to compile the survey would help readers evaluate completeness.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We are pleased that the survey is recognized as a timely consolidation of the post-DeepSeek-R1 literature on RL for LRMs and that the GitHub awesome list is noted as a concrete contribution for reproducibility.

Circularity Check

0 steps flagged

No significant circularity: survey of external literature only

full rationale

This is a survey paper reviewing published advances in RL for LLM reasoning, with no original derivations, equations, fitted parameters, or predictions. Central claims rest on documented external progress (e.g., post-DeepSeek-R1 results) and an external GitHub resource list. No load-bearing self-citations, self-definitional steps, or reductions of results to the survey's own inputs exist. The paper is self-contained as a review and does not introduce any circular derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a survey the paper does not introduce new free parameters or invented entities; it rests on the domain assumption that RL has demonstrated success on reasoning tasks and that the post-DeepSeek-R1 literature is sufficiently mature to survey.

axioms (1)
  • domain assumption: Reinforcement learning constitutes a foundational methodology for improving reasoning capabilities in large language models
    Invoked in the abstract when stating that RL has emerged as foundational after successes in math and coding tasks.

pith-pipeline@v0.9.0 · 5878 in / 1192 out tokens · 37990 ms · 2026-05-17T23:58:34.132665+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

    cs.LG 2026-03 unverdicted novelty 8.0

    Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

  2. Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...

  3. Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

    cs.CV 2026-05 unverdicted novelty 7.0

    RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...

  4. SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving

    cs.RO 2026-04 unverdicted novelty 7.0

    SCORP delivers 10-28% gains in safety and 2-7% in efficiency metrics on WOMD by using dual-path scene conditioning in diffusion planning plus variance-gated group-relative policy optimization for closed-loop stability.

  5. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  6. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  7. Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    cs.LG 2026-05 unverdicted novelty 6.0

    Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetr...

  8. Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

    cs.CV 2026-05 unverdicted novelty 6.0

    Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

  9. Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...

  10. CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution

    cs.CV 2026-04 unverdicted novelty 6.0

    CharTide decouples chart-to-code data into three perspectives and uses inquiry-driven RL with atomic QA verification to let smaller VLMs surpass GPT-4o on chart-to-code tasks.

  11. SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    Multi-ORFT improves closed-loop multi-agent driving planners by coupling scene-consistent diffusion pre-training with stable online RL post-training, reducing collisions and off-road rates while increasing speed on th...

  12. STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

    cs.CL 2026-02 unverdicted novelty 6.0

    STAPO stabilizes RL for LLMs by suppressing gradient updates from rare spurious tokens, yielding 11.49% average gains on math benchmarks over GRPO and similar baselines.

  13. The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

    cs.LG 2026-01 unverdicted novelty 6.0

    TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.

  14. Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

    cs.CL 2025-12 unverdicted novelty 6.0

    NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.

  15. Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

    cs.CV 2026-04 unverdicted novelty 5.0

    PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.

  16. StaRPO: Stability-Augmented Reinforcement Policy Optimization

    cs.AI 2026-04 unverdicted novelty 5.0

    StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.

  17. Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

    cs.CL 2026-04 accept novelty 5.0

    LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

  18. POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

    cs.CL 2025-10 unverdicted novelty 5.0

    POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context...

  19. Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

    cs.LG 2025-10 unverdicted novelty 5.0

    Derives a token-level entropy change approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve p...

  20. Agentic Reasoning for Large Language Models

    cs.AI 2026-01 unverdicted novelty 4.0

    The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 18 Pith papers · 71 internal anchors

  1. [1]

    Am-deepseek-r1-0528-distilled, June 2025

    a-m team. Am-deepseek-r1-0528-distilled, June 2025. URL https://github.com/a-m-team/a-m-models

  2. [2]

    Phi-4-reasoning Technical Report

    Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, et al. Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318, 2025

  3. [3]

    Reincarnating reinforcement learning: Reusing prior computation to accelerate progress

    Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Advances in neural information processing systems, 35: 0 28955--28971, 2022

  4. [4]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arxiv preprint arXiv: 2508.10925, 2025 a

  5. [5]

    The unreasonable effec- tiveness of entropy minimization in llm reasoning.arXiv preprint arXiv:2505.15134,

    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025 b

  6. [6]

    L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025

  7. [7]

    Scaling laws for generative mixed-modal language models

    Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pages 265--279. PMLR, 2023

  8. [8]

    Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines

    Constantin Ahlmann-Eltze, Wolfgang Huber, and Simon Anders. Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines. Nature Methods, pages 1657--1661, 2025

  9. [9]

    OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

    Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025

  10. [10]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gall \'e , Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet \"U st \"u n, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024

  11. [11]

    Kimi-researcher: End-to-end rl training for emerging agentic capabilities

    Moonshot AI. Kimi-researcher: End-to-end rl training for emerging agentic capabilities. https://moonshotai.github.io/Kimi-Researcher/, 2025. Accessed: 2025-08-13

  12. [12]

    InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

    Qihang Ai, Pi Bu, Yue Cao, Yingyao Wang, Jihao Gu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Zhicheng Zheng, Jun Song, et al. Inquiremobile: Teaching vlm-based mobile agent to request human assistance via reinforcement fine-tuning. arXiv preprint arXiv:2508.19679, 2025

  13. [13]

    Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models

    Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, et al. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models. arXiv preprint arXiv:2502.17387, 2025

  14. [14]

    Qwen3-vl: Sharper vision, deeper thought, broader action, 2025

    Alibaba-Qwen. Qwen3-vl: Sharper vision, deeper thought, broader action, 2025. URL https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list

  15. [15]

    Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025

    Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. URL https://hkunlp.github.io/blog/2025/Polaris

  16. [16]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S \"u nderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674--3683, 2018

  17. [17]

    David Anugraha, Zilu Tang, Lester James V

    Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models. arXiv preprint arXiv:2408.11791, 2024

  18. [18]

    Claude 3.7 sonnet and claude code, 2025 a

    Anthropic . Claude 3.7 sonnet and claude code, 2025 a . URL https://www.anthropic.com/news/claude-3-7-sonnet

  19. [19]

    Claude opus 4.1, 2025 b

    Anthropic . Claude opus 4.1, 2025 b . URL https://www.anthropic.com/claude/opus

  20. [20]

    Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint arXiv:2503.08679,

    Iv \'a n Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025

  21. [21]

    Bradley Butcher, Michael O’Keefe, and James Titchener

    Daman Arora and Andrea Zanette. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463, 2025

  22. [22]

    Gazal-r1: Scaling medical reasoning with grpo and multi-component reward design, 2025

    Pranav Arora, Rohan Gupta, and Kavya Patel. Gazal-r1: Scaling medical reasoning with grpo and multi-component reward design, 2025. URL https://arxiv.org/abs/2506.21594

  23. [23]

    Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

    Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573, 2025

  24. [24]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  25. [25]

    Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents

    Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. arXiv preprint arXiv:2505.20411, 2025

  26. [26]

    Intern-s1: A scientific multimodal foundation model

    Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, et al. Intern-s1: A scientific multimodal foundation model. arXiv preprint arXiv:2508.15763, 2025

  27. [27]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022 a

  28. [28]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022 b

  29. [29]

    Ernie 4.5 technical report

    Baidu. Ernie 4.5 technical report. https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf, 2025

  30. [30]

    Monitoring reasoning models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926,

    Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025

  31. [31]

    Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, and et al

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, and et al. Genie 3: A new frontier for world models. https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/, 2025

  32. [32]

    Language agents for hypothesis-driven clinical decision making with reinforcement learning, 2025

    David Bani-Harouni. Language agents for hypothesis-driven clinical decision making with reinforcement learning, 2025. URL https://arxiv.org/abs/2506.13474

  33. [33]

    Zero-shot model-based reinforcement learning using large language models

    Abdelhakim Benechehab, Youssef Attia El Hili, Ambroise Odonnat, Oussama Zekri, Albert Thomas, Giuseppe Paolo, Maurizio Filippone, Ievgen Redko, and Bal \'a zs K \'e gl. Zero-shot model-based reinforcement learning using large language models. arXiv preprint arXiv:2410.11711, 2024

  34. [34]

    Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, et al

    Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arxiv preprint arXiv: 2505.00949, 2025

  35. [35]

    Comps: Continual meta policy search

    Glen Berseth, Zhiwei Zhang, Grace Zhang, Chelsea Finn, and Sergey Levine. Comps: Continual meta policy search. arXiv preprint arXiv:2112.04467, 2021

  36. [36]

    Language models that think, chat better

    Adithya Bhaskar, Xi Ye, and Danqi Chen. Language models that think, chat better. arXiv preprint arXiv:2509.20357, 2025

  37. [37]

    OwkinZero : Accelerating biological discovery with AI

    Nathan Bigaud, Vincent Cabeli, Meltem Gürel, Arthur Pignet, John Klein, Gilles Wainrib, and Eric Durand. OwkinZero : Accelerating biological discovery with AI . arXiv preprint arXiv: 2508.16315, 2025

  38. [38]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024 a

  39. [39]

    Training diffusion models with reinforcement learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024 b . URL https://openreview.net/forum?id=YCWjhGrJFD

  40. [40]

    Autonomous chemical research with large language models

    Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624 0 (7992): 0 570--578, 2023

  41. [41]

    Preference-based alignment of discrete diffusion models

    Umberto Borso, Davide Paglieri, Jude Wells, and Tim Rockt \"a schel. Preference-based alignment of discrete diffusion models. arXiv preprint arXiv:2503.08295, 2025

  42. [42]

    Settling the reward hypothesis

    Michael Bowling, John D Martin, David Abel, and Will Dabney. Settling the reward hypothesis. In International Conference on Machine Learning, pages 3003--3020. PMLR, 2023

  43. [43]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R \'e , and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

  44. [44]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  45. [45]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

  46. [46]

    How to build the virtual cell with artificial intelligence: Priorities and opportunities

    Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B Burkhardt, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell, 187 0 (25): 0 7045--7063, 2024

  47. [47]

    Weak-to-stronggeneralization: Eliciting strong capabilities with weak supervision.arXiv preprint arXiv:2312.09390,

    Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023

  48. [48]

    A comprehensive survey of multiagent reinforcement learning

    Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38 0 (2): 0 156--172, 2008

  49. [49]

    Multi-agent reinforcement learning: A review of challenges and applications

    Lorenzo Canese, Gian Carlo Cardarilli, Luca Di Nunzio, Rocco Fazzolari, Daniele Giardino, Marco Re, and Sergio Span \`o . Multi-agent reinforcement learning: A review of challenges and applications. Applied Sciences, 11 0 (11): 0 4948, 2021

  50. [50]

    Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning

    Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning. arXiv preprint arXiv:2505.20272, 2025

  51. [51]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

  52. [52]

    Semi-supervised classification by low density separation

    Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density separation. In International workshop on artificial intelligence and statistics, pages 57--64. PMLR, 2005

  53. [53]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025 a

  54. [54]

    xverify: Efficient answer verifier for reasoning model evalua- tions.arXiv preprint arXiv:2504.10481,

    Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao Zhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan Yang, and Zhiyu Li. xverify: Efficient answer verifier for reasoning model evaluations. arXiv preprint arXiv:2504.10481, 2025 b

  55. [55]

    Autoagents: A framework for automatic agent generation.arXiv preprint arXiv:2309.17288, 2023

    Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, B \"o rje F Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023 a

  56. [56]

    Bridging supervised learning and reinforcement learning in math reasoning

    Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, and Haoxiang Wang. Bridging supervised learning and reinforcement learning in math reasoning. arXiv preprint arXiv:2505.18116, 2025 c

  57. [57]

    Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles

    Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, et al. Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles. arXiv preprint arXiv:2505.19914, 2025 d

  58. [58]

    HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. Huatuogpt-o1, towards medical complex reasoning with llms, 2024 a. URL https://arxiv.org/abs/2412.18925

  59. [59]

    Breaking the sft plateau: Multimodal structured reinforcement learning for chart-to-code generation

    Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, and Lin Ma. Breaking the sft plateau: Multimodal structured reinforcement learning for chart-to-code generation. arXiv preprint arXiv:2508.13587, 2025 e

  60. [60]

    Chart-r1: Chain-of-thought supervision and reinforcement for advanced chart reasoner

    Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Yufeng Zhong, and Lin Ma. Chart-r1: Chain-of-thought supervision and reinforcement for advanced chart reasoner. arXiv preprint arXiv:2507.15509, 2025 f

  61. [61]

    G1: Bootstrapping perception and reasoning abilities of vision-language model via reinforcement learning

    Liang Chen, Hongcheng Gao, Tianyu Liu, Zhiqi Huang, Flood Sung, Xinyu Zhou, Yuxin Wu, and Baobao Chang. G1: Bootstrapping perception and reasoning abilities of vision-language model via reinforcement learning. arXiv preprint arXiv:2505.13426, 2025 g

  62. [62]

    Beyond two-stage training: Cooperative sft and rl for llm reasoning

    Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative sft and rl for llm reasoning, 2025 h. URL https://arxiv.org/abs/2509.06948

  63. [63]

    Self-questioning language models

    Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. Self-questioning language models. arXiv preprint arXiv:2508.03682, 2025 i

  64. [64]

    Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization

    Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization. arXiv preprint arXiv:2505.12346, 2025 j

  65. [65]

    ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

    Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. Research: Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025 k

  66. [66]

    Judgelrm: Large reasoning models as a judge

    Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, and Bingsheng He. Judgelrm: Large reasoning models as a judge. arXiv preprint arXiv:2504.00050, 2025 l

  67. [67]

    Stepwise guided policy optimization: Coloring your incorrect reasoning in grpo

    Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, and Tianyi Lin. Stepwise guided policy optimization: Coloring your incorrect reasoning in grpo. arXiv preprint arXiv:2505.11595, 2025 m

  68. [68]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025 n

  69. [69]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Qiwei Liang, Zixuan Li, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025 o

  70. [70]

    Self-evolving curriculum for llm reasoning

    Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, and Ehsan Kamalloo. Self-evolving curriculum for llm reasoning. arXiv preprint arXiv:2505.14970, 2025 p

  71. [71]

    Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024 b

  72. [72]

    Rm-r1: Reward modeling as reasoning

    Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. Rm-r1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025 q

  73. [73]

    Perception before reasoning: Two-stage reinforcement learning for visual reasoning in vision-language models

    Yan Chen, Long Li, Teng Xi, Long Zeng, and Jingdong Wang. Perception before reasoning: Two-stage reinforcement learning for visual reasoning in vision-language models. arXiv preprint arXiv:2509.13031, 2025 r

  74. [74]

    Acereason-nemotron: Advancing math and code reasoning through reinforcement learning

    Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning. arXiv preprint arXiv:2505.16400, 2025 s

  75. [75]

    R1-code-interpreter: Training llms to reason with code via supervised and reinforcement learning

    Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, and Chuchu Fan. R1-code-interpreter: Training llms to reason with code via supervised and reinforcement learning. arXiv preprint arXiv:2505.21668, 2025 t

  76. [76]

    Bi-dexhands: Towards human-level bimanual dexterous manipulation

    Yuanpei Chen, Yiran Geng, Fangwei Zhong, Jiaming Ji, Jiechuang Jiang, Zongqing Lu, Hao Dong, and Yaodong Yang. Bi-dexhands: Towards human-level bimanual dexterous manipulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5): 2804--2818, 2023 b

  77. [77]

    Conrft: A reinforced fine-tuning method for vla models via consistency policy

    Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy. arXiv preprint arXiv:2502.05450, 2025 u

  78. [78]

    Scaling rl to long videos

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. arXiv preprint arXiv:2507.07966, 2025 v

  79. [79]

    Enhancing llm agents for code generation with possibility and pass-rate prioritized experience replay

    Yuyang Chen, Kaiyan Zhao, Yiming Wang, Ming Yang, Jian Zhang, and Xiaoguang Niu. Enhancing llm agents for code generation with possibility and pass-rate prioritized experience replay. arXiv preprint arXiv:2410.12236, 2024 c

  80. [80]

    Visrl: Intention-driven visual perception via reinforced reasoning

    Zhangquan Chen, Xufang Luo, and Dongsheng Li. Visrl: Intention-driven visual perception via reinforced reasoning. arXiv preprint arXiv:2503.07523, 2025 w

Showing first 80 references.