arxiv: 2509.08827 · v3 · pith:LWH4ADZ4new · submitted 2025-09-10 · 💻 cs.CL · cs.AI· cs.LG

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang , Yuxin Zuo , Bingxiang He , Youbang Sun , Runze Liu , Che Jiang , Yuchen Fan , Kai Tian

show 31 more authors

Guoli Jia Pengfei Li Yu Fu Xingtai Lv Yuchen Zhang Sihang Zeng Shang Qu Haozhan Li Shijie Wang Yuru Wang Xinwei Long Fangfu Liu Xiang Xu Jiaze Ma Xuekai Zhu Ermo Hua Yihao Liu Zonglin Li Huayu Chen Xiaoye Qu Yafu Li Weize Chen Zhenzhao Yuan Junqi Gao Dong Li Zhiyuan Ma Ganqu Cui Zhiyuan Liu Biqing Qi Ning Ding Bowen Zhou

This is my paper

Pith reviewed 2026-05-17 23:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords reinforcement learninglarge language modelslarge reasoning modelssurveyreasoningmathematicscodingscalability

0 comments

The pith

Reinforcement learning has become the main approach for turning large language models into strong reasoners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys recent work on using reinforcement learning to boost reasoning in large language models, especially for math and coding tasks. It argues that RL is now a core technique for building large reasoning models and that progress since DeepSeek-R1 has revealed serious scaling obstacles. The authors review foundational components, core problems, available training resources, and real-world applications to chart a path forward. Their goal is to help the field move toward more scalable systems that could support artificial superintelligence.

Core claim

The paper claims that RL has emerged as a foundational methodology for transforming LLMs into LRMs because of its success on complex logical tasks, and that the field's rapid growth since DeepSeek-R1 now requires a dedicated survey to confront challenges in computational resources, algorithm design, training data, and infrastructure on the way to ASI.

What carries the argument

A structured survey that organizes RL-for-reasoning research into foundational components, core problems, training resources, and downstream applications.

Load-bearing premise

The papers chosen for review accurately represent the main advances and obstacles in the field since DeepSeek-R1.

What would settle it

A high-performing reasoning system that matches or exceeds current LRMs while using little or no reinforcement learning would undermine the claim that RL is foundational.

read the original abstract

In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript surveys recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs), with particular emphasis on post-DeepSeek-R1 literature. It claims that RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs), reviews foundational components, core problems, training resources, and downstream applications, and identifies future directions for scaling RL toward Artificial SuperIntelligence (ASI). An accompanying GitHub awesome list is provided to support the survey.

Significance. If the coverage is representative, the survey provides a timely consolidation of a fast-moving area, highlighting scaling bottlenecks in compute, algorithms, data, and infrastructure. The curated GitHub list is a concrete strength that offers reproducible access to the referenced literature and could accelerate follow-on work on reasoning models.

minor comments (2)

Abstract: the claim that RL is now 'foundational' is presented as a synthesis of the reviewed literature; a short explicit subsection or table summarizing the key performance gains cited from math and coding benchmarks would make the grounding of this claim more transparent to readers.
The manuscript references a large number of recent works; adding a brief statement on the search strategy, inclusion criteria, or cutoff date used to compile the survey would help readers evaluate completeness.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We are pleased that the survey is recognized as a timely consolidation of the post-DeepSeek-R1 literature on RL for LRMs and that the GitHub awesome list is noted as a concrete contribution for reproducibility.

Circularity Check

0 steps flagged

No significant circularity: survey of external literature only

full rationale

This is a survey paper reviewing published advances in RL for LLM reasoning, with no original derivations, equations, fitted parameters, or predictions. Central claims rest on documented external progress (e.g., post-DeepSeek-R1 results) and an external GitHub resource list. No load-bearing self-citations, self-definitional steps, or reductions of results to the survey's own inputs exist. The paper is self-contained as a review and does not introduce any circular derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As a survey the paper does not introduce new free parameters or invented entities; it rests on the domain assumption that RL has demonstrated success on reasoning tasks and that the post-DeepSeek-R1 literature is sufficiently mature to survey.

axioms (1)

domain assumption Reinforcement learning constitutes a foundational methodology for improving reasoning capabilities in large language models
Invoked in the abstract when stating that RL has emerged as foundational after successes in math and coding tasks.

pith-pipeline@v0.9.0 · 5878 in / 1192 out tokens · 37990 ms · 2026-05-17T23:58:34.132665+00:00 · methodology

discussion (0)

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
cs.LG 2026-03 unverdicted novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
cs.LG 2026-05 unverdicted novelty 7.0

Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
cs.CV 2026-05 unverdicted novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving
cs.RO 2026-04 unverdicted novelty 7.0

SCORP delivers 10-28% gains in safety and 2-7% in efficiency metrics on WOMD by using dual-path scene conditioning in diffusion planning plus variance-gated group-relative policy optimization for closed-loop stability.
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
cs.LG 2026-05 unverdicted novelty 6.0

Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetr...
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution
cs.CV 2026-04 unverdicted novelty 6.0

CharTide decouples chart-to-code data into three perspectives and uses inquiry-driven RL with atomic QA verification to let smaller VLMs surpass GPT-4o on chart-to-code tasks.
SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving
cs.RO 2026-04 unverdicted novelty 6.0

Multi-ORFT improves closed-loop multi-agent driving planners by coupling scene-consistent diffusion pre-training with stable online RL post-training, reducing collisions and off-road rates while increasing speed on th...
STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
cs.CL 2026-02 unverdicted novelty 6.0

STAPO stabilizes RL for LLMs by suppressing gradient updates from rare spurious tokens, yielding 11.49% average gains on math benchmarks over GRPO and similar baselines.
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
cs.LG 2026-01 unverdicted novelty 6.0

TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.
Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
cs.CL 2025-12 unverdicted novelty 6.0

NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
cs.CV 2026-04 unverdicted novelty 5.0

PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.
StaRPO: Stability-Augmented Reinforcement Policy Optimization
cs.AI 2026-04 unverdicted novelty 5.0

StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
cs.CL 2026-04 accept novelty 5.0

LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
POPI: Personalizing LLMs via Optimized Natural Language Preference Inference
cs.CL 2025-10 unverdicted novelty 5.0

POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context...
Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
cs.LG 2025-10 unverdicted novelty 5.0

Derives a token-level entropy change approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve p...
Agentic Reasoning for Large Language Models
cs.AI 2026-01 unverdicted novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 18 Pith papers · 71 internal anchors

[1]

Am-deepseek-r1-0528-distilled, June 2025

a-m team. Am-deepseek-r1-0528-distilled, June 2025. URL https://github.com/a-m-team/a-m-models

work page 2025
[2]

Phi-4-reasoning Technical Report

Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, et al. Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Reincarnating reinforcement learning: Reusing prior computation to accelerate progress

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Reincarnating reinforcement learning: Reusing prior computation to accelerate progress. Advances in neural information processing systems, 35: 0 28955--28971, 2022

work page 2022
[4]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arxiv preprint arXiv: 2508.10925, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

The unreasonable effec- tiveness of entropy minimization in llm reasoning.arXiv preprint arXiv:2505.15134,

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025 b

work page arXiv 2025
[6]

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025

work page internal anchor Pith review arXiv 2025
[7]

Scaling laws for generative mixed-modal language models

Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. Scaling laws for generative mixed-modal language models. In International Conference on Machine Learning, pages 265--279. PMLR, 2023

work page 2023
[8]

Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines

Constantin Ahlmann-Eltze, Wolfgang Huber, and Simon Anders. Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines. Nature Methods, pages 1657--1661, 2025

work page 2025
[9]

OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

Arash Ahmadian, Chris Cremer, Matthias Gall \'e , Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet \"U st \"u n, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Kimi-researcher: End-to-end rl training for emerging agentic capabilities

Moonshot AI. Kimi-researcher: End-to-end rl training for emerging agentic capabilities. https://moonshotai.github.io/Kimi-Researcher/, 2025. Accessed: 2025-08-13

work page 2025
[12]

InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

Qihang Ai, Pi Bu, Yue Cao, Yingyao Wang, Jihao Gu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Zhicheng Zheng, Jun Song, et al. Inquiremobile: Teaching vlm-based mobile agent to request human assistance via reinforcement fine-tuning. arXiv preprint arXiv:2508.19679, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models

Alon Albalak, Duy Phung, Nathan Lile, Rafael Rafailov, Kanishk Gandhi, Louis Castricato, Anikait Singh, Chase Blagden, Violet Xiang, Dakota Mahan, et al. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models. arXiv preprint arXiv:2502.17387, 2025

work page arXiv 2025
[14]

Qwen3-vl: Sharper vision, deeper thought, broader action, 2025

Alibaba-Qwen. Qwen3-vl: Sharper vision, deeper thought, broader action, 2025. URL https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list

work page 2025
[15]

Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025

Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. URL https://hkunlp.github.io/blog/2025/Polaris

work page 2025
[16]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko S \"u nderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674--3683, 2018

work page 2018
[17]

David Anugraha, Zilu Tang, Lester James V

Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D Chang, and Prithviraj Ammanabrolu. Critique-out-loud reward models. arXiv preprint arXiv:2408.11791, 2024

work page arXiv 2024
[18]

Claude 3.7 sonnet and claude code, 2025 a

Anthropic . Claude 3.7 sonnet and claude code, 2025 a . URL https://www.anthropic.com/news/claude-3-7-sonnet

work page 2025
[19]

Claude opus 4.1, 2025 b

Anthropic . Claude opus 4.1, 2025 b . URL https://www.anthropic.com/claude/opus

work page 2025
[20]

Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint arXiv:2503.08679,

Iv \'a n Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025

work page arXiv 2025
[21]

Bradley Butcher, Michael O’Keefe, and James Titchener

Daman Arora and Andrea Zanette. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463, 2025

work page arXiv 2025
[22]

Gazal-r1: Scaling medical reasoning with grpo and multi-component reward design, 2025

Pranav Arora, Rohan Gupta, and Kavya Patel. Gazal-r1: Scaling medical reasoning with grpo and multi-component reward design, 2025. URL https://arxiv.org/abs/2506.21594

work page arXiv 2025
[23]

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and Volodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. arXiv preprint arXiv:2505.20411, 2025

work page arXiv 2025
[26]

Intern-s1: A scientific multimodal foundation model

Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, et al. Intern-s1: A scientific multimodal foundation model. arXiv preprint arXiv:2508.15763, 2025

work page arXiv 2025
[27]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022 a

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022 b

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

Ernie 4.5 technical report

Baidu. Ernie 4.5 technical report. https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf, 2025

work page 2025
[30]

Monitoring reasoning models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926,

Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025

work page arXiv 2025
[31]

Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, and et al

Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, and et al. Genie 3: A new frontier for world models. https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/, 2025

work page 2025
[32]

Language agents for hypothesis-driven clinical decision making with reinforcement learning, 2025

David Bani-Harouni. Language agents for hypothesis-driven clinical decision making with reinforcement learning, 2025. URL https://arxiv.org/abs/2506.13474

work page arXiv 2025
[33]

Zero-shot model-based reinforcement learning using large language models

Abdelhakim Benechehab, Youssef Attia El Hili, Ambroise Odonnat, Oussama Zekri, Albert Thomas, Giuseppe Paolo, Maurizio Filippone, Ievgen Redko, and Bal \'a zs K \'e gl. Zero-shot model-based reinforcement learning using large language models. arXiv preprint arXiv:2410.11711, 2024

work page arXiv 2024
[34]

Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, et al

Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arxiv preprint arXiv: 2505.00949, 2025

work page arXiv 2025
[35]

Comps: Continual meta policy search

Glen Berseth, Zhiwei Zhang, Grace Zhang, Chelsea Finn, and Sergey Levine. Comps: Continual meta policy search. arXiv preprint arXiv:2112.04467, 2021

work page arXiv 2021
[36]

Language models that think, chat better

Adithya Bhaskar, Xi Ye, and Danqi Chen. Language models that think, chat better. arXiv preprint arXiv:2509.20357, 2025

work page arXiv 2025
[37]

OwkinZero : Accelerating biological discovery with AI

Nathan Bigaud, Vincent Cabeli, Meltem Gürel, Arthur Pignet, John Klein, Gilles Wainrib, and Eric Durand. OwkinZero : Accelerating biological discovery with AI . arXiv preprint arXiv: 2508.16315, 2025

work page arXiv 2025
[38]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Training diffusion models with reinforcement learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024 b . URL https://openreview.net/forum?id=YCWjhGrJFD

work page 2024
[40]

Autonomous chemical research with large language models

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624 0 (7992): 0 570--578, 2023

work page 2023
[41]

Preference-based alignment of discrete diffusion models

Umberto Borso, Davide Paglieri, Jude Wells, and Tim Rockt \"a schel. Preference-based alignment of discrete diffusion models. arXiv preprint arXiv:2503.08295, 2025

work page arXiv 2025
[42]

Settling the reward hypothesis

Michael Bowling, John D Martin, David Abel, and Will Dabney. Settling the reward hypothesis. In International Conference on Machine Learning, pages 3003--3020. PMLR, 2023

work page 2023
[43]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R \'e , and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901
[45]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[46]

How to build the virtual cell with artificial intelligence: Priorities and opportunities

Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B Burkhardt, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell, 187 0 (25): 0 7045--7063, 2024

work page 2024
[47]

Weak-to-stronggeneralization: Eliciting strong capabilities with weak supervision.arXiv preprint arXiv:2312.09390,

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023

work page arXiv 2023
[48]

A comprehensive survey of multiagent reinforcement learning

Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38 0 (2): 0 156--172, 2008

work page 2008
[49]

Multi-agent reinforcement learning: A review of challenges and applications

Lorenzo Canese, Gian Carlo Cardarilli, Luca Di Nunzio, Rocco Fazzolari, Daniele Giardino, Marco Re, and Sergio Span \`o . Multi-agent reinforcement learning: A review of challenges and applications. Applied Sciences, 11 0 (11): 0 4948, 2021

work page 2021
[50]

Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning. arXiv preprint arXiv:2505.20272, 2025

work page arXiv 2025
[51]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Semi-supervised classification by low density separation

Olivier Chapelle and Alexander Zien. Semi-supervised classification by low density separation. In International workshop on artificial intelligence and statistics, pages 57--64. PMLR, 2005

work page 2005
[53]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

xverify: Efficient answer verifier for reasoning model evalua- tions.arXiv preprint arXiv:2504.10481,

Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao Zhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan Yang, and Zhiyu Li. xverify: Efficient answer verifier for reasoning model evaluations. arXiv preprint arXiv:2504.10481, 2025 b

work page arXiv 2025
[55]

Autoagents: A framework for automatic agent generation.arXiv preprint arXiv:2309.17288, 2023

Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, B \"o rje F Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023 a

work page arXiv 2023
[56]

Bridging supervised learning and reinforcement learning in math reasoning

Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, and Haoxiang Wang. Bridging supervised learning and reinforcement learning in math reasoning. arXiv preprint arXiv:2505.18116, 2025 c

work page arXiv 2025
[57]

Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles

Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, et al. Enigmata: Scaling logical reasoning in large language models with synthetic verifiable puzzles. arXiv preprint arXiv:2505.19914, 2025 d

work page arXiv 2025
[58]

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. Huatuogpt-o1, towards medical complex reasoning with llms, 2024 a . URL https://arxiv.org/abs/2412.18925

work page internal anchor Pith review Pith/arXiv arXiv 2024
[59]

Breaking the sft plateau: Multimodal structured reinforcement learning for chart-to-code generation

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, and Lin Ma. Breaking the sft plateau: Multimodal structured reinforcement learning for chart-to-code generation. arXiv preprint arXiv:2508.13587, 2025 e

work page arXiv 2025
[60]

Chart-r1: Chain-of-thought supervision and reinforcement for advanced chart reasoner

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Yufeng Zhong, and Lin Ma. Chart-r1: Chain-of-thought supervision and reinforcement for advanced chart reasoner. arXiv preprint arXiv:2507.15509, 2025 f

work page arXiv 2025
[61]

G1: Bootstrapping perception and reasoning abilities of vision-language model via reinforcement learning

Liang Chen, Hongcheng Gao, Tianyu Liu, Zhiqi Huang, Flood Sung, Xinyu Zhou, Yuxin Wu, and Baobao Chang. G1: Bootstrapping perception and reasoning abilities of vision-language model via reinforcement learning. arXiv preprint arXiv:2505.13426, 2025 g

work page arXiv 2025
[62]

Beyond two-stage training: Cooperative sft and rl for llm reasoning, 2025 h

Liang Chen, Xueting Han, Li Shen, Jing Bai, and Kam-Fai Wong. Beyond two-stage training: Cooperative sft and rl for llm reasoning, 2025 h . URL https://arxiv.org/abs/2509.06948

work page arXiv 2025
[63]

Self-questioning language models

Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. Self-questioning language models. arXiv preprint arXiv:2508.03682, 2025 i

work page arXiv 2025
[64]

Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization.arXiv preprint arXiv:2505.12346,

Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization. arXiv preprint arXiv:2505.12346, 2025 j

work page arXiv 2025
[65]

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, Fan Yang, et al. Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470, 2025 k

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

Judgelrm: Large reasoning models as a judge, 2025a

Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, and Bingsheng He. Judgelrm: Large reasoning models as a judge. arXiv preprint arXiv:2504.00050, 2025 l

work page arXiv 2025
[67]

Stepwise guided policy optimization: Coloring your incorrect reasoning in grpo

Peter Chen, Xiaopeng Li, Ziniu Li, Xi ChenD, and Tianyi Lin. Stepwise guided policy optimization: Coloring your incorrect reasoning in grpo. arXiv preprint arXiv:2505.11595, 2025 m

work page arXiv 2025
[68]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025 n

work page internal anchor Pith review Pith/arXiv arXiv 2025
[69]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Qiwei Liang, Zixuan Li, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025 o

work page internal anchor Pith review Pith/arXiv arXiv 2025
[70]

Self-evolving curriculum for llm reasoning

Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Pich \'e , Nicolas Gontier, Yoshua Bengio, and Ehsan Kamalloo. Self-evolving curriculum for llm reasoning. arXiv preprint arXiv:2505.14970, 2025 p

work page arXiv 2025
[71]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024
[72]

arXiv preprint arXiv:2505.02387 , year=

Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. Rm-r1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025 q

work page arXiv 2025
[73]

Perception before reasoning: Two-stage reinforcement learning for visual reasoning in vision-language models

Yan Chen, Long Li, Teng Xi, Long Zeng, and Jingdong Wang. Perception before reasoning: Two-stage reinforcement learning for visual reasoning in vision-language models. arXiv preprint arXiv:2509.13031, 2025 r

work page arXiv 2025
[74]

Acereason-nemotron: Advancing math and code reasoning through reinforcement learning

Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning. arXiv preprint arXiv:2505.16400, 2025 s

work page arXiv 2025
[75]

R1-code-interpreter: Training llms to reason with code via supervised and reinforcement learning

Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, and Chuchu Fan. R1-code-interpreter: Training llms to reason with code via supervised and reinforcement learning. arXiv preprint arXiv:2505.21668, 2025 t

work page arXiv 2025
[76]

Bi-dexhands: Towards human-level bimanual dexterous manipulation

Yuanpei Chen, Yiran Geng, Fangwei Zhong, Jiaming Ji, Jiechuang Jiang, Zongqing Lu, Hao Dong, and Yaodong Yang. Bi-dexhands: Towards human-level bimanual dexterous manipulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46 0 (5): 0 2804--2818, 2023 b

work page 2023
[77]

Conrft: A reinforced fine-tuning method for vla models via consistency policy,

Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy. arXiv preprint arXiv:2502.05450, 2025 u

work page arXiv 2025
[78]

Scaling rl to long videos

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. arXiv preprint arXiv:2507.07966, 2025 v

work page arXiv 2025
[79]

Enhancing llm agents for code generation with possibility and pass-rate prioritized experience replay

Yuyang Chen, Kaiyan Zhao, Yiming Wang, Ming Yang, Jian Zhang, and Xiaoguang Niu. Enhancing llm agents for code generation with possibility and pass-rate prioritized experience replay. arXiv preprint arXiv:2410.12236, 2024 c

work page arXiv 2024
[80]

Visrl: Intention-driven visual perception via reinforced reasoning

Zhangquan Chen, Xufang Luo, and Dongsheng Li. Visrl: Intention-driven visual perception via reinforced reasoning. arXiv preprint arXiv:2503.07523, 2025 w

work page arXiv 2025

Showing first 80 references.