pith. machine review for the scientific record.

arxiv: 2503.24290 · v2 · submitted 2025-03-31 · 💻 cs.LG · cs.CL

Recognition: 3 theorem links

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Daxin Jiang, Heung-Yeung Shum, Jingcheng Hu, Qi Han, Xiangyu Zhang, Yinmin Zhang

Pith reviewed 2026-05-12 21:53 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords reinforcement learning · reasoning models · PPO · base model training · scaling laws · open source · rule-based rewards

The pith

A minimalist vanilla PPO setup with rule-based rewards and no KL regularization scales reasoning performance and response length on base models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning for reasoning can succeed with a deliberately simple recipe: standard PPO with generalized advantage estimation (GAE) at λ=1 and γ=1, paired with straightforward rule-based rewards and no KL penalty term. When this recipe is applied to the Qwen2.5-32B base model, both benchmark accuracy on math and science tasks and average response length increase steadily with training, reproducing the scaling behavior previously reported for more elaborate pipelines. The same training run also finishes with higher final scores on AIME2024, MATH500, and GPQA Diamond while consuming roughly one-tenth the training steps required by the comparison pipeline. The authors further show that the learned critic naturally penalizes repetitive output patterns, which improves advantage estimates and keeps training stable.

Core claim

Vanilla PPO with GAE (λ=1, γ=1) and rule-based rewards, without any KL regularization, is sufficient to scale both benchmark performance and response length on the base model, achieving superior results on AIME2024, MATH500, and GPQA Diamond with only 1/10 the training steps of the DeepSeek-R1-Zero pipeline on the identical Qwen2.5-32B base.

What carries the argument

The minimalist RL loop that applies vanilla PPO with GAE parameters fixed at λ=1 and γ=1 together with simple rule-based rewards and no KL term, allowing direct scaling of reasoning capability from the base model.
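
With λ=1 and γ=1, GAE collapses to the undiscounted Monte-Carlo return minus the critic's value baseline, which is what makes the recipe minimalist. A small illustrative sketch of that reduction (not the authors' released code):

```python
def gae_advantages(rewards, values, lam=1.0, gamma=1.0):
    """Generalized Advantage Estimation for one trajectory.

    `values` must hold len(rewards) + 1 entries (a bootstrap value
    appended at the end).  With lam=1 and gamma=1, as in the paper's
    recipe, this collapses to the Monte-Carlo return minus the
    critic baseline: A_t = sum_{k >= t} r_k - V(s_t).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # one-step TD error, then the usual backward GAE recursion
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

For a rule-rewarded rollout with a single terminal reward, every token's advantage becomes that final reward minus the critic's value at the token, which is where the critic's devaluation of repetitive spans enters the policy update.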

If this is right

  • Benchmark scores on math and science tasks rise steadily as training proceeds and average response length grows without external length bonuses.
  • The learned critic automatically down-weights repetitive response patterns, supplying more reliable advantage signals and reducing training variance.
  • Training reaches competitive or superior final performance after far fewer gradient steps than pipelines that incorporate additional regularization or curriculum stages.
  • Full reproducibility is obtained by releasing the exact training code, data, and intermediate model checkpoints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result implies that open implementations can reach or exceed the scaling curves of closed systems by removing rather than adding algorithmic components.
  • Because the method relies only on rule-based rewards, it can be ported to new domains whose correctness can be checked programmatically without needing learned reward models.
  • The observed critic behavior suggests that explicit anti-repetition penalties may be unnecessary once a value head is trained jointly with the policy.
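
One way to probe the last point empirically is to score responses by n-gram repetition and check that critic values fall as the rate rises. This is a hypothetical diagnostic, not something the paper specifies:

```python
def ngram_repetition_rate(tokens, n=4):
    """Hypothetical diagnostic (not from the paper): fraction of
    n-grams in a response that are duplicates.  A critic that
    devalues repetitive outputs should assign lower values to
    responses scoring high under a metric like this."""
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(grams)) / len(grams)
```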

Load-bearing premise

The rule-based rewards together with the specific GAE settings and complete absence of KL regularization will continue to produce stable, improving training when the same recipe is moved to new base models or substantially larger scales.

What would settle it

Applying the identical vanilla PPO recipe with the same reward functions and GAE parameters to a different base model such as Llama-3.1-70B and observing either flat or declining benchmark curves accompanied by unstable advantage estimates or excessive repetition.

Original abstract

We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency, requiring only 1/10 of the training steps compared to the DeepSeek-R1-Zero pipeline. Moreover, our analysis not only covers training dynamics and ablation for critical design choices, but also quantitatively shows how the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns, yielding more robust advantage estimations and enhancing training stability. Embracing the principles of open-source, we release our source code, training data, and various model weights, fostering reproducibility and encouraging further exploration of the properties of related models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Open-Reasoner-Zero, an open-source RL pipeline for training reasoning on base LLMs. It claims that a minimalist vanilla PPO with GAE (λ=1, γ=1), rule-based rewards, and no KL regularization suffices to replicate scaling in benchmark performance and response length, achieving superior results on AIME2024, MATH500, and GPQA Diamond versus DeepSeek-R1-Zero-Qwen-32B on the identical Qwen2.5-32B base model while using only 1/10 the training steps. The work further analyzes training dynamics and shows that the learned critic devalues repetitive patterns for more stable advantage estimates.

Significance. If the empirical claims hold under scrutiny, the result would be significant for showing that complex reasoning scaling can emerge from a deliberately simple RL configuration without KL penalties or other regularizers, thereby lowering barriers to open research. The explicit release of source code, training data, and model weights is a clear strength that directly supports reproducibility and community follow-up.

major comments (3)
  1. [Experiments] Experiments section: the performance comparisons (superiority on AIME2024, MATH500, GPQA Diamond) are reported as single-point summaries without error bars, multiple random seeds, or statistical significance tests. This directly affects the load-bearing claim that the minimalist setup reliably outperforms the DeepSeek baseline.
  2. [Method] Method / PPO implementation: the description states 'vanilla PPO ... without any KL regularization,' yet the manuscript does not provide the exact hyper-parameter list, clipping thresholds, or code-level confirmation that no undisclosed filtering, advantage normalization, or auxiliary losses are present. This verification is required to substantiate the central minimalist claim.
  3. [Ablations] Ablation studies: while the abstract states that ablations for critical design choices are covered, the provided results lack complete quantitative tables (e.g., performance deltas when λ, γ, or reward components are altered). Full tables are needed to isolate which elements drive the reported scaling.
minor comments (2)
  1. [Figures] Figure captions for training curves should explicitly label axes, include legend entries for all compared runs, and state the number of seeds if applicable.
  2. [Method] The rule-based reward functions are described as 'straightforward' but would benefit from an explicit formula or pseudocode in the main text or appendix to allow exact replication.
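
For illustration, a reward of the kind the referee asks to see spelled out might look like the following. This is a hedged sketch; the paper's released code defines the actual functions:

```python
import re

def rule_based_reward(response: str, reference: str) -> float:
    """Illustrative rule-based reward, not the paper's exact code:
    1.0 if the last \\boxed{...} answer in the response matches the
    reference after whitespace normalization, else 0.0.  No learned
    reward model is involved."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0  # no parseable final answer
    predicted = matches[-1].strip().replace(" ", "")
    return 1.0 if predicted == reference.strip().replace(" ", "") else 0.0
```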

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, with revisions made where the manuscript required strengthening to better support our claims.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the performance comparisons (superiority on AIME2024, MATH500, GPQA Diamond) are reported as single-point summaries without error bars, multiple random seeds, or statistical significance tests. This directly affects the load-bearing claim that the minimalist setup reliably outperforms the DeepSeek baseline.

    Authors: We agree that error bars, multiple seeds, and statistical tests would strengthen the empirical claims. The reported gains are large in magnitude and consistent across benchmarks, but the single-run results, a consequence of the high compute cost of 32B-scale RL, do limit statistical robustness. We have revised the Experiments section to explicitly note this limitation, to emphasize the open-source code release for independent multi-seed verification, and to report available training-variance statistics from logs. No new multi-seed runs were added in this revision. revision: partial

  2. Referee: [Method] Method / PPO implementation: the description states 'vanilla PPO ... without any KL regularization,' yet the manuscript does not provide the exact hyper-parameter list, clipping thresholds, or code-level confirmation that no undisclosed filtering, advantage normalization, or auxiliary losses are present. This verification is required to substantiate the central minimalist claim.

    Authors: We thank the referee for this clarification request. To fully substantiate the minimalist claim, we have added a complete hyperparameter table to the Method section listing all PPO settings (including clipping threshold ε=0.2), GAE parameters, and explicit confirmation of no KL penalty, undisclosed filtering, non-standard advantage normalization, or auxiliary losses. The released source code provides line-by-line verification of the implementation. revision: yes

  3. Referee: [Ablations] Ablation studies: while the abstract states that ablations for critical design choices are covered, the provided results lack complete quantitative tables (e.g., performance deltas when λ, γ, or reward components are altered). Full tables are needed to isolate which elements drive the reported scaling.

    Authors: We agree that more detailed quantitative ablation tables are needed. We have expanded the Ablations section with full tables reporting performance metrics and deltas for changes in λ, γ, and reward components. These tables directly isolate the contributions of each element and support the analysis of the critic's devaluation of repetitive patterns. revision: yes
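
The objective the rebuttal pins down, the standard clipped PPO surrogate with ε=0.2 and no KL term, can be sketched in plain Python over per-token lists (for illustration only, not the released implementation):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate with no KL penalty, matching the
    minimalist recipe (eps=0.2 as stated in the rebuttal).
    Returns the scalar loss to minimize: the negative clipped
    surrogate, averaged over tokens."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        unclipped = ratio * adv
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * adv
        total += min(unclipped, clipped)
    return -total / len(advantages)
```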

Circularity Check

0 steps flagged

Empirical RL implementation study with no circular derivation chain

Full rationale

The paper reports experimental results from training Qwen2.5-32B with vanilla PPO (GAE λ=1, γ=1, no KL term, rule-based rewards) and measures outcomes on external benchmarks (AIME2024, MATH500, GPQA Diamond). Performance gains and scaling behavior are shown via direct comparison to DeepSeek-R1-Zero and ablations; no equations or predictions reduce by construction to fitted parameters inside the paper, and no load-bearing step relies on self-citation or ansatz smuggling. The derivation chain is the training run itself, which is externally verifiable via released code and data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard RL assumptions and empirical outcomes rather than new theoretical constructs or fitted parameters beyond the explicitly fixed GAE values.

axioms (1)
  • standard math Standard PPO and GAE assumptions hold for policy optimization on language model outputs
    The paper invokes vanilla PPO with GAE (λ=1, γ=1) without deriving or proving these settings from first principles.

pith-pipeline@v0.9.0 · 5548 in / 1349 out tokens · 54358 ms · 2026-05-12T21:53:07.813295+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.

  3. CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

  4. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  5. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  6. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  7. Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

    cs.LG 2026-05 unverdicted novelty 7.0

    Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...

  8. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  9. Hölder Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.

  10. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  11. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

  12. Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

    cs.AI 2026-05 unverdicted novelty 6.0

    Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.

  13. Internalizing Safety Understanding in Large Reasoning Models via Verification

    cs.AI 2026-05 unverdicted novelty 6.0

    Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment tha...

  14. Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

  15. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  16. HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

    cs.LG 2026-05 unverdicted novelty 6.0

    HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...

  17. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

    cs.AI 2026-05 unverdicted novelty 6.0

    CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...

  18. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  19. Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

    cs.LG 2026-04 unverdicted novelty 6.0

    DRRO for RLHF replaces worst-case value with worst-case regret in Wasserstein DRO, producing an exact water-filling solution under l1 ambiguity and a practical sampled-bonus algorithm that reduces proxy over-optimization.

  20. From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space

    cs.IR 2026-04 unverdicted novelty 6.0

    GloRank reformulates list-wise reranking as token generation over a global item identifier space, using supervised pre-training followed by reinforcement learning to maximize list-wise utility and outperforming baseli...

  21. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  22. From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

    cs.LG 2026-04 unverdicted novelty 6.0

    PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...

  23. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  24. Can LLMs Learn to Reason Robustly under Noisy Supervision?

    cs.LG 2026-04 conditional novelty 6.0

    Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...

  25. EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...

  26. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  27. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

  28. VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    cs.AI 2025-04 unverdicted novelty 6.0

    VAPO achieves 60.4 on AIME 2024 with Qwen 32B, outperforming prior methods by over 10 points through targeted fixes for value bias, sequence length variation, and sparse rewards.

  29. On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

    cs.LG 2026-05 unverdicted novelty 5.0

    RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.

  30. OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 5.0

    OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.

  31. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 29 Pith papers · 6 internal anchors

  1. [1]

    Learning to reason with llms

    OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/, 2025

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

  4. [4]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  5. [5]

    Dapo: An open-source llm reinforcement learning system at scale, 2025

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  6. [6]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  7. [7]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  8. [8]

    Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog

  9. [9]

    Exploring the limit of outcome reward for learning mathematical reasoning, 2025

    Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, and Kai Chen. Exploring the limit of outcome reward for learning mathematical reasoning, 2025

  10. [10]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. Proceedings of the Neural Information Processing Systems T rack on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021 , 2021

  11. [11]

    Numinamath, 2024

    Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath, 2024

  12. [12]

    Tülu 3: Pushing Frontiers in Open Language Model Post-Training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajish...

  13. [13]

    Open r1: Evaluating llms on uncontaminated math competitions, February 2025

    Loubna Ben Allal, Lewis Tunstall, Anton Lozhkov, Elie Bakouch, Guilherme Penedo, and Gabriel Martín Blázquez Hynek Kydlicek. Open r1: Evaluating llms on uncontaminated math competitions, February 2025

  14. [14]

    Demystifying long chain-of-thought reasoning in llms

    Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025

  15. [15]

    Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, 2025

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, 2025

  16. [16]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-Like Training: A Critical Perspective. arXiv preprint arXiv:2503.20783, 2025

  17. [17]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild, 2025

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild, 2025

  18. [18]

    Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks, 2025

    Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, and Lin Yan. Vapo: Efficient and reliable rei...

  19. [19]

    Deepcoder: A fully open-source 14b coder at o3-mini level, 2025

    Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepcoder: A fully open-source 14b coder at o3-mini level, 2025. Notion Blog

  20. [20]

    Skywork open reasoner series

    Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680, 2025. Notion Blog

  21. [21]

    Open Thoughts

    OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025

  22. [22]

    Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset, 2025

    Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv preprint arXiv:2504.16891, 2025

  23. [23]

    Llama-nemotron: Efficient reasoning models, 2025

    Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Ziji...

  24. [24]

    Opencodereasoning: Advancing data distillation for competitive coding

    Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025

  25. [25]

    Codeforces

    Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Piqueres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces. https://huggingface.co/datasets/open-r1/codeforces, 2025

  26. [26]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
