Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-12 21:53 UTC · model grok-4.3
The pith
A minimalist vanilla PPO setup with rule-based rewards and no KL regularization scales reasoning performance and response length on base models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vanilla PPO with GAE (λ=1, γ=1) and rule-based rewards, without any KL regularization, is sufficient to scale both benchmark performance and response length on the base model, achieving superior results on AIME2024, MATH500, and GPQA Diamond with only 1/10 the training steps of the DeepSeek-R1-Zero pipeline on the identical Qwen2.5-32B base.
What carries the argument
The minimalist RL loop that applies vanilla PPO with GAE parameters fixed at λ=1 and γ=1 together with simple rule-based rewards and no KL term, allowing direct scaling of reasoning capability from the base model.
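The abstract does not reproduce the estimator itself, but the consequence of these GAE settings can be read off the textbook definition (Schulman et al., reference [3]); the reduction below is a reconstruction from that standard formula, not a transcription from the paper:

$$\hat{A}_t \;=\; \sum_{l=0}^{T-t-1} (\gamma\lambda)^l\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

With λ = 1 and γ = 1 the sum telescopes (taking V(s_T) = 0 at termination) to

$$\hat{A}_t \;=\; \sum_{l=t}^{T-1} r_l \;-\; V(s_t),$$

i.e., the undiscounted return of the rollout (for a terminal rule-based reward, simply that reward) minus the learned value baseline, which is the advantage signal the critic analysis later refers to.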
If this is right
- Benchmark scores on math and science tasks rise steadily as training proceeds and average response length grows without external length bonuses.
- The learned critic automatically down-weights repetitive response patterns, supplying more reliable advantage signals and reducing training variance.
- Training reaches competitive or superior final performance after far fewer gradient steps than pipelines that incorporate additional regularization or curriculum stages.
- Full reproducibility is obtained by releasing the exact training code, data, and intermediate model checkpoints.
Where Pith is reading between the lines
- The result implies that open implementations can reach or exceed the scaling curves of closed systems by removing rather than adding algorithmic components.
- Because the method relies only on rule-based rewards, it can be ported to new domains whose correctness can be checked programmatically without needing learned reward models.
- The observed critic behavior suggests that explicit anti-repetition penalties may be unnecessary once a value head is trained jointly with the policy.
Load-bearing premise
The rule-based rewards together with the specific GAE settings and complete absence of KL regularization will continue to produce stable, improving training when the same recipe is moved to new base models or substantially larger scales.
What would settle it
Applying the identical vanilla PPO recipe with the same reward functions and GAE parameters to a different base model such as Llama-3.1-70B and observing either flat or declining benchmark curves accompanied by unstable advantage estimates or excessive repetition.
Original abstract
We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency, requiring only 1/10 of the training steps compared to the DeepSeek-R1-Zero pipeline. Moreover, our analysis not only covers training dynamics and ablation for critical design choices, but also quantitatively shows how the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns, yielding more robust advantage estimations and enhancing training stability. Embracing the principles of open-source, we release our source code, training data, and various model weights, fostering reproducibility and encouraging further exploration of the properties of related models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Open-Reasoner-Zero, an open-source RL pipeline for training reasoning on base LLMs. It claims that a minimalist vanilla PPO with GAE (λ=1, γ=1), rule-based rewards, and no KL regularization suffices to replicate scaling in benchmark performance and response length, achieving superior results on AIME2024, MATH500, and GPQA Diamond versus DeepSeek-R1-Zero-Qwen-32B on the identical Qwen2.5-32B base model while using only 1/10 the training steps. The work further analyzes training dynamics and shows that the learned critic devalues repetitive patterns for more stable advantage estimates.
Significance. If the empirical claims hold under scrutiny, the result would be significant for showing that complex reasoning scaling can emerge from a deliberately simple RL configuration without KL penalties or other regularizers, thereby lowering barriers to open research. The explicit release of source code, training data, and model weights is a clear strength that directly supports reproducibility and community follow-up.
major comments (3)
- [Experiments] Experiments section: the performance comparisons (superiority on AIME2024, MATH500, GPQA Diamond) are reported as single-point summaries without error bars, multiple random seeds, or statistical significance tests. This directly affects the load-bearing claim that the minimalist setup reliably outperforms the DeepSeek baseline.
- [Method] Method / PPO implementation: the description states 'vanilla PPO ... without any KL regularization,' yet the manuscript does not provide the exact hyper-parameter list, clipping thresholds, or code-level confirmation that no undisclosed filtering, advantage normalization, or auxiliary losses are present. This verification is required to substantiate the central minimalist claim.
- [Ablations] Ablation studies: while the abstract states that ablations for critical design choices are covered, the provided results lack complete quantitative tables (e.g., performance deltas when λ, γ, or reward components are altered). Full tables are needed to isolate which elements drive the reported scaling.
minor comments (2)
- [Figures] Figure captions for training curves should explicitly label axes, include legend entries for all compared runs, and state the number of seeds if applicable.
- [Method] The rule-based reward functions are described as 'straightforward' but would benefit from an explicit formula or pseudocode in the main text or appendix to allow exact replication.
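To illustrate what such pseudocode could look like, here is a minimal hedged sketch of a binary, programmatically checkable reward; the \boxed{} extraction pattern and the normalization are assumptions for illustration, not the paper's released reward implementation:

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Binary correctness reward: 1.0 if the final boxed answer matches the
    reference after light normalization, else 0.0. No length or KL terms."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)  # assumed answer format
    if not boxed:
        return 0.0
    predicted = boxed[-1].strip().replace(" ", "").lower()
    reference = reference_answer.strip().replace(" ", "").lower()
    return 1.0 if predicted == reference else 0.0

# The reward is assigned once per rollout, at the terminal step.
print(rule_based_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(rule_based_reward("no boxed answer here", "42"))              # 0.0
```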
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below, with revisions made where the manuscript required strengthening to better support our claims.
Point-by-point responses
-
Referee: [Experiments] Experiments section: the performance comparisons (superiority on AIME2024, MATH500, GPQA Diamond) are reported as single-point summaries without error bars, multiple random seeds, or statistical significance tests. This directly affects the load-bearing claim that the minimalist setup reliably outperforms the DeepSeek baseline.
Authors: We agree that error bars, multiple seeds, and statistical tests would strengthen the empirical claims. The reported gains are large and consistent across benchmarks, but the results come from single runs owing to the high compute cost of 32B-scale RL, which limits their statistical robustness. We have revised the Experiments section to state this limitation explicitly, to emphasize that the open-source release enables independent multi-seed verification, and to report the training variance available from our logs. No new multi-seed runs were added in this revision. revision: partial
-
Referee: [Method] Method / PPO implementation: the description states 'vanilla PPO ... without any KL regularization,' yet the manuscript does not provide the exact hyper-parameter list, clipping thresholds, or code-level confirmation that no undisclosed filtering, advantage normalization, or auxiliary losses are present. This verification is required to substantiate the central minimalist claim.
Authors: We thank the referee for this clarification request. To fully substantiate the minimalist claim, we have added a complete hyperparameter table to the Method section listing all PPO settings (including clipping threshold ε=0.2), GAE parameters, and explicit confirmation of no KL penalty, undisclosed filtering, non-standard advantage normalization, or auxiliary losses. The released source code provides line-by-line verification of the implementation. revision: yes
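For readers cross-checking the minimalist claim against the released code, the loss described in this response would reduce to the clipped surrogate alone; the PyTorch-style sketch below assumes ε = 0.2 from the rebuttal and token-level masking, and is an illustration rather than an excerpt from the repository:

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,    # (batch, seq) log-probs under current policy
                  logp_old: torch.Tensor,    # (batch, seq) log-probs from rollout policy
                  advantages: torch.Tensor,  # (batch, seq) GAE advantages (lambda=1, gamma=1)
                  mask: torch.Tensor,        # (batch, seq) 1 for response tokens, 0 for padding
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective only: no KL penalty, no entropy bonus,
    no auxiliary losses. Returns the scalar loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = -torch.minimum(unclipped, clipped)
    return (per_token * mask).sum() / mask.sum()
```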
-
Referee: [Ablations] Ablation studies: while the abstract states that ablations for critical design choices are covered, the provided results lack complete quantitative tables (e.g., performance deltas when λ, γ, or reward components are altered). Full tables are needed to isolate which elements drive the reported scaling.
Authors: We agree that more detailed quantitative ablation tables are needed. We have expanded the Ablations section with full tables reporting performance metrics and deltas for changes in λ, γ, and reward components. These tables directly isolate the contributions of each element and support the analysis of the critic's devaluation of repetitive patterns. revision: yes
Circularity Check
Empirical RL implementation study with no circular derivation chain
full rationale
The paper reports experimental results from training Qwen2.5-32B with vanilla PPO (GAE λ=1, γ=1, no KL term, rule-based rewards) and measures outcomes on external benchmarks (AIME2024, MATH500, GPQA Diamond). Performance gains and scaling behavior are shown via direct comparison to DeepSeek-R1-Zero and ablations; no equations or predictions reduce by construction to fitted parameters inside the paper, and no load-bearing step relies on self-citation or ansatz smuggling. The derivation chain is the training run itself, which is externally verifiable via released code and data.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard PPO and GAE assumptions hold for policy optimization on language model outputs
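A small self-contained check of that assumption in the λ = γ = 1 regime (illustrative reward and value numbers, not figures from the paper) shows the GAE recursion collapsing to total reward minus the critic's baseline:

```python
def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized advantage estimation; values has one bootstrap entry more
    than rewards (V(s_T) = 0 for a terminated rollout)."""
    advantages, running = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages.append(running)
    return advantages[::-1]

# Terminal rule-based reward of 1.0, zero elsewhere; illustrative critic values.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.6, 0.5, 0.7, 0.8, 0.0]
adv = gae(rewards, values, gamma=1.0, lam=1.0)
# With gamma = lam = 1, each A_t equals (sum of remaining rewards) - V(s_t):
assert all(abs(a - (sum(rewards[t:]) - values[t])) < 1e-9 for t, a in enumerate(adv))
print(adv)  # [0.4, 0.5, 0.3, 0.2]
```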
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear — "a minimalist approach, vanilla PPO with GAE (λ=1, γ=1) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length"
-
Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear — "Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond"
-
Foundation.LawOfExistence · defect_zero_iff_one · unclear — "the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns"
Forward citations
Cited by 31 Pith papers
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
-
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
-
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.
-
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
H\"older Policy Optimisation
HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
-
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
-
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
-
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
-
Internalizing Safety Understanding in Large Reasoning Models via Verification
Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment tha...
-
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
-
AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
-
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
-
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
-
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...
-
Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback
DRRO for RLHF replaces worst-case value with worst-case regret in Wasserstein DRO, producing an exact water-filling solution under l1 ambiguity and a practical sampled-bonus algorithm that reduces proxy over-optimization.
-
From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space
GloRank reformulates list-wise reranking as token generation over a global item identifier space, using supervised pre-training followed by reinforcement learning to maximize list-wise utility and outperforming baseli...
-
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
-
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
-
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...
-
EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
-
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
VAPO achieves 60.4 on AIME 2024 with Qwen 32B, outperforming prior methods by over 10 points through targeted fixes for value bias, sequence length variation, and sparse rewards.
-
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
RLVR exhibits implicit reward overfitting to training data and optimizes heavy-tailed singular spectra with rank-1 focus on reasoning capability.
-
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Reference graph
Works this paper leans on
-
[1]
OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/, 2025
work page 2025
-
[2]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[4]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[5]
Dapo: An open-source llm reinforcement learning system at scale, 2025
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
work page 2025
-
[6]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022
work page 2022
-
[7]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025
Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog
work page 2025
-
[9]
Exploring the limit of outcome reward for learning mathematical reasoning, 2025
Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, and Kai Chen. Exploring the limit of outcome reward for learning mathematical reasoning, 2025
work page 2025
-
[10]
Measuring mathematical problem solving with the math dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, 2021
work page 2021
-
[11]
Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath, 2024
work page 2024
-
[12]
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajish...
work page 2025
-
[13]
Open r1: Evaluating llms on uncontaminated math competitions, February 2025
Loubna Ben Allal, Lewis Tunstall, Anton Lozhkov, Elie Bakouch, Guilherme Penedo, Gabriel Martín Blázquez, and Hynek Kydlíček. Open r1: Evaluating llms on uncontaminated math competitions, February 2025
work page 2025
-
[14]
Demystifying long chain-of-thought reasoning in llms
Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025
-
[15]
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, 2025
Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning, 2025
work page 2025
-
[16]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-Like Training: A Critical Perspective. arXiv preprint arXiv:2503.20783, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild, 2025
work page 2025
-
[18]
Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks, 2025
Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, and Lin Yan. Vapo: Efficient and reliable rei...
work page 2025
-
[19]
Deepcoder: A fully open-source 14b coder at o3-mini level, 2025
Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepcoder: A fully open-source 14b coder at o3-mini level, 2025. Notion Blog
work page 2025
-
[20]
Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. Skywork open reasoner series. https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680, 2025. Notion Blog
work page 2025
-
[21]
OpenThoughts Team. Open Thoughts. https://open-thoughts.ai, January 2025
work page 2025
-
[22]
Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv preprint arXiv:2504.16891, 2025
-
[23]
Llama-nemotron: Efficient reasoning models, 2025
Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Ziji...
work page 2025
-
[24]
Opencodereasoning: Advancing data distillation for competitive coding
Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025
-
[25]
Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Piqueres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces. https://huggingface.co/datasets/open-r1/codeforces, 2025
work page 2025
-
[26]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
with no bias term. The policy and critic do not share weights during training. For both policy and critic networks, we employ the AdamW optimizer with β = [0.9, 0.95] and no weight decay. The learning rates are set to 1 × 10⁻⁶ and 5 × 10⁻⁶ for the policy and critic networks, respectively. The learning rate schedulers are both constant learning rate with line...