VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 09:30 UTC · model grok-4.3
The pith
VAPO reaches 60.4 on AIME 2024 by fixing value bias, variable lengths, and sparse rewards in RL for reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VAPO provides an integrated solution to value model bias, heterogeneous sequence lengths, and reward sparsity in long-CoT reasoning. Built on the Qwen 32B model, the framework attains a score of 60.4 on the AIME 2024 dataset, outperforming prior reported results for DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points under identical settings. It reaches this performance in only 5,000 training steps and maintains stability, with no crashes across multiple independent runs.
What carries the argument
The VAPO framework itself: it augments proximal policy optimization with value-based components designed to mitigate value-model bias, sequence-length heterogeneity, and reward sparsity during reasoning-model training.
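To make the value-side machinery concrete, here is a minimal sketch of generalized advantage estimation (GAE, Schulman et al. [20]) with a sequence-length-dependent λ. The schedule in `length_adaptive_lambda` and the constant `alpha` are illustrative assumptions, not the paper's published formulas; the abstract does not spell out VAPO's exact augmentations.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.

    rewards: per-token rewards, length T (sparse: often zero except at the end)
    values:  value estimates V(s_0) .. V(s_T), length T + 1 (last entry bootstraps)
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def length_adaptive_lambda(seq_len, alpha=0.05):
    # Hypothetical schedule: longer responses push lambda toward 1 so the
    # estimator relies less on a possibly biased value model; clamped at 0.
    return max(0.0, 1.0 - 1.0 / (alpha * seq_len))
```

Under this reading, a 4,000-token chain of thought would use λ ≈ 0.995 while a 100-token answer would use λ = 0.8, which is one plausible way to treat heterogeneous lengths differently.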
Where Pith is reading between the lines
- The same fixes for bias and sparsity could be tested on non-math reasoning domains such as code generation or scientific question answering.
- If stability scales with model size, value-based methods might become the default for long-horizon language-model training where crashes currently waste compute.
- The 5,000-step convergence suggests future experiments could measure wall-clock time or total tokens processed to quantify efficiency gains beyond step count (a rough token-budget sketch follows below).
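On that last point, a back-of-the-envelope token budget is enough to move from step counts to a hardware-comparable number. The batch size and length values below are placeholders, since the abstract reports only the 5,000-step figure.

```python
def training_token_budget(steps, batch_size, mean_prompt_len, mean_response_len):
    """Rough estimate of tokens processed during an RL run."""
    tokens_per_step = batch_size * (mean_prompt_len + mean_response_len)
    return steps * tokens_per_step

# Illustrative numbers only; none of these are reported in the abstract.
total = training_token_budget(steps=5_000, batch_size=512,
                              mean_prompt_len=512, mean_response_len=4_096)
print(f"{total:,} tokens")  # 11,796,480,000 with these placeholder values
```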
Load-bearing premise
The performance and stability gains come from the specific VAPO design choices rather than unreported differences in data, hyperparameters, model initialization, or evaluation protocols.
What would settle it
Reproduce the AIME 2024 experiments using identical training data, hyperparameters, model initialization, and evaluation code to verify whether the 10-point margin and zero-crash stability still appear.
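A reproduction would also need a pinned evaluation harness. The sketch below assumes an avg@k protocol over the AIME 2024 problems; the sampling temperature, the value of k, and the answer-extraction rule are assumptions, since the abstract does not state the scoring procedure.

```python
from statistics import mean

def aime_accuracy(problems, generate, is_correct, k=32):
    """Hypothetical avg@k scoring: sample k solutions per problem and
    average exact-match correctness, reported on a 0-100 scale."""
    per_problem = []
    for prob in problems:
        hits = sum(bool(is_correct(prob, generate(prob))) for _ in range(k))
        per_problem.append(hits / k)
    return 100.0 * mean(per_problem)
```

Holding `problems`, `k`, and `is_correct` fixed across VAPO and the baselines is what "identical experimental settings" would have to mean on the evaluation side.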
Original abstract
We present VAPO, Value-based Augmented Proximal Policy Optimization framework for reasoning models, a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency. It reaches state-of-the-art performance within a mere 5,000 steps. Moreover, across multiple independent runs, no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.
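For orientation, the clipped surrogate objective that PPO-style methods optimize (Schulman et al. [21]) is reproduced below; VAPO's value-based augmentations sit on top of this standard form, whose details the abstract does not specify.

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
$$

where $\hat{A}_t$ is an advantage estimate (for instance from GAE) and the value model is trained jointly by regression against empirical returns.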
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VAPO, a value-based augmented proximal policy optimization framework for long chain-of-thought reasoning in large language models. Built on Qwen-32B, it reports a state-of-the-art score of 60.4 on AIME 2024, outperforming DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points under identical experimental settings. The work highlights three challenges in value-based RL (value model bias, heterogeneous sequence lengths, reward sparsity) and claims an integrated design that yields stable, efficient training reaching SOTA performance in 5,000 steps with no crashes across multiple runs.
Significance. If the reported gains and stability are reproducible under truly matched conditions, the result would be significant for reliable RL-based reasoning, as it directly targets load-bearing issues in value-based methods for long-CoT tasks and demonstrates practical efficiency on a 32B model.
Major comments (2)
- [Abstract] The central claim that VAPO outperforms DeepSeek-R1-Zero-Qwen-32B and DAPO by >10 points 'under identical experimental settings' is load-bearing for the performance contribution, yet the manuscript provides no side-by-side hyperparameter table, data-mixture citation, or reproduction protocol that verifies exact matching of initialization, optimizer schedule, reward shaping, length filtering, and evaluation protocol with the cited baselines.
- [Methods] (implied by the abstract's description) The integrated solutions for value-model bias, heterogeneous lengths, and reward sparsity are presented at a high level without quantitative ablations or controlled experiments isolating each component's contribution to the 60.4 score and crash-free training; this weakens attribution of the stability and efficiency gains specifically to VAPO design choices.
Minor comments (1)
- [Abstract] The phrasing 'Benchmarked the AIME 2024 dataset' is grammatically incomplete and should be revised for clarity.
Simulated Author's Rebuttal
Thank you for your thorough review of our manuscript. We appreciate the feedback on clarifying the experimental comparisons and providing more detailed ablations. We address each major comment below.
Point-by-point responses
- Referee: [Abstract] The central claim that VAPO outperforms DeepSeek-R1-Zero-Qwen-32B and DAPO by >10 points 'under identical experimental settings' is load-bearing for the performance contribution, yet the manuscript provides no side-by-side hyperparameter table, data-mixture citation, or reproduction protocol that verifies exact matching of initialization, optimizer schedule, reward shaping, length filtering, and evaluation protocol with the cited baselines.
  Authors: We acknowledge the importance of verifiable identical settings. The comparisons were conducted by strictly following the hyperparameter settings, data mixtures, and protocols described in the original DeepSeek-R1-Zero and DAPO papers. To address this, we will include a detailed side-by-side hyperparameter table in the revised manuscript, citing specific sections from the baseline papers for initialization, optimizer schedule, reward shaping, length filtering, and evaluation. Additionally, we plan to release our full training code and scripts to facilitate exact reproduction. Revision planned: yes.
- Referee: [Methods] (implied by the abstract's description) The integrated solutions for value-model bias, heterogeneous lengths, and reward sparsity are presented at a high level without quantitative ablations or controlled experiments isolating each component's contribution to the 60.4 score and crash-free training; this weakens attribution of the stability and efficiency gains specifically to VAPO design choices.
  Authors: We agree that quantitative ablations would better isolate the contributions of each component. In the revised manuscript, we will add a new section with controlled ablation experiments. These will include variants that disable the value bias correction, the heterogeneous length handling, and the reward sparsity mitigation one at a time, reporting the resulting performance on AIME 2024 and training stability metrics (e.g., crash rates and convergence speed). This will provide direct evidence for the impact of each design choice. Revision planned: yes.
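The leave-one-out design described in this response is easy to pin down as a configuration grid. The sketch below is one way to enumerate it; the flag names are hypothetical labels for the three components named in the rebuttal, not identifiers from the paper's codebase.

```python
# Hypothetical leave-one-out ablation grid over the three components
# named in the rebuttal; flag names are illustrative placeholders.
COMPONENTS = (
    "value_bias_correction",
    "heterogeneous_length_handling",
    "reward_sparsity_mitigation",
)

def ablation_configs():
    """Full configuration plus one run per disabled component."""
    base = {c: True for c in COMPONENTS}
    runs = [("full", dict(base))]
    for c in COMPONENTS:
        cfg = dict(base)
        cfg[c] = False
        runs.append((f"no_{c}", cfg))
    return runs

for name, cfg in ablation_configs():
    print(name, cfg)  # four runs: full, then each component ablated once
```

Each run would then report AIME 2024 accuracy, crash count across seeds, and steps to convergence, which is what the referee asks to see attributed per component.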
Circularity Check
No circularity; empirical benchmark results with independent design claims
Full rationale
The paper presents VAPO as an integrated RL framework addressing value-model bias, heterogeneous lengths, and reward sparsity in long-CoT reasoning. Central claims consist of empirical AIME 2024 benchmarks (60.4 score, >10-point gains over DeepSeek-R1-Zero-Qwen-32B and DAPO under identical settings, stable 5k-step training with no crashes). No equations, predictions, or first-principles derivations are shown that reduce by construction to fitted inputs, self-citations, or ansatzes. The work is self-contained as an empirical contribution; design choices are described as systematic without load-bearing self-referential loops or renamed known results.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tagged: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of 60.4. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 29 Pith papers
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
- AIS: Adaptive Importance Sampling for Quantized RL
  AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
- Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
  Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...
- The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
  The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
- Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
  POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
- Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
  Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...
- Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
  GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
- User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation
  SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.
- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
  A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...
- Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
  Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetr...
- Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
  Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
  Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
- Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
  OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
- AIPO: Learning to Reason from Active Interaction
  AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
- HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
  HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- Gradient Extrapolation-Based Policy Optimization
  GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...
- Segment-Aligned Policy Optimization for Multi-Modal Reasoning
  SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
- Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
  LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
- V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
  V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
- GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
  GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
  HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
- ToolRL: Reward is All Tool Learning Needs
  A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
  UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
- Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR
  Adaptive scheduling of penalties over training time plus confidence-based weighting of mistakes improves LLM performance on math reasoning benchmarks compared to fixed-penalty negative reinforcement.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models
  This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models
  The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
Reference graph
Works this paper leans on
- [1] Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
  Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs, 2024. URL https://arxiv.org/abs/2402.14740.
- [2] Anthropic. Claude 3.5 Sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.
- [3] Language models are few-shot learners
  Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [4] PaLM: Scaling language modeling with pathways
  Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
- [5] Gemini 2.0 Flash Thinking, 2024
  Google DeepMind. Gemini 2.0 Flash Thinking, 2024. URL https://deepmind.google/technologies/gemini/flash-thinking/.
- [6] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
- [7] Ron Good and Harold J. Fletcher. Reporting explained variance. Journal of Research in Science Teaching, 18(1):1–7, 1981. doi: https://doi.org/10.1002/tea.3660180102. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/tea.3660180102.
- [8] REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization
  Jian Hu. REINFORCE++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025.
- [9] Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
  Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model, 2025. URL https://arxiv.org/abs/2503.24290.
- [10] Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free! In Deep Reinforcement Learning Meets Structured Prediction, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=r1lgTGL5DE.
- [11] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [12] Understanding R1-Zero-Like Training: A Critical Perspective
  Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783.
- [13] Real: Efficient RLHF training of large language models with parameter reallocation
  Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Real: Efficient RLHF training of large language models with parameter reallocation. In Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025. mlsys.org, 2025.
- [14] Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3878–3887. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/oh18b.html.
- [15] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [16] Learning to reason with LLMs, 2024
  OpenAI. Learning to reason with LLMs, 2024. URL https://openai.com/index/learning-to-reason-with-llms/.
- [17] Training language models to follow instructions with human feedback
  Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [18] Training language models to follow instructions with human feedback
  Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [19] QwQ-32B: Embracing the power of reinforcement learning, 2024
  Qwen. QwQ-32B: Embracing the power of reinforcement learning, 2024. URL https://qwenlm.github.io/blog/qwq-32b/.
- [20] High-Dimensional Continuous Control Using Generalized Advantage Estimation
  John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
- [21] Proximal Policy Optimization Algorithms
  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [22] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
  Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Yu Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [23] Exploring data scaling trends and effects in reinforcement learning from human feedback
  Wei Shen, Guanlin Liu, Zheng Wu, Ruofei Zhu, Qingping Yang, Chao Xin, Yu Yue, and Lin Yan. Exploring data scaling trends and effects in reinforcement learning from human feedback. arXiv preprint arXiv:2503.22230, 2025.
- [24] Richard S Sutton, Andrew G Barto, et al. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
- [25] Gemini: A Family of Highly Capable Multimodal Models
  Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [26] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.
- [27] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural...
- [28] Grok 3 beta — the age of reasoning agents, 2024
  XAI. Grok 3 beta — the age of reasoning agents, 2024. URL https://x.ai/news/grok-3.
- [29] DAPO: An Open-Source LLM Reinforcement Learning System at Scale
  Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
- [30] What's behind PPO's collapse in long-CoT? Value optimization holds the secret, 2025
  Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What's behind PPO's collapse in long-CoT? Value optimization holds the secret, 2025. URL https://arxiv.org/abs/2503.01491.