arxiv: 2504.14945 · v5 · submitted 2025-04-21 · 💻 cs.LG · cs.AI· cs.CL

Recognition: 2 theorem links

· Lean Theorem

Learning to Reason under Off-Policy Guidance

Jianhao Yan , Yafu Li , Zican Hu , Zhi Wang , Ganqu Cui , Xiaoye Qu , Yu Cheng , Yue Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 23:13 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords off-policy RLVRreasoning modelsMixed-Policy GRPOimportance samplingmathematical benchmarksweak model trainingout-of-distribution generalization

0 comments

The pith

LUFFY mixes off-policy reasoning traces with on-policy rollouts to overcome the limits of standard RLVR in training reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LUFFY as a framework that extends reinforcement learning with verifiable rewards by incorporating off-policy reasoning traces instead of restricting learning to a model's own outputs. This addresses the core restriction in existing RLVR methods that prevents acquisition of reasoning abilities beyond initial model capabilities. LUFFY balances imitation and exploration through a Mixed-Policy GRPO setup augmented by regularized importance sampling to avoid rigid copying. Results show average gains exceeding 6.4 points on six math benchmarks and 6.2 points on out-of-distribution tasks, with the method succeeding on weak models where pure on-policy RLVR fails entirely.

Core claim

LUFFY augments RLVR by dynamically combining off-policy reasoning traces with on-policy rollouts using the Mixed-Policy GRPO framework and policy shaping via regularized importance sampling. This yields average performance gains exceeding 6.4 points across six math benchmarks and over 6.2 points on out-of-distribution tasks. Most notably, it succeeds in training weak models in cases where on-policy RLVR fails entirely.

What carries the argument

Mixed-Policy GRPO combined with regularized importance sampling, which balances off-policy imitation and on-policy exploration while preventing superficial imitation.

If this is right

Models can acquire reasoning skills beyond their initial capabilities through off-policy guidance.
Training succeeds for weak models where pure on-policy methods fail.
Performance improves by over 6.4 points on average for math reasoning benchmarks.
Out-of-distribution generalization gains exceed 6.2 points.
The framework provides a theoretically guaranteed convergence rate via Mixed-Policy GRPO.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Off-policy data from stronger models could be used to bootstrap capabilities in smaller or weaker models more broadly.
This approach might extend to other verifiable reward domains such as code generation or scientific problem solving.
Regularized importance sampling could be adapted to other RL settings involving mixed policies to avoid distribution shift.
The proportion of off-policy traces could be scaled dynamically based on model strength in future training pipelines.

Load-bearing premise

Off-policy reasoning traces can be mixed with on-policy rollouts via regularized importance sampling without introducing harmful distribution shift or causing superficial imitation.

What would settle it

An experiment showing that removing the off-policy component or the regularization leads to performance no better than standard on-policy RLVR on the same math benchmarks, or failure to train weak models.

read the original abstract

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards~(\textit{RLVR}). However, existing \textit{RLVR} approaches are inherently ``on-policy'', limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce \textbf{LUFFY} (\textbf{L}earning to reason \textbf{U}nder o\textbf{FF}-polic\textbf{Y} guidance), a framework that augments \textit{RLVR} with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, alongside policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an over \textbf{+6.4} average gain across six math benchmarks and an advantage of over \textbf{+6.2} points in out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LUFFY adds off-policy traces to RLVR via mixed GRPO and regularized importance sampling, with reported benchmark gains, but the abstract leaves the mechanism and controls underspecified.

read the letter

The core contribution is a framework that mixes off-policy reasoning traces into RLVR training. It uses Mixed-Policy GRPO plus regularized importance sampling to let weaker models learn from stronger demonstrations without collapsing into pure imitation. That addresses a real limit of standard on-policy RLVR, which the abstract correctly notes cannot exceed the model's initial capability ceiling. The reported numbers—an average +6.4 gain across six math benchmarks and +6.2 on out-of-distribution tasks, plus successful training of weak models where on-policy RLVR fails—are the clearest empirical hook. If those hold after controls, the idea is worth attention for anyone scaling reasoning models. The theoretical claim of guaranteed convergence for the mixed policy is also stated cleanly. The main weaknesses are the missing details. The abstract does not describe how the off-policy traces were generated, what the regularization term actually does in the loss, or any ablation that isolates its effect. There are no reported KL or divergence diagnostics between the learned policy and the behavior policy, so the stress-test concern about superficial imitation cannot be ruled out from the given text. Statistical significance and experimental controls are also absent. This leaves the central claim—that the mixing mechanism, not just better data, drives the gains—hard to evaluate. The paper is aimed at researchers working on RLVR and reasoning model training. Anyone following that line would find the direction useful to discuss, even if the current write-up needs more evidence. It should go to peer review so the methods and results sections can be checked directly.

Referee Report

3 major / 2 minor

Summary. The paper introduces LUFFY, a framework that augments on-policy RLVR with off-policy reasoning traces generated by stronger models. It combines Mixed-Policy GRPO (with a claimed theoretical convergence guarantee) and regularized importance sampling to balance imitation and exploration, reporting an average gain of over +6.4 points across six math benchmarks, +6.2 points on out-of-distribution tasks, and successful training of weak models in regimes where pure on-policy RLVR fails.

Significance. If the empirical results are robust, the work would be significant for RLVR research by demonstrating a practical way to escape the capability ceiling of on-policy methods through controlled off-policy guidance. The explicit convergence claim for the mixed-policy objective is a positive feature that distinguishes it from purely heuristic mixing approaches.

major comments (3)

[Abstract and §4] Abstract and §4 (Experiments): the headline +6.4 average gain and the claim that LUFFY succeeds where on-policy RLVR “completely fails” are presented without reported statistical significance, number of seeds, or details on how the off-policy traces were generated and filtered; these omissions make it impossible to determine whether the gains are attributable to the proposed mixing mechanism rather than simply higher-quality demonstration data.
[§3.2] §3.2 (Mixed-Policy GRPO with Regularized Importance Sampling): the paper states that regularized importance sampling prevents superficial imitation, yet no policy-divergence diagnostics (KL divergence, total variation, or effective sample size) are reported between the learned policy and the off-policy behavior policy; without these, the central claim that the regularization successfully balances imitation and exploration remains unverified.
[§4] §4 (Ablations): no ablation isolates the contribution of the regularization term in the importance-sampling objective; the reported gains could be explained by the off-policy data alone, undermining the load-bearing assertion that the specific mixed-policy formulation is what enables training of weak models.

minor comments (2)

[Abstract] The acronym expansion in the title and abstract (“oFF-policY”) contains inconsistent capitalization that should be standardized.
[§3] Notation for the importance-sampling weights and regularization coefficient is introduced without an explicit table of symbols, making the equations in §3 harder to follow on first reading.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the manuscript accordingly to improve reproducibility and strengthen the empirical support for our claims.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline +6.4 average gain and the claim that LUFFY succeeds where on-policy RLVR “completely fails” are presented without reported statistical significance, number of seeds, or details on how the off-policy traces were generated and filtered; these omissions make it impossible to determine whether the gains are attributable to the proposed mixing mechanism rather than simply higher-quality demonstration data.

Authors: We agree that statistical significance, seed counts, and generation/filtering details are necessary for reproducibility and to attribute gains to the mixing mechanism. In the revision we will report results over at least three random seeds with standard deviations, include p-values for key comparisons, and add explicit details on off-policy trace generation (stronger models with fixed prompts) and filtering (reward-threshold selection). These additions will clarify that performance improvements stem from the mixed-policy formulation rather than data quality alone. revision: yes
Referee: [§3.2] §3.2 (Mixed-Policy GRPO with Regularized Importance Sampling): the paper states that regularized importance sampling prevents superficial imitation, yet no policy-divergence diagnostics (KL divergence, total variation, or effective sample size) are reported between the learned policy and the off-policy behavior policy; without these, the central claim that the regularization successfully balances imitation and exploration remains unverified.

Authors: We acknowledge the value of direct diagnostics. The revised manuscript will include KL divergence, total variation distance, and effective sample size measurements between the learned policy and the off-policy behavior policy, presented in §3.2. These metrics will empirically verify that regularization prevents collapse to superficial imitation while preserving exploration. revision: yes
Referee: [§4] §4 (Ablations): no ablation isolates the contribution of the regularization term in the importance-sampling objective; the reported gains could be explained by the off-policy data alone, undermining the load-bearing assertion that the specific mixed-policy formulation is what enables training of weak models.

Authors: We agree that an ablation isolating the regularization term is required. We will add to §4 a direct comparison of Mixed-Policy GRPO with and without the regularization term in the importance-sampling objective. The new results will show degraded performance and increased imitation without regularization, confirming its role in enabling training of weak models. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents LUFFY as a practical augmentation of RLVR that mixes off-policy traces with on-policy rollouts via Mixed-Policy GRPO and regularized importance sampling. Reported gains (+6.4 average, +6.2 OOD) are framed as experimental outcomes on math benchmarks rather than predictions derived from fitted parameters or self-referential definitions. The stated theoretical convergence rate is attributed to the Mixed-Policy GRPO component without any equation in the provided text reducing the core claims to tautology or input renaming. No load-bearing step collapses by construction to its own inputs, self-citation chain, or ansatz smuggling. The framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The framework name LUFFY and the two algorithmic pieces (Mixed-Policy GRPO, regularized importance sampling) are introduced as the contribution.

pith-pipeline@v0.9.0 · 5578 in / 1165 out tokens · 16593 ms · 2026-05-15T23:13:37.696664+00:00 · methodology

discussion (0)

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
cs.LG 2026-05 unverdicted novelty 7.0

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
cs.LG 2026-05 unverdicted novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
Near-Future Policy Optimization
cs.LG 2026-04 unverdicted novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...
Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration
cs.LG 2026-04 unverdicted novelty 7.0

NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
Teacher-Guided Policy Optimization for LLM Distillation
cs.LG 2026-05 unverdicted novelty 6.0

TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
cs.CL 2026-04 unverdicted novelty 6.0

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
cs.LG 2026-04 unverdicted novelty 6.0

PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training
cs.DC 2026-04 unverdicted novelty 6.0

TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.
Can LLMs Learn to Reason Robustly under Noisy Supervision?
cs.LG 2026-04 conditional novelty 6.0

Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
cs.LG 2026-03 unverdicted novelty 6.0

HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 5.0

OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
EasyVideoR1: Easier RL for Video Understanding
cs.CV 2026-04 unverdicted novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
A Survey of On-Policy Distillation for Large Language Models
cs.LG 2026-04 unverdicted novelty 2.0

On-policy distillation reframes LLM knowledge transfer as iterative correction on student trajectories rather than single-pass imitation, with the survey organizing the field along divergence design, feedback sources,...

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 18 Pith papers · 17 internal anchors

[1]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Chi, Quoc V Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

work page 2022
[5]

7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient

Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, and Junxian He. 7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient. https://hkust-nlp.notion.site/simplerl-reason, 2025. Notion Blog

work page 2025
[6]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

work page 2025
[8]

Echo chamber: Rl post-training amplifies behaviors learned in pretraining, 2025

Rosie Zhao, Alexandru Meterez, Sham Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: Rl post-training amplifies behaviors learned in pretraining, 2025

work page 2025
[9]

Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025

work page 2025
[10]

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D. Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars, 2025

work page 2025
[11]

Llama 3.2: Revolutionizing edge ai and vision with open, customizable models, September 2024

Meta AI. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models, September 2024. 15 minute read

work page 2024
[12]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

work page 2024
[13]

Jiang, Ziju Shen, et al

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. https: //huggingface.co/datasets/Numinamath, 2024. Hugging Face repository, 13:9

work page 2024
[14]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

work page 2024
[15]

Solving quantitative reasoning problems with language models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022

work page 2022
[16]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021. 11

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[18]

Defining and characterizing reward gaming

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460– 9471, 2022

work page 2022
[19]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023

work page 2023
[20]

Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild, 2025

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild, 2025

work page 2025
[21]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[23]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015

work page 2015
[24]

Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Dapo: An open-source llm reinforcement learning system at scale, 2025

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

work page 2025
[26]

Stochastic variance reduction for nonconvex optimization

Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In ICML, pages 314–323, 2016

work page 2016
[27]

Off-policy proximal policy optimization

Wenjia Meng, Qian Zheng, Gang Pan, and Yilong Yin. Off-policy proximal policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9162–9170, 2023

work page 2023
[28]

Open r1: A fully open reproduction of deepseek-r1, January 2025

Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025

work page 2025
[29]

Numinamath

Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https://github.com/ project-numina/aimo-progress-prize/blob/main/report/nu...

work page 2024
[30]

Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024

work page 2024
[31]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

The llama 3 herd of models, 2024

Meta Team. The llama 3 herd of models, 2024

work page 2024
[33]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018. 12

work page internal anchor Pith review Pith/arXiv arXiv 2018
[34]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

work page 2024
[35]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Limo: Less is more for reasoning

Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025

work page arXiv 2025
[38]

Why distillation can outperform zero-rl: The role of flexible reasoning, 2025

Xiao Hu, Xingyu Lu, Liyuan Mao, YiFan Zhang, Tianke Zhang, Bin Wen, Fan Yang, Tingting Gao, and Guorui Zhou. Why distillation can outperform zero-rl: The role of flexible reasoning, 2025

work page 2025
[39]

Hewett, Mojan Javaheripi, Piero Kauffmann, James R

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu,...

work page 2024
[40]

A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond, 2025

Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian- Sheng Hua, Bowen Zhou, and Yu Cheng. A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond, 2025

work page 2025
[41]

On-policy rl with optimal reward baseline, 2025

Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, and Furu Wei. On-policy rl with optimal reward baseline, 2025

work page 2025
[42]

TTRL: Test-Time Reinforcement Learning

Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Bolt: Bootstrap long chain-of-thought in language models without distillation

Bo Pang, Hanze Dong, Jiacheng Xu, Silvio Savarese, Yingbo Zhou, and Caiming Xiong. Bolt: Bootstrap long chain-of-thought in language models without distillation. arXiv preprint arXiv:2502.03860, 2025

work page arXiv 2025
[44]

Concise reasoning via reinforcement learning

Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, and Kartik Talamadupula. Concise reasoning via reinforcement learning. arXiv preprint arXiv:2504.05185, 2025

work page arXiv 2025
[45]

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning

Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, and Yuting Liu. Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning. arXiv preprint arXiv:2504.04524, 2025

work page arXiv 2025
[47]

Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond

Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, et al. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond. arXiv preprint arXiv:2503.10460, 2025

work page arXiv 2025
[48]

Thinking preference optimization

Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, and Xiaotian Han. Thinking preference optimization. arXiv preprint arXiv:2502.13173, 2025

work page arXiv 2025
[49]

Asynchronous methods for deep reinforce- ment learning

V olodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforce- ment learning. In International conference on machine learning, pages 1928–1937. PmLR, 2016

work page 1928
[50]

Playing Atari with Deep Reinforcement Learning

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013. 13

work page internal anchor Pith review Pith/arXiv arXiv 2013
[51]

Addressing function approximation error in actor-critic methods

Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018

work page 2018
[52]

Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. Pmlr, 2018

work page 2018
[53]

Policy gradient meth- ods for reinforcement learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient meth- ods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999

work page 1999
[54]

Bridging supervised learning and reinforcement learning in math reasoning

Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, and Haoxiang Wang. Bridging supervised learning and reinforcement learning in math reasoning. arXiv preprint arXiv:2505.18116, 2025

work page arXiv 2025
[55]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

work page 2023
[56]

Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark

Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark. arXiv preprint arXiv:2501.05444, 2025

work page arXiv 2025
[57]

Scaling reasoning, losing control: Evaluating instruction following in large reasoning models

Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, and Yu Cheng. Scaling reasoning, losing control: Evaluating instruction following in large reasoning models. arXiv preprint arXiv:2505.14810, 2025

work page arXiv 2025
[58]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[59]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[60]

Reinforcement Learning: An Introduction

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2 edition, 2018

work page 2018
[61]

Statistical significance tests for machine translation evaluation

Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 conference on empirical methods in natural language processing, pages 388–395, 2004

work page 2004
[62]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Compu...

work page 2002
[63]

Scaling test-time compute without verification or RL is suboptimal

Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or RL is suboptimal. In ICLR 2025 Workshop: VerifAI: AI Verification in the Wild, 2025

work page 2025
[64]

Le, Sergey Levine, and Yi Ma

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V . Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025

work page 2025
[65]

<think>\n thoughts </think>\n

Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models. https://github.com/UCSC-VLAA/VLAA-Thinking, 2025. 14 Appendix A Limitations 15 B Theoretical Proof 15 B.1 Convergence Rate of the Importance-Weighted Policy Gradient...

work page 2025
[66]

</think> [Final Answer] Thus, the maximum possible number of isosceles triangles with two odd sides is 1003 . Tokens Length: 2623 Correctness: True Answer:

Dissection ..." </think> [Final Answer] Thus, the maximum possible number of isosceles triangles with two odd sides is 1003 . Tokens Length: 2623 Correctness: True Answer: "$1003$" Figure 7: Comparison of three approaches (SFT, On-Policy RL, and LUFFY) for a geometric problem. 23

work page