pith. machine review for the scientific record. sign in

arxiv: 2504.14945 · v5 · submitted 2025-04-21 · 💻 cs.LG · cs.AI· cs.CL

Recognition: 2 theorem links

· Lean Theorem

Learning to Reason under Off-Policy Guidance

Authors on Pith no claims yet

Pith reviewed 2026-05-15 23:13 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords off-policy RLVRreasoning modelsMixed-Policy GRPOimportance samplingmathematical benchmarksweak model trainingout-of-distribution generalization
0
0 comments X

The pith

LUFFY mixes off-policy reasoning traces with on-policy rollouts to overcome the limits of standard RLVR in training reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LUFFY as a framework that extends reinforcement learning with verifiable rewards by incorporating off-policy reasoning traces instead of restricting learning to a model's own outputs. This addresses the core restriction in existing RLVR methods that prevents acquisition of reasoning abilities beyond initial model capabilities. LUFFY balances imitation and exploration through a Mixed-Policy GRPO setup augmented by regularized importance sampling to avoid rigid copying. Results show average gains exceeding 6.4 points on six math benchmarks and 6.2 points on out-of-distribution tasks, with the method succeeding on weak models where pure on-policy RLVR fails entirely.

Core claim

LUFFY augments RLVR by dynamically combining off-policy reasoning traces with on-policy rollouts using the Mixed-Policy GRPO framework and policy shaping via regularized importance sampling. This yields average performance gains exceeding 6.4 points across six math benchmarks and over 6.2 points on out-of-distribution tasks. Most notably, it succeeds in training weak models in cases where on-policy RLVR fails entirely.

What carries the argument

Mixed-Policy GRPO combined with regularized importance sampling, which balances off-policy imitation and on-policy exploration while preventing superficial imitation.

If this is right

  • Models can acquire reasoning skills beyond their initial capabilities through off-policy guidance.
  • Training succeeds for weak models where pure on-policy methods fail.
  • Performance improves by over 6.4 points on average for math reasoning benchmarks.
  • Out-of-distribution generalization gains exceed 6.2 points.
  • The framework provides a theoretically guaranteed convergence rate via Mixed-Policy GRPO.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Off-policy data from stronger models could be used to bootstrap capabilities in smaller or weaker models more broadly.
  • This approach might extend to other verifiable reward domains such as code generation or scientific problem solving.
  • Regularized importance sampling could be adapted to other RL settings involving mixed policies to avoid distribution shift.
  • The proportion of off-policy traces could be scaled dynamically based on model strength in future training pipelines.

Load-bearing premise

Off-policy reasoning traces can be mixed with on-policy rollouts via regularized importance sampling without introducing harmful distribution shift or causing superficial imitation.

What would settle it

An experiment showing that removing the off-policy component or the regularization leads to performance no better than standard on-policy RLVR on the same math benchmarks, or failure to train weak models.

read the original abstract

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards~(\textit{RLVR}). However, existing \textit{RLVR} approaches are inherently ``on-policy'', limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce \textbf{LUFFY} (\textbf{L}earning to reason \textbf{U}nder o\textbf{FF}-polic\textbf{Y} guidance), a framework that augments \textit{RLVR} with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, alongside policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an over \textbf{+6.4} average gain across six math benchmarks and an advantage of over \textbf{+6.2} points in out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LUFFY, a framework that augments on-policy RLVR with off-policy reasoning traces generated by stronger models. It combines Mixed-Policy GRPO (with a claimed theoretical convergence guarantee) and regularized importance sampling to balance imitation and exploration, reporting an average gain of over +6.4 points across six math benchmarks, +6.2 points on out-of-distribution tasks, and successful training of weak models in regimes where pure on-policy RLVR fails.

Significance. If the empirical results are robust, the work would be significant for RLVR research by demonstrating a practical way to escape the capability ceiling of on-policy methods through controlled off-policy guidance. The explicit convergence claim for the mixed-policy objective is a positive feature that distinguishes it from purely heuristic mixing approaches.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the headline +6.4 average gain and the claim that LUFFY succeeds where on-policy RLVR “completely fails” are presented without reported statistical significance, number of seeds, or details on how the off-policy traces were generated and filtered; these omissions make it impossible to determine whether the gains are attributable to the proposed mixing mechanism rather than simply higher-quality demonstration data.
  2. [§3.2] §3.2 (Mixed-Policy GRPO with Regularized Importance Sampling): the paper states that regularized importance sampling prevents superficial imitation, yet no policy-divergence diagnostics (KL divergence, total variation, or effective sample size) are reported between the learned policy and the off-policy behavior policy; without these, the central claim that the regularization successfully balances imitation and exploration remains unverified.
  3. [§4] §4 (Ablations): no ablation isolates the contribution of the regularization term in the importance-sampling objective; the reported gains could be explained by the off-policy data alone, undermining the load-bearing assertion that the specific mixed-policy formulation is what enables training of weak models.
minor comments (2)
  1. [Abstract] The acronym expansion in the title and abstract (“oFF-policY”) contains inconsistent capitalization that should be standardized.
  2. [§3] Notation for the importance-sampling weights and regularization coefficient is introduced without an explicit table of symbols, making the equations in §3 harder to follow on first reading.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each point below and will revise the manuscript accordingly to improve reproducibility and strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline +6.4 average gain and the claim that LUFFY succeeds where on-policy RLVR “completely fails” are presented without reported statistical significance, number of seeds, or details on how the off-policy traces were generated and filtered; these omissions make it impossible to determine whether the gains are attributable to the proposed mixing mechanism rather than simply higher-quality demonstration data.

    Authors: We agree that statistical significance, seed counts, and generation/filtering details are necessary for reproducibility and to attribute gains to the mixing mechanism. In the revision we will report results over at least three random seeds with standard deviations, include p-values for key comparisons, and add explicit details on off-policy trace generation (stronger models with fixed prompts) and filtering (reward-threshold selection). These additions will clarify that performance improvements stem from the mixed-policy formulation rather than data quality alone. revision: yes

  2. Referee: [§3.2] §3.2 (Mixed-Policy GRPO with Regularized Importance Sampling): the paper states that regularized importance sampling prevents superficial imitation, yet no policy-divergence diagnostics (KL divergence, total variation, or effective sample size) are reported between the learned policy and the off-policy behavior policy; without these, the central claim that the regularization successfully balances imitation and exploration remains unverified.

    Authors: We acknowledge the value of direct diagnostics. The revised manuscript will include KL divergence, total variation distance, and effective sample size measurements between the learned policy and the off-policy behavior policy, presented in §3.2. These metrics will empirically verify that regularization prevents collapse to superficial imitation while preserving exploration. revision: yes

  3. Referee: [§4] §4 (Ablations): no ablation isolates the contribution of the regularization term in the importance-sampling objective; the reported gains could be explained by the off-policy data alone, undermining the load-bearing assertion that the specific mixed-policy formulation is what enables training of weak models.

    Authors: We agree that an ablation isolating the regularization term is required. We will add to §4 a direct comparison of Mixed-Policy GRPO with and without the regularization term in the importance-sampling objective. The new results will show degraded performance and increased imitation without regularization, confirming its role in enabling training of weak models. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents LUFFY as a practical augmentation of RLVR that mixes off-policy traces with on-policy rollouts via Mixed-Policy GRPO and regularized importance sampling. Reported gains (+6.4 average, +6.2 OOD) are framed as experimental outcomes on math benchmarks rather than predictions derived from fitted parameters or self-referential definitions. The stated theoretical convergence rate is attributed to the Mixed-Policy GRPO component without any equation in the provided text reducing the core claims to tautology or input renaming. No load-bearing step collapses by construction to its own inputs, self-citation chain, or ansatz smuggling. The framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The framework name LUFFY and the two algorithmic pieces (Mixed-Policy GRPO, regularized importance sampling) are introduced as the contribution.

pith-pipeline@v0.9.0 · 5578 in / 1165 out tokens · 16593 ms · 2026-05-15T23:13:37.696664+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  2. Multi-Rollout On-Policy Distillation via Peer Successes and Failures

    cs.LG 2026-05 unverdicted novelty 7.0

    MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.

  3. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  4. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

  5. Near-Future Policy Optimization

    cs.LG 2026-04 unverdicted novelty 7.0

    NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating co...

  6. Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

    cs.LG 2026-04 unverdicted novelty 7.0

    NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

  7. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  8. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  9. Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  10. Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

    cs.CL 2026-04 unverdicted novelty 6.0

    Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...

  11. SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.

  12. From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

    cs.LG 2026-04 unverdicted novelty 6.0

    PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...

  13. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.

  14. TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

    cs.DC 2026-04 unverdicted novelty 6.0

    TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.

  15. Can LLMs Learn to Reason Robustly under Noisy Supervision?

    cs.LG 2026-04 conditional novelty 6.0

    Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...

  16. Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

    cs.LG 2026-03 unverdicted novelty 6.0

    HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.

  17. OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 5.0

    OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.

  18. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  19. A Survey of On-Policy Distillation for Large Language Models

    cs.LG 2026-04 unverdicted novelty 2.0

    On-policy distillation reframes LLM knowledge transfer as iterative correction on student trajectories rather than single-pass imitation, with the survey organizing the field along divergence design, feedback sources,...

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 18 Pith papers · 17 internal anchors

  1. [1]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  4. [4]

    Chi, Quoc V Le, and Denny Zhou

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

  5. [5]

    7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient

    Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, and Junxian He. 7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient. https://hkust-nlp.notion.site/simplerl-reason, 2025. Notion Blog

  6. [6]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

  7. [7]

    Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025

  8. [8]

    Echo chamber: Rl post-training amplifies behaviors learned in pretraining, 2025

    Rosie Zhao, Alexandru Meterez, Sham Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: Rl post-training amplifies behaviors learned in pretraining, 2025

  9. [9]

    Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025

  10. [10]

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D. Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars, 2025

  11. [11]

    Llama 3.2: Revolutionizing edge ai and vision with open, customizable models, September 2024

    Meta AI. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models, September 2024. 15 minute read

  12. [12]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

  13. [13]

    Jiang, Ziju Shen, et al

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. https: //huggingface.co/datasets/Numinamath, 2024. Hugging Face repository, 13:9

  14. [14]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  15. [15]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022

  16. [16]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021. 11

  17. [17]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016

  18. [18]

    Defining and characterizing reward gaming

    Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460– 9471, 2022

  19. [19]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023

  20. [20]

    Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild, 2025

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild, 2025

  21. [21]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  22. [23]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015

  23. [24]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025

  24. [25]

    Dapo: An open-source llm reinforcement learning system at scale, 2025

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  25. [26]

    Stochastic variance reduction for nonconvex optimization

    Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In ICML, pages 314–323, 2016

  26. [27]

    Off-policy proximal policy optimization

    Wenjia Meng, Qian Zheng, Gang Pan, and Yilong Yin. Off-policy proximal policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9162–9170, 2023

  27. [28]

    Open r1: A fully open reproduction of deepseek-r1, January 2025

    Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025

  28. [29]

    Numinamath

    Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-1.5](https://github.com/ project-numina/aimo-progress-prize/blob/main/report/nu...

  29. [30]

    Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024

  30. [31]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  31. [32]

    The llama 3 herd of models, 2024

    Meta Team. The llama 3 herd of models, 2024

  32. [33]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018. 12

  33. [34]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024

  34. [35]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574, 2024

  35. [36]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  36. [37]

    Limo: Less is more for reasoning

    Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025

  37. [38]

    Why distillation can outperform zero-rl: The role of flexible reasoning, 2025

    Xiao Hu, Xingyu Lu, Liyuan Mao, YiFan Zhang, Tianke Zhang, Bin Wen, Fan Yang, Tingting Gao, and Guorui Zhou. Why distillation can outperform zero-rl: The role of flexible reasoning, 2025

  38. [39]

    Hewett, Mojan Javaheripi, Piero Kauffmann, James R

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu,...

  39. [40]

    A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond, 2025

    Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian- Sheng Hua, Bowen Zhou, and Yu Cheng. A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond, 2025

  40. [41]

    On-policy rl with optimal reward baseline, 2025

    Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, and Furu Wei. On-policy rl with optimal reward baseline, 2025

  41. [42]

    TTRL: Test-Time Reinforcement Learning

    Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025

  42. [43]

    Bolt: Bootstrap long chain-of-thought in language models without distillation

    Bo Pang, Hanze Dong, Jiacheng Xu, Silvio Savarese, Yingbo Zhou, and Caiming Xiong. Bolt: Bootstrap long chain-of-thought in language models without distillation. arXiv preprint arXiv:2502.03860, 2025

  43. [44]

    Concise reasoning via reinforcement learning

    Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, and Kartik Talamadupula. Concise reasoning via reinforcement learning. arXiv preprint arXiv:2504.05185, 2025

  44. [45]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025

  45. [46]

    Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning

    Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, and Yuting Liu. Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning. arXiv preprint arXiv:2504.04524, 2025

  46. [47]

    Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond

    Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, et al. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond. arXiv preprint arXiv:2503.10460, 2025

  47. [48]

    Thinking preference optimization

    Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, and Xiaotian Han. Thinking preference optimization. arXiv preprint arXiv:2502.13173, 2025

  48. [49]

    Asynchronous methods for deep reinforce- ment learning

    V olodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforce- ment learning. In International conference on machine learning, pages 1928–1937. PmLR, 2016

  49. [50]

    Playing Atari with Deep Reinforcement Learning

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013. 13

  50. [51]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018

  51. [52]

    Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. Pmlr, 2018

  52. [53]

    Policy gradient meth- ods for reinforcement learning with function approximation

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient meth- ods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999

  53. [54]

    Bridging supervised learning and reinforcement learning in math reasoning

    Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, and Haoxiang Wang. Bridging supervised learning and reinforcement learning in math reasoning. arXiv preprint arXiv:2505.18116, 2025

  54. [55]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  55. [56]

    Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark

    Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark. arXiv preprint arXiv:2501.05444, 2025

  56. [57]

    Scaling reasoning, losing control: Evaluating instruction following in large reasoning models

    Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, and Yu Cheng. Scaling reasoning, losing control: Evaluating instruction following in large reasoning models. arXiv preprint arXiv:2505.14810, 2025

  57. [58]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017

  58. [59]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020

  59. [60]

    Reinforcement Learning: An Introduction

    Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2 edition, 2018

  60. [61]

    Statistical significance tests for machine translation evaluation

    Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 conference on empirical methods in natural language processing, pages 388–395, 2004

  61. [62]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics , pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Compu...

  62. [63]

    Scaling test-time compute without verification or RL is suboptimal

    Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar. Scaling test-time compute without verification or RL is suboptimal. In ICLR 2025 Workshop: VerifAI: AI Verification in the Wild, 2025

  63. [64]

    Le, Sergey Levine, and Yi Ma

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V . Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025

  64. [65]

    <think>\n thoughts </think>\n

    Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models. https://github.com/UCSC-VLAA/VLAA-Thinking, 2025. 14 Appendix A Limitations 15 B Theoretical Proof 15 B.1 Convergence Rate of the Importance-Weighted Policy Gradient...

  65. [66]

    </think> [Final Answer] Thus, the maximum possible number of isosceles triangles with two odd sides is 1003 . Tokens Length: 2623 Correctness: True Answer:

    Dissection ..." </think> [Final Answer] Thus, the maximum possible number of isosceles triangles with two odd sides is 1003 . Tokens Length: 2623 Correctness: True Answer: "$1003$" Figure 7: Comparison of three approaches (SFT, On-Policy RL, and LUFFY) for a geometric problem. 23