Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
Pith reviewed 2026-05-08 09:49 UTC · model grok-4.3
The pith
A new RL method lets LLMs learn reasoning solely from successful examples and still match or beat GRPO performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
POPO enables reinforcement learning with verifiable rewards to proceed from exclusively online positive rollouts by applying bounded importance sampling, from which implicit negative gradients arise naturally via rollout redistribution. Stability is maintained through a siamese policy network employing momentum-based adaptation, with a bounded similarity penalty in representation space replacing the KL-divergence term. On mathematical benchmarks with Qwen models, this yields performance comparable or superior to GRPO, including 36.67% accuracy on AIME 2025 versus GRPO's 30.00%.
What carries the argument
The Positive-Only Policy Optimization framework that uses bounded importance sampling over positive rollouts to produce implicit negative gradients, stabilized by a siamese policy network with momentum adaptation and a bounded similarity penalty.
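The review gives no loss function, so the following is only a minimal PyTorch-style sketch of what "bounded importance sampling over positive rollouts" with a redistribution step could look like: a clipped importance ratio applied solely to reward-1 rollouts, with the clipped weights renormalized over the positive group. The function name popo_positive_loss and the clipping range eps are illustrative assumptions, not the paper's definitions.

```python
import torch

def popo_positive_loss(logp_new, logp_old, rewards, eps=0.2):
    """Sketch (not the paper's exact objective): only verified-correct rollouts
    contribute, each weighted by a clipped importance ratio that is then
    renormalized ("redistributed") over the positive set.

    logp_new, logp_old: (batch,) sequence log-probs under the current and
    behavior policies; rewards: (batch,) binary verifiable rewards in {0, 1}.
    """
    pos = rewards > 0
    if not pos.any():                                  # no positive rollout this batch
        return logp_new.sum() * 0.0                    # zero loss, keeps the graph intact
    ratio = torch.exp(logp_new[pos] - logp_old[pos])   # importance weights
    ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # the "bounded" part
    weights = (ratio / ratio.sum()).detach()           # the redistribution step
    # Because the policy's probability mass is normalized, pushing mass toward
    # these rollouts implicitly pushes it away from everything unsampled.
    return -(weights * logp_new[pos]).sum()
```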
If this is right
- Policy updates can be computed without generating or scoring disjoint negative rollouts, reducing the sampling burden under sparse binary rewards.
- The redistribution of positive rollouts can supply sufficient gradient signal to guide learning away from poor reasoning paths.
- Replacing KL divergence with a bounded similarity penalty allows more flexible policy evolution while preserving stability (one plausible form is sketched after this list).
- The same components can be applied across different model sizes in the Qwen family and math benchmarks of multiple difficulty levels, with consistent results.
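The similarity-penalty bullet above is easiest to read with a concrete form in mind. The review does not specify one, so this sketch assumes a cosine distance between pooled hidden states of the online policy and its siamese (momentum) copy, in the spirit of BYOL/SimSiam; unlike KL divergence, cosine distance is bounded in [0, 2] and cannot blow up as the policies diverge.

```python
import torch.nn.functional as F

def bounded_similarity_penalty(h_online, h_momentum):
    """Assumed form of the penalty that replaces KL: cosine distance between
    representations of the online policy and the siamese momentum copy.
    h_online, h_momentum: (batch, dim) pooled hidden states."""
    z1 = F.normalize(h_online, dim=-1)
    z2 = F.normalize(h_momentum.detach(), dim=-1)  # stop-gradient on the siamese branch
    return (1.0 - (z1 * z2).sum(dim=-1)).mean()    # 0 when aligned, bounded by 2
```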
Where Pith is reading between the lines
- The approach could lower the cost of online RL by focusing data collection on high-reward traces only.
- It may extend to other sparse-reward domains where defining clear negative examples is difficult.
- Future methods could prioritize curating diverse positive demonstrations rather than balancing positive and negative samples.
- If the stabilization holds, similar siamese-momentum designs might appear in other online policy optimization settings.
Load-bearing premise
That implicit negative gradients will emerge naturally from redistributing rollouts to reinforce positive probabilities, and that the siamese network with momentum adaptation and a bounded similarity penalty will keep training stable without any negative samples.
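The "momentum-based adaptation law" is not spelled out in this summary; the standard reading, assumed here, is an exponential-moving-average update of the siamese copy's parameters, as in MoCo/BYOL target networks. The rate m corresponds to the "momentum adaptation rate" listed as a free parameter in the ledger below.

```python
import torch

@torch.no_grad()
def momentum_update(online_params, momentum_params, m=0.99):
    """Assumed EMA adaptation law: the siamese copy trails the online policy,
    so the similarity penalty is measured against a slowly moving target."""
    for p_online, p_mom in zip(online_params, momentum_params):
        p_mom.mul_(m).add_(p_online, alpha=1.0 - m)  # p_mom <- m*p_mom + (1-m)*p_online
```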
What would settle it
An experiment on the same Qwen-Math-7B model and AIME 2025 setup in which disabling the rollout redistribution step causes POPO accuracy to fall below GRPO's 30.00%.
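In code terms, this ablation amounts to replacing the redistributed weights in the earlier sketch with plain unit weights, which collapses the update to positive-only supervised fine-tuning, exactly the reduction the referee worries about in the report below.

```python
def popo_positive_loss_ablated(logp_new, logp_old, rewards):
    """Ablation sketch: redistribution disabled. Every positive rollout gets
    equal weight, i.e., positive-only supervised fine-tuning. Note logp_old
    is unused once the importance-weighting step is removed; the guard for
    batches with no positives is omitted for brevity."""
    pos = rewards > 0
    return -logp_new[pos].mean()
```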
Original abstract
Reinforcement learning with verifiable rewards (RLVR), owing to its deterministic verification, has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community has witnessed a rapid shift from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), in which GRPO replaces complicated advantage estimation with a simple estimate over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and their combinatorial vastness makes penalizing a few sampled negatives unlikely to cover a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning occurs exclusively via online positive rollouts. Specifically, POPO applies bounded importance sampling over the positive rollout set, so no disjoint negative rollouts are used for gradient guidance. We show that implicit negative gradients can emerge naturally through reinforcing the positive probability via rollout redistribution. POPO then stabilizes policy optimization through two mechanisms. First, it applies a siamese policy network with a momentum-based adaptation law for stabilized policy evolution. Second, we replace the KL-divergence with a bounded similarity penalty term in the siamese representation space. We conduct extensive experiments using publicly available, well-established text LLMs, e.g., the Qwen family, across mathematical benchmarks of all levels. Our experiments demonstrate that POPO achieves performance comparable to, or even superior to, GRPO. Notably, POPO achieves 36.67% on AIME 2025 with Qwen-Math-7B, outperforming GRPO's 30.00%. Our ablation and sweep studies further illustrate the necessity and robustness of POPO's components.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Positive-Only Policy Optimization (POPO), a novel RLVR framework for LLMs that performs policy optimization exclusively from online positive rollouts via bounded importance sampling, without any disjoint negative rollouts. It claims that implicit negative gradients emerge naturally from redistributing and reinforcing positive probabilities. Stability is achieved through a siamese policy network with momentum-based adaptation and a bounded similarity penalty replacing KL divergence. Experiments on mathematical reasoning benchmarks with Qwen-family models show POPO achieving performance comparable or superior to GRPO, including 36.67% on AIME 2025 versus GRPO's 30%.
Significance. If the implicit-negative-gradient mechanism is rigorously shown to penalize unsampled failures, POPO could meaningfully simplify RLVR pipelines by removing reliance on negative samples whose combinatorial scale and binary sparsity limit their utility. The use of publicly available models and mention of ablation/sweep studies are positive for reproducibility. The reported AIME gains, if statistically robust, would indicate practical value for LLM reasoning enhancement.
Major comments (2)
- [Abstract / Method] The assertion that 'implicit negative gradients can emerge naturally through reinforcing the positive probability via rollout redistribution' is presented without any equation for the resulting policy gradient after bounded importance sampling and redistribution. It is therefore impossible to confirm whether the update contains a term that systematically lowers probability on low-reward trajectories outside the sampled positive set, or whether the method reduces to positive-only supervised fine-tuning.
- [Experiments] The 6.67-point AIME 2025 gain (36.67% vs. GRPO's 30.00% with Qwen-Math-7B) is reported without any mention of the number of runs, standard deviation, or statistical significance testing. This detail is load-bearing for the claim of outperformance and must be supplied with the full results.
Minor comments (2)
- The abstract introduces the 'siamese policy network' and 'bounded similarity penalty' without providing the precise formulation or how the penalty is bounded relative to the original KL term; these should appear as explicit equations in §3 or §4.
- While the abstract states that ablation and sweep studies 'illustrate the necessity and robustness of POPO components,' no specific ablation results, tables, or figures are referenced, which should be added for completeness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving mathematical rigor and experimental reporting. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.
Point-by-point responses
Referee: [Abstract / Method] The assertion that 'implicit negative gradients can emerge naturally through reinforcing the positive probability via rollout redistribution' is presented without any equation for the resulting policy gradient after bounded importance sampling and redistribution. It is therefore impossible to confirm whether the update contains a term that systematically lowers probability on low-reward trajectories outside the sampled positive set, or whether the method reduces to positive-only supervised fine-tuning.
Authors: We agree that an explicit derivation is necessary to substantiate the claim of implicit negative gradients. While the manuscript describes the mechanism conceptually, it does not include the full policy gradient expression after bounded importance sampling and redistribution. In the revised version, we will add a dedicated subsection deriving the gradient, showing how the normalization over the positive rollout set produces a negative contribution to the probabilities of unsampled low-reward trajectories. This will include the mathematical steps demonstrating that the update differs from positive-only supervised fine-tuning by incorporating an implicit penalization term. Revision: yes.
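For readers who want the shape of the promised derivation now: under the simplifying assumption (ours, not necessarily the paper's) that the policy is a softmax over trajectory scores, the gradient of a positive rollout's log-probability already contains a negative term spread over all other trajectories.

```latex
% Reconstruction sketch, assuming
% \pi_\theta(\tau) = e^{f_\theta(\tau)} / \sum_{\tau'} e^{f_\theta(\tau')}:
\nabla_\theta \log \pi_\theta(\tau^+)
  = \nabla_\theta f_\theta(\tau^+)
  - \sum_{\tau'} \pi_\theta(\tau')\, \nabla_\theta f_\theta(\tau')
% The second term lowers the score of every trajectory in proportion to its
% current probability (unsampled failures included), which is one precise sense
% in which "implicit negative gradients" can arise from reinforcing positives
% alone. Whether POPO's bounded-IS update preserves this term is exactly what
% the requested derivation must show.
```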
Referee: [Experiments] The 6.67-point AIME 2025 gain (36.67% vs. GRPO's 30.00% with Qwen-Math-7B) is reported without any mention of the number of runs, standard deviation, or statistical significance testing. This detail is load-bearing for the claim of outperformance and must be supplied with the full results.
Authors: We concur that statistical details are essential to support the performance claims. The AIME 2025 results were obtained from 3 independent runs with different random seeds. In the revised manuscript, we will expand the Experiments section to report the mean accuracies, standard deviations (e.g., 36.67% ± 1.15% for POPO), and statistical significance tests (paired t-test p-values) for AIME 2025 and all other benchmarks. A new table will present these full results for transparency and reproducibility. Revision: yes.
Circularity Check
No circularity: derivation is self-contained
Full rationale
The paper proposes POPO as a novel RLVR framework using only positive rollouts with bounded importance sampling, siamese networks, and a bounded similarity penalty. The claim that implicit negative gradients emerge via positive probability redistribution is presented as a direct consequence of the described mechanism rather than a fitted parameter renamed as prediction or a result imported via self-citation. No equations or steps in the abstract or described components reduce by construction to the inputs; performance claims rest on empirical benchmarks (e.g., AIME 2025) rather than tautological derivations. The framework is independently motivated and tested without load-bearing self-referential loops.
Axiom & Free-Parameter Ledger
Free parameters (2)
- momentum adaptation rate
- similarity penalty bound
Axioms (2)
- Domain assumption: bounded importance sampling over positive rollouts provides valid gradient estimates for policy optimization.
- Ad hoc to paper: rollout redistribution naturally produces implicit negative gradients.