Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
Pith reviewed 2026-05-08 09:49 UTC · model grok-4.3
The pith
A new RL method lets LLMs learn reasoning solely from successful examples and still match or beat GRPO performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
POPO enables reinforcement learning with verifiable rewards to proceed from exclusively online positive rollouts by applying bounded importance sampling, from which implicit negative gradients arise naturally via rollout redistribution. Stability is maintained through a siamese policy network employing momentum-based adaptation, with a bounded similarity penalty in representation space replacing the KL-divergence term. On mathematical benchmarks with Qwen models, this yields performance comparable or superior to GRPO, including 36.67% accuracy on AIME 2025 versus GRPO's 30.00%.
What carries the argument
The Positive-Only Policy Optimization framework that uses bounded importance sampling over positive rollouts to produce implicit negative gradients, stabilized by a siamese policy network with momentum adaptation and a bounded similarity penalty.
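The review gives no loss function, so the following is only a minimal PyTorch-style sketch of what "bounded importance sampling over positive rollouts" with a redistribution step could look like: a clipped importance ratio applied solely to reward-1 rollouts, with the clipped weights renormalized over the positive group. The function name popo_positive_loss and the clipping range eps are illustrative assumptions, not the paper's definitions.

```python
import torch

def popo_positive_loss(logp_new, logp_old, rewards, eps=0.2):
    """Sketch (not the paper's exact objective): only verified-correct rollouts
    contribute, each weighted by a clipped importance ratio that is then
    renormalized ("redistributed") over the positive set.

    logp_new, logp_old: (batch,) sequence log-probs under the current and
    behavior policies; rewards: (batch,) binary verifiable rewards in {0, 1}.
    """
    pos = rewards > 0
    if not pos.any():                                  # no positive rollout this batch
        return logp_new.sum() * 0.0                    # zero loss, keeps the graph intact
    ratio = torch.exp(logp_new[pos] - logp_old[pos])   # importance weights
    ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # the "bounded" part
    weights = (ratio / ratio.sum()).detach()           # the redistribution step
    # Because the policy's probability mass is normalized, pushing mass toward
    # these rollouts implicitly pushes it away from everything unsampled.
    return -(weights * logp_new[pos]).sum()
```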
If this is right
- Policy updates can be computed without generating or scoring disjoint negative rollouts, reducing the sampling burden under sparse binary rewards.
- The redistribution of positive rollouts can supply sufficient gradient signal to guide learning away from poor reasoning paths.
- Replacing KL divergence with a bounded similarity penalty allows more flexible policy evolution while preserving stability (one plausible form is sketched after this list).
- The same components can be applied across different model sizes in the Qwen family and math benchmarks of multiple difficulty levels, with consistent results.
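The similarity-penalty bullet above is easiest to read with a concrete form in mind. The review does not specify one, so this sketch assumes a cosine distance between pooled hidden states of the online policy and its siamese (momentum) copy, in the spirit of BYOL/SimSiam; unlike KL divergence, cosine distance is bounded in [0, 2] and cannot blow up as the policies diverge.

```python
import torch.nn.functional as F

def bounded_similarity_penalty(h_online, h_momentum):
    """Assumed form of the penalty that replaces KL: cosine distance between
    representations of the online policy and the siamese momentum copy.
    h_online, h_momentum: (batch, dim) pooled hidden states."""
    z1 = F.normalize(h_online, dim=-1)
    z2 = F.normalize(h_momentum.detach(), dim=-1)  # stop-gradient on the siamese branch
    return (1.0 - (z1 * z2).sum(dim=-1)).mean()    # 0 when aligned, bounded by 2
```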
Where Pith is reading between the lines
- The approach could lower the cost of online RL by focusing data collection on high-reward traces only.
- It may extend to other sparse-reward domains where defining clear negative examples is difficult.
- Future methods could prioritize curating diverse positive demonstrations rather than balancing positive and negative samples.
- If the stabilization holds, similar siamese-momentum designs might appear in other online policy optimization settings.
Load-bearing premise
That implicit negative gradients will emerge naturally from redistributing rollouts to reinforce positive probabilities, and that the siamese network with momentum adaptation and a bounded similarity penalty will keep training stable without any negative samples.
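The "momentum-based adaptation law" is not spelled out in this summary; the standard reading, assumed here, is an exponential-moving-average update of the siamese copy's parameters, as in MoCo/BYOL target networks. The rate m corresponds to the "momentum adaptation rate" listed as a free parameter in the ledger below.

```python
import torch

@torch.no_grad()
def momentum_update(online_params, momentum_params, m=0.99):
    """Assumed EMA adaptation law: the siamese copy trails the online policy,
    so the similarity penalty is measured against a slowly moving target."""
    for p_online, p_mom in zip(online_params, momentum_params):
        p_mom.mul_(m).add_(p_online, alpha=1.0 - m)  # p_mom <- m*p_mom + (1-m)*p_online
```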
What would settle it
An experiment on the same Qwen-Math-7B model and AIME 2025 setup in which disabling the rollout redistribution step causes POPO accuracy to fall below GRPO's 30.00%.
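In code terms, this ablation amounts to replacing the redistributed weights in the earlier sketch with plain unit weights, which collapses the update to positive-only supervised fine-tuning, exactly the reduction the referee worries about in the report below.

```python
def popo_positive_loss_ablated(logp_new, logp_old, rewards):
    """Ablation sketch: redistribution disabled. Every positive rollout gets
    equal weight, i.e., positive-only supervised fine-tuning. Note logp_old
    is unused once the importance-weighting step is removed; the guard for
    batches with no positives is omitted for brevity."""
    pos = rewards > 0
    return -logp_new[pos].mean()
```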
Original abstract
Reinforcement learning with verifiable rewards (RLVR), owing to its deterministic verification, has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community has witnessed a rapid shift from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), in which GRPO replaces complicated advantage estimation with a simple estimate over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and their combinatorial vastness makes penalizing a few sampled negatives unlikely to cover a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning occurs exclusively via online positive rollouts. Specifically, POPO applies bounded importance sampling over the positive rollout set, so no disjoint negative rollouts are used for gradient guidance. We show that implicit negative gradients can emerge naturally through reinforcing the positive probability via rollout redistribution. POPO then stabilizes policy optimization through two mechanisms. First, it applies a siamese policy network with a momentum-based adaptation law for stabilized policy evolution. Second, we replace the KL-divergence with a bounded similarity penalty term in the siamese representation space. We conduct extensive experiments using publicly available, well-established text LLMs, e.g., the Qwen family, across mathematical benchmarks of all levels. Our experiments demonstrate that POPO achieves performance comparable to, or even superior to, GRPO. Notably, POPO achieves 36.67% on AIME 2025 with Qwen-Math-7B, outperforming GRPO's 30.00%. Our ablation and sweep studies further illustrate the necessity and robustness of POPO's components.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Positive-Only Policy Optimization (POPO), a novel RLVR framework for LLMs that performs policy optimization exclusively from online positive rollouts via bounded importance sampling, without any disjoint negative rollouts. It claims that implicit negative gradients emerge naturally from redistributing and reinforcing positive probabilities. Stability is achieved through a siamese policy network with momentum-based adaptation and a bounded similarity penalty replacing KL divergence. Experiments on mathematical reasoning benchmarks with Qwen-family models show POPO achieving performance comparable or superior to GRPO, including 36.67% on AIME 2025 versus GRPO's 30%.
Significance. If the implicit-negative-gradient mechanism is rigorously shown to penalize unsampled failures, POPO could meaningfully simplify RLVR pipelines by removing reliance on negative samples whose combinatorial scale and binary sparsity limit their utility. The use of publicly available models and mention of ablation/sweep studies are positive for reproducibility. The reported AIME gains, if statistically robust, would indicate practical value for LLM reasoning enhancement.
Major comments (2)
- [Abstract / Method] The assertion that 'implicit negative gradients can emerge naturally through reinforcing the positive probability via rollout redistribution' is presented without any equation for the resulting policy gradient after bounded importance sampling and redistribution. It is therefore impossible to confirm whether the update contains a term that systematically lowers probability on low-reward trajectories outside the sampled positive set, or whether the method reduces to positive-only supervised fine-tuning.
- [Experiments] The 6.67-point AIME 2025 gain (36.67% vs. GRPO's 30.00% with Qwen-Math-7B) is reported without any mention of the number of runs, standard deviation, or statistical significance testing. This detail is load-bearing for the claim of outperformance and must be supplied with the full results.
Minor comments (2)
- The abstract introduces the 'siamese policy network' and 'bounded similarity penalty' without providing the precise formulation or how the penalty is bounded relative to the original KL term; these should appear as explicit equations in §3 or §4.
- While the abstract states that ablation and sweep studies 'illustrate the necessity and robustness of POPO components,' no specific ablation results, tables, or figures are referenced, which should be added for completeness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving mathematical rigor and experimental reporting. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.
Point-by-point responses
Referee: [Abstract / Method] The assertion that 'implicit negative gradients can emerge naturally through reinforcing the positive probability via rollout redistribution' is presented without any equation for the resulting policy gradient after bounded importance sampling and redistribution. It is therefore impossible to confirm whether the update contains a term that systematically lowers probability on low-reward trajectories outside the sampled positive set, or whether the method reduces to positive-only supervised fine-tuning.
Authors: We agree that an explicit derivation is necessary to substantiate the claim of implicit negative gradients. While the manuscript describes the mechanism conceptually, it does not include the full policy gradient expression after bounded importance sampling and redistribution. In the revised version, we will add a dedicated subsection deriving the gradient, showing how the normalization over the positive rollout set produces a negative contribution to the probabilities of unsampled low-reward trajectories. This will include the mathematical steps demonstrating that the update differs from positive-only supervised fine-tuning by incorporating an implicit penalization term. Revision: yes.
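For readers who want the shape of the promised derivation now: under the simplifying assumption (ours, not necessarily the paper's) that the policy is a softmax over trajectory scores, the gradient of a positive rollout's log-probability already contains a negative term spread over all other trajectories.

```latex
% Reconstruction sketch, assuming
% \pi_\theta(\tau) = e^{f_\theta(\tau)} / \sum_{\tau'} e^{f_\theta(\tau')}:
\nabla_\theta \log \pi_\theta(\tau^+)
  = \nabla_\theta f_\theta(\tau^+)
  - \sum_{\tau'} \pi_\theta(\tau')\, \nabla_\theta f_\theta(\tau')
% The second term lowers the score of every trajectory in proportion to its
% current probability (unsampled failures included), which is one precise sense
% in which "implicit negative gradients" can arise from reinforcing positives
% alone. Whether POPO's bounded-IS update preserves this term is exactly what
% the requested derivation must show.
```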
Referee: [Experiments] The 6.67-point AIME 2025 gain (36.67% vs. GRPO's 30.00% with Qwen-Math-7B) is reported without any mention of the number of runs, standard deviation, or statistical significance testing. This detail is load-bearing for the claim of outperformance and must be supplied with the full results.
Authors: We concur that statistical details are essential to support the performance claims. The AIME 2025 results were obtained from 3 independent runs with different random seeds. In the revised manuscript, we will expand the Experiments section to report the mean accuracies, standard deviations (e.g., 36.67% ± 1.15% for POPO), and statistical significance tests (paired t-test p-values) for AIME 2025 and all other benchmarks. A new table will present these full results for transparency and reproducibility. Revision: yes.
Circularity Check
No circularity: derivation is self-contained
Full rationale
The paper proposes POPO as a novel RLVR framework using only positive rollouts with bounded importance sampling, siamese networks, and a bounded similarity penalty. The claim that implicit negative gradients emerge via positive probability redistribution is presented as a direct consequence of the described mechanism rather than a fitted parameter renamed as prediction or a result imported via self-citation. No equations or steps in the abstract or described components reduce by construction to the inputs; performance claims rest on empirical benchmarks (e.g., AIME 2025) rather than tautological derivations. The framework is independently motivated and tested without load-bearing self-referential loops.
Axiom & Free-Parameter Ledger
Free parameters (2)
- momentum adaptation rate
- similarity penalty bound
Axioms (2)
- Domain assumption: bounded importance sampling over positive rollouts provides valid gradient estimates for policy optimization.
- Ad hoc to paper: rollout redistribution naturally produces implicit negative gradients.