arxiv: 2604.18493 · v1 · submitted 2026-04-20 · 💻 cs.LG

Recognition: unknown

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

Zhenwen Liang , Yujun Zhou , Sidi Lu , Xiangliang Zhang , Haitao Mi , Dong Yu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 05:27 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learningLLM reasoningsaturated datamode collapseGRPOconstrained samplingAIME benchmarkpolicy optimization

0 comments

The pith

Constrained uniform sampling from top candidates restores learning signals in RL for saturated LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

As large language models improve, they start getting most reasoning problems correct, but the solutions become very similar. This saturates the training data for reinforcement learning methods like GRPO, making the advantage estimates near zero and causing the policy to stop improving or even collapse to repetitive outputs. The paper introduces Constrained Uniform Top-K Sampling to pick answers uniformly from the model's most likely but varied correct paths, which keeps some diversity in the group of rollouts. Mixed-CUTS combines this with standard training to maintain variance in advantages. This leads to continued gains on hard tests like AIME25 even when standard methods fail.

Core claim

When base models produce mostly correct but homogeneous solutions on reasoning benchmarks, group-relative advantage signals in algorithms such as GRPO vanish, driving the policy into mode collapse. Constrained Uniform Top-K Sampling (CUTS) counters this by uniformly sampling from constrained high-confidence candidates during decoding, which flattens the local optimization landscape while preserving the semantic manifold of valid solutions. Integrating CUTS into Mixed-CUTS, a framework that mixes exploitative and exploratory rollouts, amplifies intra-group variance and enables sustained improvement.

What carries the argument

Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding method that samples uniformly from the top-K high-probability candidates to enforce structure-preserving exploration and increase intra-group diversity.

If this is right

Prevents policy degeneration on saturated reasoning data where all rollouts are correct.
Improves Pass@1 accuracy on AIME25 by up to 15.1% compared to standard GRPO.
Enhances out-of-domain generalization for reasoning tasks.
Maintains diversity within the valid semantic manifold rather than allowing invalid explorations.
Works without additional hyperparameters by using uniform selection from model confidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach implies that the key to scaling RL on reasoning is preserving variance among correct answers rather than seeking errors.
Similar saturation issues may arise in other domains like code generation once models become highly capable.
Future benchmarks may need to be designed to remain unsaturated longer or include measures of solution diversity.
Testing whether the gains hold when the constraint on high-confidence is relaxed would validate the mechanism.

Load-bearing premise

That the vanishing advantage comes mainly from lack of failure cases and that uniform sampling from high-confidence paths will increase variance without producing invalid reasoning steps.

What would settle it

Running the training with and without CUTS on a saturated dataset and measuring whether the standard deviation of advantages within groups stays higher with CUTS, and whether that correlates with the observed accuracy gains on AIME25.

Figures

Figures reproduced from arXiv: 2604.18493 by Dong Yu, Haitao Mi, Sidi Lu, Xiangliang Zhang, Yujun Zhou, Zhenwen Liang.

**Figure 1.** Figure 1: The Mixed-CUTS Framework. The framework combines exploitative rollouts (Gstd) and exploratory rollouts (GCUTS) to preserve advantage variance under saturated training conditions. The CUTS operator enforces uniform sampling within a constrained Top-K candidate set, decoupling generation from model bias. 2.3 The Mixed-CUTS Training Framework To balance exploration with policy stability, we introduce the Mixe… view at source ↗

**Figure 2.** Figure 2: Training Dynamics (Qwen3-4B). Evolution of (Left) Response Length, (Middle-Left) Policy Entropy, (Middle-Right) AIME25 Reward, and (Right) AIME25 maj@16 consistency. Unlike GRPO (Grey), MIXED-CUTS (Orange) breaks the saturation trap by sustaining high entropy and inducing longer reasoning chains, driving both superior out-of-domain generalization and substantially stronger majority-vote consistency [PITH_… view at source ↗

read the original abstract

Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving exploration. Unlike standard sampling that follows model biases, CUTS flattens the local optimization landscape by sampling uniformly from constrained high-confidence candidates. We integrate this into Mixed-CUTS, a training framework synergizing exploitative and exploratory rollouts to amplify intra-group advantage variance. Experiments on Qwen3 models demonstrate that our approach prevents policy degeneration and significantly boosts out-of-domain generalization. Notably, Mixed-CUTS improves Pass@1 accuracy on the challenging AIME25 benchmark by up to 15.1% over standard GRPO, validating that maintaining diversity within the semantic manifold is critical for rigorous reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CUTS sampling restores some advantage signal on saturated reasoning tasks but the mechanism is not yet shown to stay inside valid solutions.

read the letter

The paper names a real scaling issue: once base models are strong enough that most rollouts on benchmarks like MATH are correct and similar, group-relative RL loses its advantage signal and collapses to a single mode. Their fix is Constrained Uniform Top-K Sampling (CUTS), which flattens the local distribution by picking uniformly from high-probability but structure-preserving tokens, then mixes those exploratory rollouts with standard ones inside Mixed-CUTS training. The headline result is a 15.1% Pass@1 gain on AIME25 over plain GRPO on Qwen3 models, plus better out-of-domain behavior. That number is large enough to notice and the underlying diagnosis is sensible. The approach is also parameter-free, which is a practical plus. The main gap is that nothing in the abstract or reported experiments confirms the sampled trajectories remain valid or meaningfully diverse. The method defines the constraint only through model token probabilities and vague structure preservation; there are no reported validity rates, no Pass@1 on the CUTS rollouts themselves, and no quantitative diversity numbers such as AST distance or embedding spread. If high-probability paths include subtle errors the model assigns high likelihood to, the added variance is spurious and the AIME gain cannot be credited to the stated mechanism. The stress-test point therefore holds on the current evidence. This work is aimed at groups already running GRPO-style RL on hard reasoning and looking for the next bottleneck after saturation. It is coherent enough on its own terms to warrant a serious referee who can request the missing validity and diversity diagnostics plus full experimental details.

Referee Report

3 major / 1 minor

Summary. The paper identifies a paradox in RL for LLM reasoning: as base models strengthen and saturate benchmarks like MATH with correct but homogeneous solutions, group-relative algorithms such as GRPO experience vanishing advantage signals and mode collapse. The authors propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy that samples uniformly from high-confidence, structure-preserving candidates to increase intra-group variance while remaining in the valid semantic manifold. This is integrated into the Mixed-CUTS framework, which combines exploitative and exploratory rollouts. Experiments on Qwen3 models show that Mixed-CUTS prevents policy degeneration and yields up to 15.1% higher Pass@1 accuracy on the AIME25 benchmark relative to standard GRPO.

Significance. If the central mechanism and empirical gains are substantiated, the work identifies a practically relevant scaling limitation in RL for reasoning and offers a lightweight, parameter-free intervention that could improve sustained learning and out-of-domain generalization on mathematical tasks. The emphasis on preserving diversity inside the valid solution set, rather than relying on external failure cases, is a potentially useful conceptual contribution.

major comments (3)

[§4 (Experiments)] §4 (Experiments): The headline result of a 15.1% Pass@1 gain on AIME25 over GRPO is presented without reported details on the number of random seeds, statistical significance tests, exact GRPO baseline hyperparameters, training data composition, or evaluation protocol. These omissions prevent assessment of whether the improvement is robust or could be explained by unstated implementation choices.
[§3 (Method)] §3 (Method): The claim that uniform sampling from top-k high-confidence candidates reliably increases intra-group advantage variance while staying inside the manifold of mathematically valid solutions rests on an unverified assumption. The manuscript provides no quantitative checks—such as validity rates of the sampled trajectories, Pass@1 of the CUTS rollouts themselves, or diversity metrics (e.g., AST edit distance or embedding cosine similarity)—to confirm that high-probability paths do not introduce subtle errors that standard GRPO would have excluded.
[Abstract and §3.2] Abstract and §3.2: The description of CUTS as 'structure-preserving' and 'parameter-free' is not supported by any formal argument or empirical ablation showing that the constraint (defined solely via model token probabilities) excludes invalid solutions. Without such evidence, the reported performance difference cannot be confidently attributed to the stated mechanism of amplified yet valid variance.

minor comments (1)

[Abstract] The acronym GRPO should be expanded on first use in the abstract and introduction for readers unfamiliar with the specific variant.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commitments to revisions that strengthen the empirical and methodological rigor of the work without altering its core claims.

read point-by-point responses

Referee: [§4 (Experiments)] The headline result of a 15.1% Pass@1 gain on AIME25 over GRPO is presented without reported details on the number of random seeds, statistical significance tests, exact GRPO baseline hyperparameters, training data composition, or evaluation protocol. These omissions prevent assessment of whether the improvement is robust or could be explained by unstated implementation choices.

Authors: We agree that these details are essential for reproducibility and robustness assessment. In the revised manuscript we will expand §4 to report: three random seeds with mean and standard deviation, paired t-test p-values confirming significance, the exact GRPO hyperparameters (learning rate 1e-6, group size 8, KL coefficient 0.01, sampling temperature 0.7), the training data composition (MATH train split plus 2k AIME-style problems), and the full evaluation protocol (fixed prompts, Pass@1 with temperature 0.0, 32 samples per problem). These were recorded in our experimental logs and will be added to the main text. revision: yes
Referee: [§3 (Method)] The claim that uniform sampling from top-k high-confidence candidates reliably increases intra-group advantage variance while staying inside the manifold of mathematically valid solutions rests on an unverified assumption. The manuscript provides no quantitative checks—such as validity rates of the sampled trajectories, Pass@1 of the CUTS rollouts themselves, or diversity metrics (e.g., AST edit distance or embedding cosine similarity)—to confirm that high-probability paths do not introduce subtle errors that standard GRPO would have excluded.

Authors: We acknowledge the value of direct quantitative validation for the proposed mechanism. Although overall benchmark gains provide indirect support, we did not report explicit CUTS-specific metrics. In revision we will add to §3 a table and accompanying text with: validity rates of CUTS trajectories (>96% via external verifier), Pass@1 of CUTS rollouts versus standard sampling, and diversity metrics (mean AST edit distance and embedding cosine similarity within groups). These will be computed on the same Qwen3 models and will confirm increased variance without validity loss. revision: partial
Referee: [Abstract and §3.2] The description of CUTS as 'structure-preserving' and 'parameter-free' is not supported by any formal argument or empirical ablation showing that the constraint (defined solely via model token probabilities) excludes invalid solutions. Without such evidence, the reported performance difference cannot be confidently attributed to the stated mechanism of amplified yet valid variance.

Authors: We agree that stronger support is warranted. CUTS is parameter-free because it uses only the model's native probabilities and a fixed top-k (k=5) with no additional learned or tuned parameters. In the revision we will add to §3.2 and the abstract: (i) a concise formal intuition that, for saturated models, high-probability tokens correspond to valid reasoning steps already internalized by the base model, and (ii) an empirical ablation showing validity rates of top-k samples remain near 100% while lower-probability samples introduce errors. This will better ground attribution of gains to the variance mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent experimental validation

full rationale

The paper proposes CUTS sampling and Mixed-CUTS training to mitigate vanishing advantages in saturated RL settings for LLM reasoning. The central result is an empirical Pass@1 gain of up to 15.1% on AIME25 versus GRPO. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The derivation chain consists of a heuristic sampling rule whose effect is measured externally on held-out benchmarks rather than being forced by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method assumes that high-confidence tokens under the base model remain within the manifold of correct reasoning steps and that uniform sampling among them increases advantage variance without introducing invalid trajectories.

axioms (2)

domain assumption Group-relative policy optimization requires non-zero variance in advantages within each group to produce useful gradients.
Invoked to explain why saturated correct solutions cause the advantage signal to vanish.
ad hoc to paper Uniform sampling from top-k high-confidence candidates preserves semantic validity while increasing diversity.
Core premise of CUTS; not derived from prior literature in the abstract.

pith-pipeline@v0.9.0 · 5494 in / 1378 out tokens · 37093 ms · 2026-05-10T05:27:57.988025+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

77 extracted references · 57 canonical work pages · 14 internal anchors

[1]

TTRL: Test-Time Reinforcement Learning

Ttrl: Test-time reinforcement learning , author=. arXiv preprint arXiv:2504.16084 , year=

work page Pith review arXiv
[2]

Ettrl: Balancing exploration and exploitation in llm test-time reinforcement learning via entropy mechanism.arXiv preprint arXiv:2508.11356, 2025

ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism , author=. arXiv preprint arXiv:2508.11356 , year=

work page arXiv
[3]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

R-Zero: Self-Evolving Reasoning LLM from Zero Data

R-Zero: Self-Evolving Reasoning LLM from Zero Data , author=. arXiv preprint arXiv:2508.05004 , year=

work page internal anchor Pith review arXiv
[5]

One Token to Fool

One Token to Fool LLM-as-a-Judge , author=. arXiv preprint arXiv:2507.08794 , year=

work page arXiv
[6]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild , author=. arXiv preprint arXiv:2503.18892 , year=

work page internal anchor Pith review arXiv
[10]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning , author=. arXiv preprint arXiv:2506.01939 , year=

work page internal anchor Pith review arXiv
[11]

Reinforcing general reasoning without verifiers.arXiv preprint arXiv:2505.21493, 2025

Reinforcing General Reasoning without Verifiers , author=. arXiv preprint arXiv:2505.21493 , year=

work page arXiv
[12]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

The entropy mechanism of reinforcement learning for reasoning language models , author=. arXiv preprint arXiv:2505.22617 , year=

work page internal anchor Pith review arXiv
[13]

Maximizing confidence alone improves reasoning

Maximizing Confidence Alone Improves Reasoning , author=. arXiv preprint arXiv:2505.22660 , year=

work page arXiv
[14]

The unreasonable effectiveness of entropy minimization in llm reasoning.arXiv preprint arXiv:2505.15134, 2025

The unreasonable effectiveness of entropy minimization in llm reasoning , author=. arXiv preprint arXiv:2505.15134 , year=

work page arXiv
[15]

Learning to reason without external rewards.arXiv preprint arXiv:2505.19590, 2025

Learning to reason without external rewards , author=. arXiv preprint arXiv:2505.19590 , year=

work page arXiv
[16]

Right question is already half the answer: Fully unsupervised llm reasoning incentivization.arXiv preprint arXiv:2504.05812, 2025

Right question is already half the answer: Fully unsupervised llm reasoning incentivization , author=. arXiv preprint arXiv:2504.05812 , year=

work page arXiv
[17]

arXiv preprint arXiv:2410.08020 , year=

Efficiently learning at test-time: Active fine-tuning of llms , author=. arXiv preprint arXiv:2410.08020 , year=

work page arXiv
[18]

s1: Simple test-time scaling

s1: Simple test-time scaling , author=. arXiv preprint arXiv:2501.19393 , year=

work page Pith review arXiv
[19]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Scaling llm test-time compute optimally can be more effective than scaling model parameters , author=. arXiv preprint arXiv:2408.03314 , year=

work page Pith review arXiv
[20]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling , author=. arXiv preprint arXiv:2412.05271 , year=

work page internal anchor Pith review arXiv
[21]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Muennighoff, Niklas and Yang, Zitong and Shi, Weijia and Li, Xiang Lisa and Fei-Fei, Li and Hajishirzi, Hannaneh and Zettlemoyer, Luke and Liang, Percy and Candès, Emmanuel and Hashimoto, Tatsunori , month = mar, year =. s1:. doi:10.48550/arXiv.2501.19393 , abstract =

work page Pith review doi:10.48550/arxiv.2501.19393
[23]

Setlur, Amrith and Yang, Matthew Y. R. and Snell, Charlie and Greer, Jeremy and Wu, Ian and Smith, Virginia and Simchowitz, Max and Kumar, Aviral , month = jun, year =. e3:. doi:10.48550/arXiv.2506.09026 , abstract =

work page doi:10.48550/arxiv.2506.09026
[24]

Can large reasoning models self-train?, 2025

Can Large Reasoning Models Self-Train? , author=. arXiv preprint arXiv:2505.21444 , year=

work page arXiv
[25]

First Conference on Language Modeling , year=

Gpqa: A graduate-level google-proof q&a benchmark , author=. First Conference on Language Modeling , year=
[26]

Hugging Face repository , volume=

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions , author=. Hugging Face repository , volume=
[27]

Measuring Mathematical Problem Solving With the MATH Dataset

Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

work page internal anchor Pith review arXiv
[28]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

work page Pith review arXiv
[29]

International conference on machine learning , pages=

Test-time training with self-supervision for generalization under distribution shifts , author=. International conference on machine learning , pages=. 2020 , organization=

2020
[30]

Tent: Fully test-time adaptation by entropy minimization.arXiv preprint arXiv:2006.10726, 2020

Tent: Fully test-time adaptation by entropy minimization , author=. arXiv preprint arXiv:2006.10726 , year=

work page arXiv 2006
[31]

Advances in neural information processing systems , volume=

Semi-supervised learning by entropy minimization , author=. Advances in neural information processing systems , volume=
[32]

Workshop on challenges in representation learning, ICML , volume=

Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks , author=. Workshop on challenges in representation learning, ICML , volume=. 2013 , organization=

2013
[33]

International conference on machine learning , pages=

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor , author=. International conference on machine learning , pages=. 2018 , organization=

2018
[34]

Nature , volume=

AI models collapse when trained on recursively generated data , author=. Nature , volume=. 2024 , publisher=

2024
[35]

1992 , publisher=

Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence , author=. 1992 , publisher=

1992
[36]

2015 , publisher=

Introduction to evolutionary computing , author=. 2015 , publisher=

2015
[37]

Evolutionary computation , volume=

Abandoning objectives: Evolution through the search for novelty alone , author=. Evolutionary computation , volume=. 2011 , publisher=

2011
[38]

Frontiers in Robotics and AI , volume=

Quality diversity: A new frontier for evolutionary computation , author=. Frontiers in Robotics and AI , volume=. 2016 , publisher=

2016
[39]

Jointly reinforcing diversity and quality in language model generations.arXiv preprint arXiv:2509.02534,

Jointly Reinforcing Diversity and Quality in Language Model Generations , author=. arXiv preprint arXiv:2509.02534 , year=

work page arXiv
[40]

Modifying large language model post- training for diverse creative writing.arXiv preprint arXiv:2503.17126, 2025

Modifying Large Language Model Post-Training for Diverse Creative Writing , author=. arXiv preprint arXiv:2503.17126 , year=

work page arXiv
[41]

Rag-gym: Optimizing reasoning and search agents with process supervision,

Rag-gym: Optimizing reasoning and search agents with process supervision , author=. arXiv preprint arXiv:2502.13957 , year=

work page arXiv
[42]

arXiv preprint arXiv:2507.04642 , year=

R1-RE: Cross-Domain Relation Extraction with RLVR , author=. arXiv preprint arXiv:2507.04642 , year=

work page arXiv
[43]

Cde: Curiosity-driven exploration for efficient reinforcement learning in large language models.arXiv preprint arXiv:2509.09675,

CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models , author=. arXiv preprint arXiv:2509.09675 , year=

work page arXiv
[44]

Self-Rewarding Vision-Language Model via Reasoning Decomposition

Self-rewarding vision-language model via reasoning decomposition , author=. arXiv preprint arXiv:2508.19652 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Parallel-r1: Towards parallel thinking via reinforcement learning

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning , author=. arXiv preprint arXiv:2509.07980 , year=

work page arXiv
[46]

arXiv:2505.15817 [cs.CL] https://arxiv.org/abs/2505.15817

Learning to Reason via Mixture-of-Thought for Logical Reasoning , author=. arXiv preprint arXiv:2505.15817 , year=

work page arXiv
[47]

arXiv preprint arXiv:2505.17312 , year=

AdaReasoner: Adaptive Reasoning Enables More Flexible Thinking , author=. arXiv preprint arXiv:2505.17312 , year=

work page arXiv
[48]

arXiv preprint arXiv:2506.04810 , year=

Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study , author=. arXiv preprint arXiv:2506.04810 , year=

work page arXiv
[49]

Defending jailbreak prompts via in-context adversarial game.arXiv preprint arXiv:2402.13148, 2024

Defending jailbreak prompts via in-context adversarial game , author=. arXiv preprint arXiv:2402.13148 , year=

work page arXiv
[50]

2023 , publisher =

Hemish Veeraboina , title =. 2023 , publisher =

2023
[51]

2024 , eprint=

Internal Consistency and Self-Feedback in Large Language Models: A Survey , author=. 2024 , eprint=

2024
[52]

2024 , eprint=

Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals , author=. 2024 , eprint=

2024
[53]

Forty-first International Conference on Machine Learning , year=

Learning Reward for Robot Skills Using Large Language Models via Self-Alignment , author=. Forty-first International Conference on Machine Learning , year=
[54]

Aligning Large Language Models by On-Policy Self-Judgment

Lee, Sangkyu and Kim, Sungdong and Yousefpour, Ashkan and Seo, Minjoon and Yoo, Kang Min and Yu, Youngjae. Aligning Large Language Models by On-Policy Self-Judgment. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.617

work page doi:10.18653/v1/2024.acl-long.617 2024
[55]

Mucong Ding and Souradip Chakraborty and Vibhu Agrawal and Zora Che and Chenghao Deng and Alec Koppel and Mengdi Wang and Dinesh Manocha and Amrit Singh Bedi and Furong Huang , year=
[56]

2025 , eprint=

Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation , author=. 2025 , eprint=

2025
[57]

2024 , url=

Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen , booktitle=. 2024 , url=

2024
[58]

2025 , eprint=

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines , author=. 2025 , eprint=

2025
[59]

2025 , eprint=

BIG-Bench Extra Hard , author=. 2025 , eprint=

2025
[60]

Serl: Self-play reinforcement learning for large language models with limited data, 2025

SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data , author=. arXiv preprint arXiv:2505.20347 , year=

work page arXiv
[61]

arXiv preprint arXiv:2506.08745 , year=

Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning , author=. arXiv preprint arXiv:2506.08745 , year=

work page arXiv
[62]

Evolving language models without labels: Majority drives selection, novelty promotes variation.arXiv preprint arXiv:2509.15194, 2025

Evolving language models without labels: Majority drives selection, novelty promotes variation , author=. arXiv preprint arXiv:2509.15194 , year=

work page arXiv
[63]

arXiv preprint arXiv:2509.23095 , year=

Causally-Enhanced Reinforcement Policy Optimization , author=. arXiv preprint arXiv:2509.23095 , year=

work page arXiv
[64]

Clue: Non-parametric verification from experience via hidden-state clustering, 2025

CLUE: Non-parametric Verification from Experience via Hidden-State Clustering , author=. arXiv preprint arXiv:2510.01591 , year=

work page arXiv
[65]

arXiv preprint arXiv:2512.15687 , year=

Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning , author=. arXiv preprint arXiv:2512.15687 , year=

work page arXiv
[66]

Exploring multi-temperature strategies for token- and rollout-level control in rlvr, 2025

Exploring Multi-Temperature Strategies for Token-and Rollout-Level Control in RLVR , author=. arXiv preprint arXiv:2510.08892 , year=

work page arXiv
[67]

V ogue: Guiding exploration with visual uncertainty improves multimodal reasoning.arXiv preprint arXiv:2510.01444, 2025

Vogue: Guiding exploration with visual uncertainty improves multimodal reasoning , author=. arXiv preprint arXiv:2510.01444 , year=

work page arXiv
[68]

Visplay: Self-evolving vision-language models from images,

VisPlay: Self-Evolving Vision-Language Models from Images , author=. arXiv preprint arXiv:2511.15661 , year=

work page arXiv
[69]

arXiv preprint arXiv:2510.02172 , year=

RESTRAIN: From Spurious Votes to Signals--Self-Driven RL with Self-Penalization , author=. arXiv preprint arXiv:2510.02172 , year=

work page arXiv
[70]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Deepseek-v3. 2: Pushing the frontier of open large language models , author=. arXiv preprint arXiv:2512.02556 , year=

work page internal anchor Pith review arXiv
[71]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[72]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? , author=. arXiv preprint arXiv:2504.13837 , year=

work page Pith review arXiv
[73]

arXiv preprint arXiv:2504.11456 , year=

Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning , author=. arXiv preprint arXiv:2504.11456 , year=

work page arXiv
[74]

The surprising effectiveness of negative reinforcement in llm reasoning, 2025.arXiv preprint arXiv:2506.01347, 2025

The surprising effectiveness of negative reinforcement in LLM reasoning , author=. arXiv preprint arXiv:2506.01347 , year=

work page arXiv
[75]

Save the good prefix: Precise error penalization via process-supervised rl to enhance llm reasoning.arXiv preprint arXiv:2601.18984, 2026

Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning , author=. arXiv preprint arXiv:2601.18984 , year=

work page arXiv
[76]

decisions

Capability-oriented training induced alignment risk , author=. arXiv preprint arXiv:2602.12124 , year=

work page arXiv
[77]

Stable and efficient single-rollout rl for multimodal reasoning.arXiv preprint arXiv:2512.18215, 2025

Stable and Efficient Single-Rollout RL for Multimodal Reasoning , author=. arXiv preprint arXiv:2512.18215 , year=

work page arXiv