arXiv preprint arXiv:2603.08660 , year=

How far can unsupervised rlvr scale llm training?arXiv preprint arXiv:2603 · 2026 · arXiv 2603.08660

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

citation-role summary

background 2

citation-polarity summary

background 1 support 1

representative citing papers

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

Empirical evaluation on LiveCodeBench shows certainty-based RLIF yields early gains followed by output shortening and reasoning collapse, providing no advantage for RLVR initialization on code tasks.

OPRD: On-Policy Representation Distillation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

TTRL gains are reinterpreted as mostly sharpening rather than learning, with an identified extinction window causing net corruption; TTRL-Guard mitigates via FRS, MPS, and RCSU for improved pass@1.

Self-Improvement Can Self-Regress: The Rise-and-Collapse Failure Mode of LLM Self-Training

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

REINFORCE self-training on competitive programming tasks exhibits robust rise-then-collapse in pass@1; CARE, ES, and GRPO mitigate it in model-size-dependent ways across Qwen-2.5-3B/7B and a Gemma pilot.

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

RemoteZero: Geospatial Reasoning with Zero Human Annotations

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

RemoteZero replaces coordinate supervision with intrinsic semantic verification to enable box-free GRPO training and self-evolution for geospatial reasoning.

Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation

cs.LG · 2026-07-02 · unverdicted · novelty 5.0

Neuron-OPSD uses neuron activations to guide data selection and teacher construction for annotation-free on-policy self-distillation in LLMs, claiming better in-domain results without harming cross-domain performance or calibration.

Continual Self-Improvement with Lightweight Experiential Latent Memories

cs.LG · 2026-06-16 · unverdicted · novelty 5.0

Lightweight modular latent memories trained on self-generated rewards enable continual self-improvement in LLMs, outperforming raw ICL and matching offline training on math benchmarks.

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 5.0

GeoMin uses geometric distribution modeling on labeled data to assess self-reward reliability, enabling better performance in semi-supervised RLVR with only 10% of typical annotations.

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-25 · unverdicted · novelty 5.0

RLAVR uses the Corrective Advantage Gap metric and CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance with limited annotation budgets.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.

arXiv preprint arXiv:2603.08660 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer