arXiv preprint arXiv:2603.08660

Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, et al. · 2026 · arXiv 2603.08660

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 1 · support 1

representative citing papers

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

OPRD: On-Policy Representation Distillation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.

Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

TTRL gains are reinterpreted as mostly sharpening rather than learning, with an identified extinction window causing net corruption; TTRL-Guard mitigates this via FRS, MPS, and RCSU, improving pass@1.

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.

RemoteZero: Geospatial Reasoning with Zero Human Annotations

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

RemoteZero replaces coordinate supervision with intrinsic semantic verification to enable box-free GRPO training and self-evolution for geospatial reasoning.

Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation

cs.LG · 2026-07-02 · unverdicted · novelty 5.0

Neuron-OPSD uses neuron activations to guide data selection and teacher construction for annotation-free on-policy self-distillation in LLMs, claiming better in-domain results without harming cross-domain performance or calibration.

Continual Self-Improvement with Lightweight Experiential Latent Memories

cs.LG · 2026-06-16 · unverdicted · novelty 5.0

Lightweight modular latent memories trained on self-generated rewards enable continual self-improvement in LLMs, outperforming raw ICL and matching offline training on math benchmarks.

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 5.0

GeoMin uses geometric distribution modeling on labeled data to assess self-reward reliability, enabling better performance in semi-supervised RLVR with only 10% of typical annotations.

When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-25 · unverdicted · novelty 5.0

RLAVR uses the Corrective Advantage Gap metric and CARE policy to actively acquire ground-truth labels for key samples, stabilizing RLVR training and boosting performance with limited annotation budgets.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

citing papers explorer

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations · cs.CV · 2026-04-20 · unverdicted · none · ref 10
OPRD: On-Policy Representation Distillation · cs.LG · 2026-06-04 · unverdicted · none · ref 11
Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting · cs.LG · 2026-05-19 · unverdicted · none · ref 41
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information · cs.LG · 2026-05-12 · unverdicted · none · ref 6
RemoteZero: Geospatial Reasoning with Zero Human Annotations · cs.CV · 2026-05-06 · unverdicted · none · ref 3
Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation · cs.LG · 2026-07-02 · unverdicted · none · ref 4
Continual Self-Improvement with Lightweight Experiential Latent Memories · cs.LG · 2026-06-16 · unverdicted · none · ref 13
GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling · cs.LG · 2026-06-03 · unverdicted · none · ref 9
When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards · cs.LG · 2026-05-25 · unverdicted · none · ref 15
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation · cs.CL · 2026-05-12 · unverdicted · none · ref 77 · 2 links
