Mitigating overthinking in large reasoning models via difficulty-aware reinforcement learning. arXiv preprint arXiv:2601.21418

Qian Wan, Ziao Xu, Luona Wei, Xiaoxuan Shen, Jianwen Sun · 2026 · arXiv 2601.21418

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

OPSD functions primarily as a compression stage after RLVR in mathematical reasoning, preserving accuracy on correct rollouts but reducing it on incorrect ones, supporting an SFT-then-RLVR-then-OPSD pipeline.

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

cs.LG · 2026-03-05 · conditional · novelty 6.0

CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.

Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

cs.LG · 2026-06-07 · unverdicted · novelty 5.0

AdaGRPO gates GRPO reinforcement learning with supervised NLL using per-sample binary clips based on policy difficulty and reward discriminability. It raises HR@10 from 11.01% to 12.18% while keeping hallucination below 0.22% on large-scale e-commerce data, and also shows A/B gains.

citing papers explorer

OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models · cs.AI · 2026-05-07 · unverdicted · none · ref 29
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation · cs.LG · 2026-03-05 · conditional · none · ref 20
Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation · cs.LG · 2026-06-07 · unverdicted · none · ref 21
