DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.
Drpo: Efficient reasoning via decoupled reward policy optimization
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5verdicts
UNVERDICTED 5roles
baseline 1polarities
baseline 1representative citing papers
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.
PASS middleware independently standardizes process/outcome/format streams, derives value-homogeneous chunks, and converts cumulative returns to average value density, yielding consistent pass@1 gains over GRPO baselines in two domains and two signal paradigms.
ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.
ESPO adds on-the-fly early stopping to PPO rollouts for LLM math reasoning using cumulative surrogate regret, improving AIME, AMC, and MATH-500 scores over PPO while cutting over 20% rollout tokens on a 7B model.
citing papers explorer
-
Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking
DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.
-
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.
-
Process Advantage Signal Shaping: A Paradigm-Agnostic Middleware for Process-Supervised RL in LLM Reasoners
PASS middleware independently standardizes process/outcome/format streams, derives value-homogeneous chunks, and converts cumulative returns to average value density, yielding consistent pass@1 gains over GRPO baselines in two domains and two signal paradigms.
-
ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.
-
ESPO: Early-Stopping Proximal Policy Optimization
ESPO adds on-the-fly early stopping to PPO rollouts for LLM math reasoning using cumulative surrogate regret, improving AIME, AMC, and MATH-500 scores over PPO while cutting over 20% rollout tokens on a 7B model.