DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization

Gang Li, Yan Chen, Ming Lin, Tianbao Yang · 2025 · arXiv 2510.04474

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.

Process Advantage Signal Shaping: A Paradigm-Agnostic Middleware for Process-Supervised RL in LLM Reasoners

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

PASS middleware independently standardizes process/outcome/format streams, derives value-homogeneous chunks, and converts cumulative returns to average value density, yielding consistent pass@1 gains over GRPO baselines in two domains and two signal paradigms.

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.

ESPO: Early-Stopping Proximal Policy Optimization

cs.LG · 2026-05-28 · unverdicted · novelty 4.0

ESPO adds on-the-fly early stopping to PPO rollouts for LLM math reasoning using cumulative surrogate regret, improving AIME, AMC, and MATH-500 scores over PPO while cutting rollout tokens by over 20% on a 7B model.

citing papers explorer

Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking · cs.CL · 2026-07-01 · unverdicted · none · ref 62
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models · cs.LG · 2026-05-10 · unverdicted · none · ref 18
Process Advantage Signal Shaping: A Paradigm-Agnostic Middleware for Process-Supervised RL in LLM Reasoners · cs.AI · 2026-06-28 · unverdicted · none · ref 10
ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning · cs.AI · 2026-06-02 · unverdicted · none · ref 13
ESPO: Early-Stopping Proximal Policy Optimization · cs.LG · 2026-05-28 · unverdicted · none · ref 7
