GRPO-LEAD: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models

2025 · DOI 10.18653/v1/2025.emnlp-main.287

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

RLMF uses quality of model self-judgments to refine RL rankings and select training data, achieving SOTA faithful calibration while preserving accuracy and outperforming standard RL by up to 63%.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards

cs.AI · 2026-06-21 · unverdicted · novelty 5.0

ACOER applies adaptive correct-only efficiency rewards in GRPO to avoid reward collapse, yielding higher accuracy and over 60% fewer tokens on math reasoning benchmarks.

citing papers explorer

Showing 3 of 3 citing papers.

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

cs.CL · 2026-06-30 · unverdicted · none · ref 126

RLMF uses quality of model self-judgments to refine RL rankings and select training data, achieving SOTA faithful calibration while preserving accuracy and outperforming standard RL by up to 63%.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

cs.LG · 2026-04-25 · unverdicted · none · ref 85

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards

cs.AI · 2026-06-21 · unverdicted · none · ref 24

ACOER applies adaptive correct-only efficiency rewards in GRPO to avoid reward collapse, yielding higher accuracy and over 60% fewer tokens on math reasoning benchmarks.
