Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

· 2026 · cs.LG · arXiv 2604.11056

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

open full Pith review browse 5 citing papers arXiv PDF

abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome rewards make token-level credit assignment difficult. We study token-level credit as a reward-conditioned shift from the behavior policy to a hindsight posterior. In autoregressive RLVR, this shift can be expressed through Conditional Mutual Information (CMI), which shows that token entropy upper-bounds possible hindsight credit. Entropy, however, indicates capacity rather than update direction, so we introduce the Four Quadrant Decomposition to separate updates by reward polarity and token entropy. Controlled interventions show that these two factors jointly shape token updates. Sustained reasoning gains concentrate in signed high-entropy quadrants, whereas low-entropy updates saturate quickly. Based on this analysis, we propose Hindsight-Aware Policy Optimization (HAPO), a sign-preserving modification to GRPO that performs capacity-guided advantage reallocation. Experiments on mathematical reasoning benchmarks in two model settings show that HAPO achieves competitive performance among entropy-aware baselines.

representative citing papers

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

ARCA assigns token credit in LoRA-based LLM RL from the norm of adapter-induced hidden state changes, yielding non-degenerate distributions and competitive performance on MATH tasks with Qwen3-1.7B under GRPO.

From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

cs.LG · 2026-05-13 · conditional · novelty 7.0

AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.

Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

On the Position Bias of On-Policy Distillation

cs.LG · 2026-06-21 · unverdicted · novelty 6.0

Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

citing papers explorer

Showing 5 of 5 citing papers.

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate cs.LG · 2026-05-29 · unverdicted · none · ref 33 · internal anchor
ARCA assigns token credit in LoRA-based LLM RL from the norm of adapter-induced hidden state changes, yielding non-degenerate distributions and competitive performance on MATH tasks with Qwen3-1.7B under GRPO.
From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning cs.LG · 2026-05-13 · conditional · none · ref 35 · internal anchor
AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.
Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning cs.CL · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
On the Position Bias of On-Policy Distillation cs.LG · 2026-06-21 · unverdicted · none · ref 13 · internal anchor
Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning cs.LG · 2026-05-21 · unverdicted · none · ref 32 · internal anchor
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

fields

years

verdicts

representative citing papers

citing papers explorer