Skip-Connected Policy Optimization for Implicit Advantage

· 2026 · cs.LG · arXiv 2604.08690

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Group Relative Policy Optimization (GRPO) has proven effective in RLVR by using outcome-based rewards. While fine-grained dense rewards can theoretically improve performance, we reveal that under practical sampling budgets, Monte Carlo estimation yields high-variance and sign-inconsistent advantages for early reasoning tokens, paradoxically underperforming outcome-only GRPO. We propose Skip-Connected Optimization (SKPO), which decomposes reasoning into upstream and downstream phases: upstream receives dense rewards from downstream Monte Carlo sampling with single-stream optimization; downstream maintains group-relative optimization, where a skip connection concatenates the upstream segment with the original problem, enabling the model to leverage helpful upstream reasoning while preserving the freedom to bypass flawed reasoning through direct problem access. Experiments demonstrate improvements of 3.91% and 6.17% relative gains over the strongest baselines on Qwen2.5-Math-7B and Llama-3.2-3B respectively across mathematical benchmarks and out-of-domain tasks including general reasoning and code generation. Further analysis reveals an implicit advantage: SKPO generates trajectories with higher intermediate-step quality even when matched for final correctness.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.

citing papers explorer

Showing 1 of 1 citing paper.

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control cs.LG · 2026-05-12 · unverdicted · none · ref 7 · 2 links · internal anchor
Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.

Skip-Connected Policy Optimization for Implicit Advantage

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer