hub Canonical reference

Soft Adaptive Policy Optimization

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu · 2025 · cs.LG · arXiv 2511.20347

Canonical reference. 70% of citing Pith papers cite this work as background.

32 Pith papers citing it

Background 70% of classified citations

open full Pith review browse 32 citing papers arXiv PDF

abstract

Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance-a phenomenon exacerbated in Mixture-of-Experts models-leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 method 3

citation-polarity summary

background 7 use method 3

representative citing papers

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preserving exact synchrony.

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

Near-Future Policy Optimization

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.

The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

cs.SD · 2026-01-06 · unverdicted · novelty 7.0

TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Smaller models provide temporally correlated policy-level diversity that serves as structured exploration for training larger models in GRPO, yielding accuracy gains such as +8.8% on AIME 24 with reduced compute via the S2L-PO framework.

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.

GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

GRID trains Qwen-based 4B models on a task-bank reward system of multi-select questions and regex targets to extract security KGs from CTI text, reporting 84.62% precision and 64.91% recall on 249 articles from five sources.

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.

HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.

AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

cs.CV · 2026-04-12 · unverdicted · novelty 6.0 · 3 refs

Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

cs.LG · 2026-02-10 · unverdicted · novelty 6.0

Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.

Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

cs.LG · 2026-02-04 · conditional · novelty 6.0

An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.

LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 5.0

LamPO introduces a pairwise decomposed advantage with confidence-aware weighting to replace scalar group advantages in group-relative policy optimization for reasoning models.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

cs.CL · 2026-05-12 · unverdicted · novelty 5.0 · 2 refs

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

cs.RO · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

cs.AI · 2026-05-04 · unverdicted · novelty 5.0

Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.

citing papers explorer

Showing 32 of 32 citing papers.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images cs.CV · 2026-04-23 · unverdicted · none · ref 7 · internal anchor
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning cs.CV · 2026-05-10 · unverdicted · none · ref 68 · internal anchor
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning cs.LG · 2026-05-09 · unverdicted · none · ref 6 · internal anchor
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preserving exact synchrony.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective cs.LG · 2026-05-08 · unverdicted · none · ref 45 · internal anchor
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL · 2026-05-07 · unverdicted · none · ref 43 · internal anchor
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
Near-Future Policy Optimization cs.LG · 2026-04-22 · unverdicted · none · ref 6 · internal anchor
NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.
The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models cs.SD · 2026-01-06 · unverdicted · none · ref 7 · internal anchor
TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.
Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO cs.LG · 2026-05-29 · unverdicted · none · ref 8 · internal anchor
Smaller models provide temporally correlated policy-level diversity that serves as structured exploration for training larger models in GRPO, yielding accuracy gains such as +8.8% on AIME 24 with reduced compute via the S2L-PO framework.
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards cs.LG · 2026-05-20 · unverdicted · none · ref 52 · internal anchor
DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.
Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.
GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction cs.AI · 2026-05-15 · unverdicted · none · ref 6 · internal anchor
GRID trains Qwen-based 4B models on a task-bank reward system of multi-select questions and regex targets to extract security KGs from CTI text, reporting 84.62% precision and 64.91% recall on 249 articles from five sources.
Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control cs.LG · 2026-05-12 · unverdicted · none · ref 11 · 2 links · internal anchor
Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control cs.LG · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards cs.CV · 2026-04-21 · unverdicted · none · ref 5 · internal anchor
AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning cs.CV · 2026-04-12 · unverdicted · none · ref 13 · 3 links · internal anchor
Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR cs.LG · 2026-04-06 · unverdicted · none · ref 20 · internal anchor
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective cs.LG · 2026-02-10 · unverdicted · none · ref 6 · internal anchor
Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.
Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design cs.LG · 2026-02-04 · conditional · none · ref 5 · internal anchor
An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.
Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals cs.LG · 2026-05-21 · unverdicted · none · ref 5 · internal anchor
Proposes Near-boundary Stochastic Rescue (NSR) as a stochastic modification to clipping in RLVR that recovers near-boundary signals and yields gains over baselines like DAPO and GSPO.
LamPO: A Lambda Style Policy Optimization for Reasoning Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 3 · internal anchor
LamPO introduces a pairwise decomposed advantage with confidence-aware weighting to replace scalar group advantages in group-relative policy optimization for reasoning models.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation cs.CL · 2026-05-12 · unverdicted · none · ref 65 · 2 links · internal anchor
On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.
Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT cs.RO · 2026-05-09 · unverdicted · none · ref 28 · 2 links · internal anchor
ConSFT is a gradient-scaling fine-tuning objective for flow-matching VLAs that bounds parameter disruption via model-confidence weighting, yielding over 20% better capability retention than vanilla SFT on LIBERO and RoboTwin.
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length cs.AI · 2026-05-04 · unverdicted · none · ref 20 · internal anchor
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks cs.CV · 2026-04-09 · unverdicted · none · ref 13 · internal anchor
OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.
Gym-V: A Unified Vision Environment System for Agentic Vision Research cs.CV · 2026-03-16 · unverdicted · none · ref 4 · internal anchor
Gym-V supplies 179 visual environments showing that observation scaffolding like captions and rules matters more for training success than the choice of RL algorithm.
Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation cs.LG · 2026-05-01 · unverdicted · none · ref 57 · internal anchor
Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.
Rethinking Agentic Reinforcement Learning In Large Language Models cs.AI · 2026-04-30 · unverdicted · none · ref 21 · 3 links · internal anchor
The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
A Brief Overview: On-Policy Self-Distillation In Large Language Models cs.HC · 2026-05-18 · unverdicted · none · ref 8 · 2 links · internal anchor
This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.
Reinforcement Learning from Human Feedback cs.LG · 2025-04-16 · unverdicted · none · ref 141 · internal anchor
The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective cs.LG · 2026-05-13 · unreviewed · ref 10 · 2 links · internal anchor
Cost-Aware Learning cs.LG · 2026-04-30 · unreviewed · ref 8 · internal anchor
STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens cs.CL · 2026-02-17 · unreviewed · ref 7 · internal anchor

Soft Adaptive Policy Optimization

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer