Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
27 Pith papers cite this work.
abstract
AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high-performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT. Our work suggests that careful adaptation to the characteristics of LLM alignment makes it possible to benefit from online RL optimization at low cost.
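For orientation, here is a minimal sketch of the REINFORCE-style objective the abstract argues for, using a leave-one-out baseline over k sampled completions per prompt; names and shapes are illustrative, not the paper's reference implementation.

```python
import torch

def rloo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE Leave-One-Out (RLOO) sketch for a single prompt.

    logprobs: [k] summed token log-probs of k sampled completions
    rewards:  [k] scalar reward-model scores for those completions
    Each sample's baseline is the mean reward of the other k-1 samples,
    so no learned value network (critic) is needed.
    """
    k = rewards.numel()
    baseline = (rewards.sum() - rewards) / (k - 1)   # leave-one-out mean
    advantages = rewards - baseline                  # centered per sample
    return -(advantages.detach() * logprobs).mean()  # REINFORCE objective
```

Compared with PPO there is no clipped ratio, no value head, and no GAE; the main extra cost over supervised fine-tuning is sampling k completions per prompt.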
citing papers explorer
-
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
A new image-bank harness combined with a closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
-
Mem-W: Latent Memory-Native GUI Agents
Mem-W embeds historical trajectories and working memory as compact latent tokens into GUI agents' continuous context via a trajectory-to-latent compressor, yielding up to +30 point gains on navigation benchmarks.
-
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
-
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
The cancellation hypothesis explains how rollout-level rewards produce token-level credit assignment in critic-free RL through the cancellation of opposing signals on tokens shared across rollouts, supported by empirical evidence and by batching interventions that improve performance.
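A toy sketch of the claimed mechanism (tokens and advantages below are made up): when a sequence-level advantage is broadcast to every token, contributions to tokens shared across rollouts cancel, leaving net credit only on the tokens that distinguish good rollouts from bad ones.

```python
# Two rollouts share tokens; their rollout-level advantages (+1 and -1)
# broadcast to every token, so shared tokens' contributions cancel while
# tokens unique to each rollout keep a net signal.
rollouts = [(["the", "good", "step"], +1.0), (["the", "bad", "step"], -1.0)]
credit: dict[str, float] = {}
for tokens, advantage in rollouts:
    for tok in tokens:
        credit[tok] = credit.get(tok, 0.0) + advantage
print(credit)  # {'the': 0.0, 'good': 1.0, 'step': 0.0, 'bad': -1.0}
```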
-
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
-
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
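A minimal sketch of the switching idea, assuming per-batch critic values and empirical returns; the threshold and exact rule here are illustrative assumptions, not the paper's criterion.

```python
import torch

def explained_variance(values: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # EV = 1 - Var(returns - values) / Var(returns); near 1 means the critic
    # tracks returns well, near or below 0 means it is uninformative.
    return 1.0 - torch.var(returns - values) / torch.var(returns).clamp_min(1e-8)

def pick_advantages(values, returns, threshold=0.5):
    # Hypothetical rule: trust the critic (PPO-style advantages) when
    # batch-level explained variance is high; otherwise fall back to the
    # batch-mean baseline (GRPO-style).
    if explained_variance(values, returns) > threshold:
        return returns - values
    return returns - returns.mean()
```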
-
Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
DReST training makes RL agents and LLMs neutral about trajectory lengths while remaining useful in pursuing their goals, and this neutrality generalizes to halve the probability of shutdown-influencing behavior in out-of-distribution tests.
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while showing some skills are internalized and others remain external.
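The leave-one-skill-out estimate can be sketched in a few lines; `evaluate` is a hypothetical callable returning a task score for a set of active skills.

```python
def marginal_contributions(skills: list[str], evaluate) -> dict[str, float]:
    # Leave-one-skill-out: a skill's marginal contribution is the drop in
    # task score when that skill alone is removed from the active set.
    full_score = evaluate(skills)
    return {
        skill: full_score - evaluate([s for s in skills if s != skill])
        for skill in skills
    }
```

Skills with near-zero or negative estimated contribution would then be natural candidates for the lifecycle operations the summary mentions.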
-
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer, an efficient contrastive method that recovers most RL gains at three orders of magnitude lower cost.
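Under the sparse-selection view, the positions where RL plausibly acts can be located by thresholding next-token entropy; a sketch (the 0.8 quantile is an arbitrary illustrative cut, not a value from the paper):

```python
import torch

def high_entropy_mask(logits: torch.Tensor, quantile: float = 0.8) -> torch.Tensor:
    # Next-token entropy at each position; keep only positions whose
    # entropy lies above a high quantile of the batch.
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy >= torch.quantile(entropy.flatten(), quantile)
```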
-
$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
$S^3$-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.
-
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
Certain errors in proxy rewards for policy gradient methods can be benign or even beneficial: they prevent policies from stalling on outputs with mediocre ground-truth rewards, improving RLHF metrics and informing reward design.
-
Ethics Testing: Proactive Identification of Generative AI System Harms
Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
-
ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning
ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.
-
Target Policy Optimization
TPO constructs a target distribution $q \propto \pi_{\text{old}} \cdot \exp(\text{utility})$ and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO, especially under sparse rewards.
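In symbols, a sketch of the stated construction ($U(x,y)$ is the utility and $Z(x)$ the normalizer; both symbols are notation assumed here):

```latex
q(y \mid x) \;=\; \frac{\pi_{\text{old}}(y \mid x)\,\exp\!\big(U(x, y)\big)}{Z(x)},
\qquad
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{y \sim q}\!\left[\log \pi_\theta(y \mid x)\right].
```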
-
Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.
-
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
-
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
VAPO achieves 60.4 on AIME 2024 with Qwen 32B, outperforming prior methods by over 10 points through targeted fixes for value bias, sequence length variation, and sparse rewards.
-
Muon is Scalable for LLM Training
The Muon optimizer, with weight decay and update scaling added, achieves roughly 2x computational efficiency over AdamW for large LLMs, demonstrated via the Moonlight MoE model (3B activated / 16B total parameters) trained on 5.7T tokens.
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
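In this line of work the implicit token-level reward typically takes a log-ratio form, sketched below ($\pi_\phi$ is the implicit PRM updated online, $\pi_{\text{ref}}$ a frozen reference, $\beta$ a scale; exact details may differ from the paper's):

```latex
r_\phi(y_t) \;=\; \beta \,\log \frac{\pi_\phi(y_t \mid \mathbf{x}, y_{<t})}{\pi_{\text{ref}}(y_t \mid \mathbf{x}, y_{<t})}
```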
-
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
RLVR exhibits implicit reward overfitting to its training data, and its optimization follows low-rank dynamics: updates concentrate on a heavy-tailed singular spectrum whose dominant rank-1 component carries the gains in reasoning capability.
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation
Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.
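A toy numeric illustration of that failure mode, assuming a group-mean baseline over four sampled solutions (numbers are made up):

```python
# With a group-mean baseline, pass-rate rewards can assign positive
# advantage to partially correct solutions, reinforcing code that still
# fails tests; a binary reward would push all failing samples down.
pass_rates = [1.0, 0.6, 0.0, 0.0]             # fraction of unit tests passed
baseline = sum(pass_rates) / len(pass_rates)  # 0.4
advantages = [r - baseline for r in pass_rates]
print(advantages)  # approx [0.6, 0.2, -0.4, -0.4]: the 60%-pass sample is reinforced
```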