Direct preference optimization: Your language model is secretly a reward model
10 Pith papers cite this work. Polarity classification is still indexing.
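For orientation, the title's claim refers to DPO's implicit reward: the log-probability ratio of the policy against a frozen reference model plays the role of a learned reward, so preferences are optimized directly without training a separate reward model. A minimal sketch of the DPO objective in its standard form (function and argument names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: beta-scaled log-ratios against the frozen reference
    act as implicit rewards for the chosen and rejected responses."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood on the implicit reward margin.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```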
citation-role / citation-polarity summary
years: 2026 (10)
roles: background (1)
polarities: background (1)
citing papers explorer
- From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
  Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance (sketch after the list).
- AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
  AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
- SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
  SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning (sketch after the list).
- On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
  FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization (sketch after the list).
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
  MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and a 4.6% overall improvement in simultaneous alignment.
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
  Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models, reshaping the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines (sketch after the list).
- G-Zero: Self-Play for Open-Ended Generation from Zero Data
  G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and a Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
- DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
  DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
- Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
  Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than assistants trained against role-playing simulators.
- Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
  Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and out-of-distribution tasks.
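For the CREDIT entry above, a hedged sketch of what an input-specific token credit might look like, assuming the input-specific component is obtained by subtracting a baseline estimated from contrastive (mismatched) inputs from the PMI-style credit under the true input; the function name, tensor shapes, and averaging baseline are illustrative assumptions, not the paper's definitions.

```python
import torch

def input_specific_token_credit(logp_true_input, logp_contrastive_inputs):
    """Illustrative only: per-token credit under the true input minus a
    baseline averaged over K contrastive (mismatched) inputs, keeping the
    input-specific component of a PMI-style token reward."""
    generic_credit = logp_contrastive_inputs.mean(dim=0)  # input-agnostic baseline
    return logp_true_input - generic_credit               # input-specific credit

# Toy usage: T=5 tokens, K=3 contrastive inputs.
credit = input_specific_token_credit(torch.randn(5), torch.randn(3, 5))
```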
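For the SyncDPO entry, a sketch of one plausible way to build on-the-fly negatives for audio-video synchronization preference pairs, assuming negatives come from temporally shifting the audio stream; the shift-based construction and the helper name are assumptions for illustration.

```python
import torch

def desync_negative(video_feats, audio_feats, shift=4):
    """One plausible on-the-fly negative for A/V sync preference learning
    (an assumption, not necessarily SyncDPO's construction): keep the
    aligned pair as the positive and roll the audio along the time axis
    to break synchronization for the negative."""
    audio_neg = torch.roll(audio_feats, shifts=shift, dims=0)
    positive = (video_feats, audio_feats)
    negative = (video_feats, audio_neg)
    return positive, negative
```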
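For the FATE entry, a sketch of the Pareto-filtering step over candidate repairs, assuming each repair is scored by verifiers along two axes, safety and helpfulness; the two dimensions and the helper are illustrative assumptions.

```python
def pareto_front(repairs):
    """Illustrative Pareto filter over candidate repairs, each scored as a
    (safety, helpfulness) pair by external verifiers. A repair is kept
    only if no other repair is at least as good on both scores and
    strictly better on at least one."""
    front = []
    for i, (s_i, h_i) in enumerate(repairs):
        dominated = any(
            s_j >= s_i and h_j >= h_i and (s_j > s_i or h_j > h_i)
            for j, (s_j, h_j) in enumerate(repairs) if j != i
        )
        if not dominated:
            front.append((s_i, h_i))
    return front

# Toy usage: (0.9, 0.6) and (0.7, 0.8) survive; (0.6, 0.5) is dominated.
print(pareto_front([(0.9, 0.6), (0.7, 0.8), (0.6, 0.5)]))
```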
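For the SLAS entry, a sketch contrasting standard GRPO group-normalized advantages with a hypothetical super-linear reshaping; the power-law form below is only a stand-in for the paper's advantage-dependent Fisher-Rao weighting, not its actual update.

```python
import torch

def grpo_advantages(group_rewards):
    """Standard GRPO-style advantages: rewards normalized within a group
    of samples drawn for the same prompt."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

def superlinear_shaping(advantages, p=1.5):
    """Hypothetical super-linear reshaping: amplify larger advantages more
    than proportionally (the exponent and functional form are assumptions)."""
    return torch.sign(advantages) * advantages.abs().pow(p)

# Toy usage over a group of 4 samples.
adv = grpo_advantages(torch.tensor([1.0, 0.2, 0.8, 0.1]))
shaped = superlinear_shaping(adv)
```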