Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

Anikait Singh; Archit Sharma; Aviral Kumar; Chelsea Finn; Fahim Tajwar; Jeff Schneider; Rafael Rafailov; Stefano Ermon; Tengyang Xie

arxiv: 2404.14367 · v3 · pith:QW36TVJHnew · submitted 2024-04-22 · 💻 cs.LG

Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar

keywords: fine-tuning, preference, learning, on-policy, approaches, data, different, likelihood

Learning from preference labels plays a crucial role in fine-tuning large language models. There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning. Different methods come with different implementation tradeoffs and performance differences, and existing empirical findings reach different conclusions: for instance, some results show that online RL is quite important to attain good fine-tuning results, while others find (offline) contrastive or even purely supervised methods sufficient. This raises a natural question: what kind of approaches are important for fine-tuning with preference data, and why? In this paper, we answer this question by performing a rigorous analysis of a number of fine-tuning techniques on didactic and full-scale LLM problems. Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood of certain responses (i.e., employ a "negative gradient") outperform offline and maximum likelihood objectives. We conceptualize our insights and unify methods that use on-policy sampling or negative gradient under a notion of mode-seeking objectives for categorical distributions. Mode-seeking objectives are able to alter probability mass on specific bins of a categorical distribution faster than maximum likelihood, allowing them to relocate mass across bins more effectively. Our analysis prescribes actionable insights for preference fine-tuning of LLMs and informs how data should be collected for maximal improvement.
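
The claim about relocating mass across bins can be illustrated with a toy sketch. It is not the paper's experimental setup: the five-bin categorical distribution, the initial logits, the learning rate, and the simplified contrastive objective `log p[preferred] - log p[rejected]` are all assumptions chosen for illustration. The sketch compares plain maximum likelihood on the preferred bin against the same update with an added negative-gradient term on the rejected bin.

```python
import math


def softmax(logits):
    """Convert logits into a categorical distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob_grad(probs, index):
    """Gradient of log p[index] with respect to the logits: onehot(index) - p."""
    grad = [-p for p in probs]
    grad[index] += 1.0
    return grad


def train(logits, preferred, rejected, steps, lr, use_negative_gradient):
    """Gradient ascent on a toy objective over a categorical distribution.

    Without the negative gradient, the objective is log p[preferred]
    (maximum likelihood). With it, the objective is the assumed contrastive
    stand-in log p[preferred] - log p[rejected].
    """
    z = list(logits)
    for _ in range(steps):
        p = softmax(z)
        grad = log_prob_grad(p, preferred)
        if use_negative_gradient:
            # Push down the rejected bin by subtracting its log-prob gradient.
            neg = log_prob_grad(p, rejected)
            grad = [g - n for g, n in zip(grad, neg)]
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return softmax(z)


# Hypothetical starting point: most mass sits on bin 0 (the rejected response).
initial_logits = [2.0, 0.0, 0.0, 0.0, 0.0]
p_mle = train(initial_logits, preferred=1, rejected=0, steps=10, lr=0.1,
              use_negative_gradient=False)
p_neg = train(initial_logits, preferred=1, rejected=0, steps=10, lr=0.1,
              use_negative_gradient=True)

print("log(p_preferred / p_rejected), maximum likelihood:",
      math.log(p_mle[1] / p_mle[0]))
print("log(p_preferred / p_rejected), with negative gradient:",
      math.log(p_neg[1] / p_neg[0]))
```

In this toy setting, each maximum likelihood step changes the logit margin between the preferred and rejected bins by `lr * (1 - p[preferred] + p[rejected])`, which is always below the `2 * lr` change per step from the contrastive update, so the log-ratio grows faster with the negative gradient. Whether that carries over to full-scale LLM fine-tuning is the question the paper studies, not something this sketch establishes.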

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement
cs.AI 2026-06 conditional novelty 7.0

Flow models reach 99.2% Sudoku accuracy in 7 passes and 96.1% on out-of-distribution Sudoku-Extreme by selecting dynamically stable candidates and training with self-conditioning plus DPO to avoid failed outputs.

Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs
cs.CV 2026-06 unverdicted novelty 6.0

ViPSy constructs policy-aligned and visually grounded preference pairs for VLMs via visual cues from image variants, yielding SOTA hallucination reductions of 35.7% on AMBER and 24.5% on Object HalBench.

Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss
cs.LG 2026-06 unverdicted novelty 6.0

Double preconditioning (DoPr) improves downstream task performance in test-time feedback settings without consistent gains in validation loss.

Consistency Training while Mitigating Obfuscation via Rate Matching
cs.CL 2026-06 unverdicted novelty 6.0

RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.

ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design
cs.LG 2026-05 unverdicted novelty 6.0

ProteinOPD uses token-level on-policy distillation from multiple preference-specific teacher models into a shared student to balance competing objectives in protein design, delivering gains on targets without losing d...

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
cs.LG 2026-05 unverdicted novelty 6.0

OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning polic...

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
cs.LG 2026-05 unverdicted novelty 6.0

OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
cs.LG 2026-05 conditional novelty 6.0

Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.

Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies
cs.AI 2024-12 unverdicted novelty 6.0

PGT optimizes latent goal embeddings for frozen policies via trajectory-level preference objectives, reporting 72-81.6% relative gains on 17 Minecraft tasks and 13.4% better OOD performance than fine-tuning.

Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
cs.LG 2026-05 unverdicted novelty 3.0

Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.