pith. machine review for the scientific record.

arxiv: 2505.19770 · v5 · submitted 2025-05-26 · 💻 cs.LG · cs.CL

Recognition: unknown

Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Authors on Pith: no claims yet
classification 💻 cs.LG cs.CL
keywords: rlhf · model · optimization · reward · learning · performance · policy · under
0 comments
read the original abstract

We present a fine-grained theoretical analysis of the performance gap between two-stage reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). Our study decomposes this gap into two sources: the explicit representation gap under exact optimization and the implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the quality of the final policies. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
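For context, these are the standard two-stage RLHF and DPO objectives that the paper contrasts — textbook formulations from the preference-learning literature, not equations reproduced from this paper. Here π_ref is the reference policy, β the KL coefficient, σ the logistic function, and (x, y_w, y_l) a prompt with preferred and dispreferred responses.

```latex
% Stage 1 (RLHF): fit a reward model r_phi under the Bradley-Terry preference model
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]

% Stage 2 (RLHF): KL-regularized policy optimization against the learned reward
\max_{\pi_\theta}\; \mathbb{E}_{x,\; y \sim \pi_\theta}\!\left[r_\phi(x, y)\right]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)

% DPO: optimize the policy directly on preference pairs, with no explicit reward model
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
```

The "explicit vs. implicit representation gap" in the abstract refers to whether the reward is represented by a separate model class (RLHF, stage 1) or only implicitly through the policy's log-ratio against π_ref (DPO).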

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DDO-RM: Distribution-Level Policy Improvement after Reward Learning

    stat.ML · 2026-04 · unverdicted · novelty 7.0

    DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M (a generic sketch of this kind of projection appears after this list).

  2. Reinforcement Learning from Human Feedback: A Statistical Perspective

    stat.ML · 2026-04 · accept · novelty 2.0

    A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.
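On the DDO-RM blurb above: the closed-form target of a KL-regularized objective over a finite candidate set is the reward-tilted distribution p*(y) ∝ π_ref(y) exp(r(y)/β), and mirror descent with KL geometry reduces to multiplicative-weight updates. Below is a minimal, hypothetical NumPy sketch of that generic recipe — it illustrates the idea referenced in the blurb, not the DDO-RM algorithm itself; all names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def kl_regularized_mirror_descent(ref_probs, rewards, beta=1.0, lr=0.1, steps=100):
    """Generic sketch (not DDO-RM): improve a distribution p over a finite
    candidate set by mirror descent on the KL-regularized objective
        J(p) = <p, rewards> - beta * KL(p || ref_probs).
    With KL geometry, each step is an exponentiated-gradient (multiplicative
    weights) update followed by renormalization on the simplex."""
    p = ref_probs.copy()
    for _ in range(steps):
        # Gradient of J w.r.t. p (the simplex constraint only adds a constant).
        grad = rewards - beta * (np.log(p) - np.log(ref_probs) + 1.0)
        # Mirror-descent step in KL geometry = multiplicative update.
        p = p * np.exp(lr * grad)
        p /= p.sum()
    return p

# The fixed point matches the closed-form tilted distribution
# p*(y) ∝ ref_probs(y) * exp(rewards(y) / beta).
ref = np.array([0.5, 0.3, 0.2])
r = np.array([1.0, 2.0, 0.5])
print(kl_regularized_mirror_descent(ref, r, beta=1.0))
print(ref * np.exp(r / 1.0) / (ref * np.exp(r / 1.0)).sum())
```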