
arxiv: 2410.15096 · v1 · pith:5P77XDVYnew · submitted 2024-10-19 · 💻 cs.AI

GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets

classification 💻 cs.AI
keywords human · preference · gdpo · offline · reward · alignment · directly · generation

A critical component of the current generation of language models is preference alignment, which aims to precisely control the model's behavior to meet human needs and values. The most notable among such methods are Reinforcement Learning with Human Feedback (RLHF) and its offline variant Direct Preference Optimization (DPO), both of which seek to maximize a reward model based on human preferences. In particular, DPO derives reward signals directly from the offline preference data, but in doing so it overfits to the reward signals and generates suboptimal responses that may contain human biases present in the dataset. In this work, we propose a practical application of a diversity-seeking RL algorithm called GFlowNet-DPO (GDPO) in an offline preference alignment setting to curtail such challenges. Empirical results show that, in dialog generation and summarization tasks, GDPO can generate far more diverse responses than the baseline methods while remaining relatively aligned with human values.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

    cs.LG 2026-06 unverdicted novelty 6.0

    Replaces scalar reward with a distribution over reward functions and applies a non-linear objective over action sets to induce controllable diversity in contextual bandit RL, generalizing policy gradient methods.