pith. machine review for the scientific record.

arxiv: 2509.19104 · v2 · submitted 2025-09-23 · 💻 cs.LG · stat.ML

Recognition: unknown

Online Distributionally Robust LLM Alignment via Regression to Relative Reward

Authors on Pith: no claims yet
classification: 💻 cs.LG · stat.ML
keywords: robust · alignment · rlhf · distributionally · dro-dpo · dro-rebel · existing · human
abstract

Reinforcement Learning with Human Feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from overoptimization, where language models degrade by overfitting inaccuracies and drifting from preferred behaviors observed during training. Distributionally robust optimization (DRO) is a natural solution, but existing DRO-DPO methods are sample-inefficient, ignore heterogeneous preferences, and lean on brittle heuristics. We introduce \emph{DRO-REBEL}, a family of robust online REBEL updates built on type-$p$ Wasserstein, Kullback-Leibler (KL), and $\chi^2$ ambiguity sets. Strong duality reduces each update to a relative-reward regression, retaining REBEL's scalability without PPO-style clipping or value networks. Under linear rewards, log-linear policies, and a standard coverage condition, we prove $\widetilde{O}(\sqrt{d/n})$ bounds on squared parameter error, with sharper constants than prior DRO-DPO analyses, and give the first parametric $\widetilde{O}(d/n)$ rate for DRO-based alignment under preference shift, matching non-robust RLHF in benign regimes. Each divergence yields a tractable SGD-based algorithm: gradient regularization for Wasserstein, importance weighting for KL, and a 1-D dual solve for $\chi^2$. On Emotion Alignment, the ArmoRM multi-objective benchmark, and HH-Alignment, DRO-REBEL outperforms prior robust and non-robust baselines across unseen preference mixtures, model sizes, and dataset scales.
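The abstract's two most concrete algorithmic claims, REBEL's relative-reward regression and the 1-D dual solve for the $\chi^2$ ambiguity set, can be sketched directly. Below is a minimal numpy sketch assuming per-pair log-probabilities and rewards are precomputed; the function names are illustrative, and the dual uses the standard Cressie-Read form (Duchi & Namkoong), whose constants may differ from the paper's exact formulation.

```python
# Minimal sketch of one DRO-REBEL-style update (chi^2 ambiguity set).
# Assumes per-pair log-probabilities and rewards are precomputed;
# all names here are illustrative, not the paper's implementation.
import numpy as np
from scipy.optimize import minimize_scalar

def rebel_residuals(logp_new_a, logp_new_b, logp_old_a, logp_old_b,
                    reward_a, reward_b, eta):
    """REBEL's relative-reward regression residual: the change in relative
    log-likelihood between responses a and b, scaled by 1/eta, should match
    the relative reward r(a) - r(b)."""
    pred = ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)) / eta
    return pred - (reward_a - reward_b)

def chi2_dro_objective(losses, rho):
    """Worst-case mean of per-pair losses over a chi^2 ball of radius rho,
    computed by a 1-D dual solve:
        inf_shift  sqrt(2*rho + 1) * ||(loss - shift)_+||_2 + shift
    (Cressie-Read dual; the paper's exact constants may differ)."""
    def dual(shift):
        excess = np.maximum(losses - shift, 0.0)
        return np.sqrt(2.0 * rho + 1.0) * np.sqrt(np.mean(excess ** 2)) + shift
    res = minimize_scalar(dual, bounds=(losses.min() - 1.0, losses.max() + 1.0),
                          method="bounded")
    return res.fun

# Toy usage on synthetic data: the robust objective upper-bounds the plain mean.
rng = np.random.default_rng(0)
n = 256
logp_old = rng.normal(size=(n, 2))
logp_new = logp_old + 0.1 * rng.normal(size=(n, 2))
rewards = rng.normal(size=(n, 2))
res = rebel_residuals(logp_new[:, 0], logp_new[:, 1],
                      logp_old[:, 0], logp_old[:, 1],
                      rewards[:, 0], rewards[:, 1], eta=1.0)
losses = res ** 2
print(losses.mean(), chi2_dro_objective(losses, rho=0.5))
```

Per the abstract, the KL variant would instead importance-weight these residuals, and the Wasserstein variant adds gradient regularization rather than reweighting.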

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...

  2. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks (see the CCC sketch after this list).
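Both citing summaries rest on a concordance correlation coefficient (CCC) reward. For reference, here is a minimal sketch of the standard CCC computed batch-wise, as those summaries describe; the function name is hypothetical and the wiring into a GRPO reward is not shown.

```python
# Hypothetical sketch of a CCC-style reward over a batch of predictions.
import numpy as np

def ccc_reward(preds, targets):
    """Concordance correlation coefficient between batch predictions and
    targets: 2*cov / (var_p + var_t + (mean_p - mean_t)^2), in [-1, 1].
    Penalizes both low correlation and systematic shift/scale error."""
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    mp, mt = preds.mean(), targets.mean()
    vp, vt = preds.var(), targets.var()
    cov = ((preds - mp) * (targets - mt)).mean()
    return 2.0 * cov / (vp + vt + (mp - mt) ** 2)

print(ccc_reward([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # close to 1
```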