pith. machine review for the scientific record.

arxiv: 2509.19104 · v2 · submitted 2025-09-23 · 💻 cs.LG · stat.ML

Recognition: unknown

Online Distributionally Robust LLM Alignment via Regression to Relative Reward

Authors on Pith: no claims yet
classification: 💻 cs.LG · stat.ML
keywords: robust · alignment · rlhf · distributionally · dro-dpo · dro-rebel · existing · human
abstract

Reinforcement Learning with Human Feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from overoptimization, where language models degrade by overfitting inaccuracies and drifting from preferred behaviors observed during training. Distributionally robust optimization (DRO) is a natural solution, but existing DRO-DPO methods are sample-inefficient, ignore heterogeneous preferences, and lean on brittle heuristics. We introduce \emph{DRO-REBEL}, a family of robust online REBEL updates built on type-$p$ Wasserstein, Kullback-Leibler (KL), and $\chi^2$ ambiguity sets. Strong duality reduces each update to a relative-reward regression, retaining REBEL's scalability without PPO-style clipping or value networks. Under linear rewards, log-linear policies, and a standard coverage condition, we prove $\widetilde{O}(\sqrt{d/n})$ bounds on squared parameter error, with sharper constants than prior DRO-DPO analyses, and give the first parametric $\widetilde{O}(d/n)$ rate for DRO-based alignment under preference shift, matching non-robust RLHF in benign regimes. Each divergence yields a tractable SGD-based algorithm: gradient regularization for Wasserstein, importance weighting for KL, and a 1-D dual solve for $\chi^2$. On Emotion Alignment, the ArmoRM multi-objective benchmark, and HH-Alignment, DRO-REBEL outperforms prior robust and non-robust baselines across unseen preference mixtures, model sizes, and dataset scales.
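The abstract's two most concrete algorithmic claims, REBEL's relative-reward regression and the 1-D dual solve for the $\chi^2$ ambiguity set, can be sketched directly. Below is a minimal numpy sketch assuming per-pair log-probabilities and rewards are precomputed; the function names are illustrative, and the dual uses the standard Cressie-Read form (Duchi & Namkoong), whose constants may differ from the paper's exact formulation.

```python
# Minimal sketch of one DRO-REBEL-style update (chi^2 ambiguity set).
# Assumes per-pair log-probabilities and rewards are precomputed;
# all names here are illustrative, not the paper's implementation.
import numpy as np
from scipy.optimize import minimize_scalar

def rebel_residuals(logp_new_a, logp_new_b, logp_old_a, logp_old_b,
                    reward_a, reward_b, eta):
    """REBEL's relative-reward regression residual: the change in relative
    log-likelihood between responses a and b, scaled by 1/eta, should match
    the relative reward r(a) - r(b)."""
    pred = ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)) / eta
    return pred - (reward_a - reward_b)

def chi2_dro_objective(losses, rho):
    """Worst-case mean of per-pair losses over a chi^2 ball of radius rho,
    computed by a 1-D dual solve:
        inf_shift  sqrt(2*rho + 1) * ||(loss - shift)_+||_2 + shift
    (Cressie-Read dual; the paper's exact constants may differ)."""
    def dual(shift):
        excess = np.maximum(losses - shift, 0.0)
        return np.sqrt(2.0 * rho + 1.0) * np.sqrt(np.mean(excess ** 2)) + shift
    res = minimize_scalar(dual, bounds=(losses.min() - 1.0, losses.max() + 1.0),
                          method="bounded")
    return res.fun

# Toy usage on synthetic data: the robust objective upper-bounds the plain mean.
rng = np.random.default_rng(0)
n = 256
logp_old = rng.normal(size=(n, 2))
logp_new = logp_old + 0.1 * rng.normal(size=(n, 2))
rewards = rng.normal(size=(n, 2))
res = rebel_residuals(logp_new[:, 0], logp_new[:, 1],
                      logp_old[:, 0], logp_old[:, 1],
                      rewards[:, 0], rewards[:, 1], eta=1.0)
losses = res ** 2
print(losses.mean(), chi2_dro_objective(losses, rho=0.5))
```

Per the abstract, the KL variant would instead importance-weight these residuals, and the Wasserstein variant adds gradient regularization rather than reweighting.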

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...

  2. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks (see the CCC sketch after this list).
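Both citing summaries rest on a concordance correlation coefficient (CCC) reward. For reference, here is a minimal sketch of the standard CCC computed batch-wise, as those summaries describe; the function name is hypothetical and the wiring into a GRPO reward is not shown.

```python
# Hypothetical sketch of a CCC-style reward over a batch of predictions.
import numpy as np

def ccc_reward(preds, targets):
    """Concordance correlation coefficient between batch predictions and
    targets: 2*cov / (var_p + var_t + (mean_p - mean_t)^2), in [-1, 1].
    Penalizes both low correlation and systematic shift/scale error."""
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    mp, mt = preds.mean(), targets.mean()
    vp, vt = preds.var(), targets.var()
    cov = ((preds - mp) * (targets - mt)).mean()
    return 2.0 * cov / (vp + vt + (mp - mt) ** 2)

print(ccc_reward([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # close to 1
```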