pith. sign in

arxiv: 2602.23116 · v3 · pith:HM2I5O7Cnew · submitted 2026-02-26 · 💻 cs.LG · cs.GT· stat.ML

Provably Efficient Regularized Online RLHF with Generalized Bilinear Preferences

classification 💻 cs.LG cs.GTstat.ML
keywords genericpreferencesregretunderbilinearemphfastgbpm
0
0 comments X
read the original abstract

We consider the problem of regularized best-response max-regret minimization in online RLHF under general preferences and bandit feedback. While various regularizers are utilized to robustify alignment, known polylogarithmic regret guarantees remain heavily specific to KL. To investigate whether such fast rates extend beyond KL, we adopt the Generalized Bilinear Preference Model (GBPM) -- capturing intransitive preferences over $d$-dimensional item-wise features via a rank-$2r$ skew-symmetric matrix -- to isolate the impact of generic regularization. Crucially, under GBPM, we prove that the dual gap of any greedy policy is bounded by the squared estimation error, derived using \emph{only} strong convexity and skew-symmetry. Under a feature coverage assumption, we establish a \emph{generic} polylogarithmic regret of $\tilde{\mathcal{O}}(\eta d^4 C_{\min}^{-1} (\log T)^2 \wedge d^2 C_{\min}^{-1/2} \sqrt{T})$ with Greedy Sampling, and a dimension-wise improved regret (for well-conditioned arm-sets) of $\tilde{\mathcal{O}}(C_{\min}^{-2} \sqrt{\eta r T} \wedge r^{1/3} C_{\min}^{-4/3} T^{2/3})$ with Explore-Then-Commit, where $\eta^{-1}$ is the regularization coefficient, $T$ is the time horizon, and $C_{\min}$ is an arm-set dependent quantity. This demonstrates that ``fast'' regrets are not KL-specific, but rather a fundamental consequence of generic strongly convex geometry.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification

    cs.LG 2026-06 unverdicted novelty 7.0

    Introduces KL misspecification for bandits and RL under function approximation and proves explicit KL-regret bounds for regression-based Gibbs algorithms that recover the realizable case.

  2. Efficient Exploration for Iterative Nash Preference Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    An explicitly exploratory iterative NLHF method achieves O(sqrt(T)) regret for Nash equilibria under general preference models, removing the exponential KL dependence that plagues standard iterative approaches.

  3. Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

    cs.LG 2026-05 unverdicted novelty 7.0

    The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...