pith. machine review for the scientific record.

arxiv: 2504.19342 · v3 · submitted 2025-04-27 · 📊 stat.ML · cs.LG · stat.ME

Recognition: unknown

Contextual Online Uncertainty-Aware Preference Learning for Human Feedback

Ethan Lee, Ethan X. Fang, Junwei Lu, Nan Lu

Authors on Pith: no claims yet
classification 📊 stat.ML · cs.LG · stat.ME
keywords human · preference · language · large · models · online · aspect · asymptotic
0 comments
read the original abstract

Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm in artificial intelligence for aligning large models with human preferences. In this paper, we propose a novel statistical framework that simultaneously conducts online decision-making and statistical inference on the optimal model using human preference data based on dynamic contextual information. Our approach introduces an efficient decision strategy that achieves both the optimal regret bound and the asymptotic distribution of the estimators. A key challenge in RLHF is handling dependent online human preference outcomes with dynamic contexts. To address this, on the methodological side, we propose a two-stage algorithm starting with $\epsilon$-greedy exploration followed by exploitation; on the theoretical side, we tailor anti-concentration inequalities and matrix martingale concentration techniques to derive the uniform estimation rate and asymptotic normality of the estimators using dependent samples from both stages. Extensive simulation results demonstrate that our method outperforms state-of-the-art strategies. We apply the proposed framework to analyze human preference data for ranking large language models on the Massive Multitask Language Understanding dataset, yielding insightful results on the performance of different large language models for medical anatomy knowledge.
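The two-stage scheme in the abstract can be illustrated with a minimal sketch: during an initial exploration phase, actions are chosen uniformly at random and pairwise preference outcomes are collected; a Bradley-Terry-style logistic model is then fit to the observed comparisons, and the remaining rounds exploit the estimate greedily. Everything below (dimensions, horizon, the gradient-ascent fitter, the true parameter `theta_star`) is a hypothetical toy setup, not the paper's algorithm or theory.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, T0 = 3, 2000, 500                  # feature dim, horizon, exploration length
theta_star = np.array([1.0, -0.5, 0.8])  # unknown preference parameter (toy)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_bradley_terry(X, y, iters=300, lr=0.5):
    """Gradient ascent on the Bradley-Terry / logistic log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        theta += lr * X.T @ (y - p) / len(y)
    return theta

X_hist, y_hist = [], []
theta_hat = np.zeros(d)
for t in range(T):
    phi = rng.normal(size=(2, d))        # features of two candidate actions
    if t < T0:
        a = int(rng.integers(2))         # stage 1: explore uniformly
    else:
        a = int(np.argmax(phi @ theta_hat))  # stage 2: exploit the estimate
    diff = phi[a] - phi[1 - a]
    # preference outcome: chosen action preferred w.p. sigmoid(diff @ theta_star)
    y = float(rng.random() < sigmoid(diff @ theta_star))
    X_hist.append(diff)
    y_hist.append(y)
    if t == T0 - 1:                      # fit once at the end of exploration
        theta_hat = fit_bradley_terry(np.array(X_hist), np.array(y_hist))
```

With enough exploration rounds, `theta_hat` points in roughly the same direction as `theta_star`, so the exploitation phase selects the preferred action most of the time; the paper's contribution is the regret and inference theory for such dependent two-stage samples, which this sketch does not attempt.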

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Perturbation is All You Need for Extrapolating Language Models

    stat.ML 2026-05 unverdicted novelty 6.0

    Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.

  2. Reinforcement Learning from Human Feedback: A Statistical Perspective

    stat.ML 2026-04 accept novelty 2.0

    A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.