Cited by: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning (2025). Nature 645, 633--638.
Pith papers citing this work: 1
Fields: stat.ML
Years: 2026
Verdicts: ACCEPT
Representative citing papers: 1
Reinforcement Learning from Human Feedback: A Statistical Perspective
A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.
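The Bradley-Terry-Luce model named in the summary is the standard way RLHF pipelines turn pairwise human preferences into a reward-learning objective: the probability that one response is preferred over another is a logistic function of their reward difference. A minimal sketch (the function name and rewards here are illustrative, not from the survey):

```python
import math

def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Probability the 'chosen' response is preferred under Bradley-Terry:
    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

# Equal rewards give an even 50/50 preference.
print(bradley_terry_prob(1.0, 1.0))  # → 0.5

# A higher reward for the chosen response pushes the probability toward 1.
print(bradley_terry_prob(3.0, 1.0))
```

Maximizing the log of this probability over a dataset of human-labeled pairs is what fits the reward model that downstream policy optimization then uses.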