Reinforcement Learning from Human Feedback: A Statistical Perspective
Pith reviewed 2026-05-13 20:07 UTC · model grok-4.3
The pith
RLHF models noisy human preferences as observations of latent utilities, estimating reward functions that align large language model policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that RLHF is best understood as a statistical estimation problem in which human preferences supply noisy observations of an underlying reward function, typically modeled by the Bradley-Terry-Luce framework, and that both two-stage and direct policy optimization methods can be analyzed and improved using principles from experimental design and uncertainty quantification.
What carries the argument
The Bradley-Terry-Luce model, which treats observed pairwise preferences as probabilistic outcomes driven by latent reward differences, supplies the mechanism for turning preference data into estimated reward functions that then guide policy optimization.
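That mechanism can be made concrete in a few lines. The sketch below (our illustration, not code from the paper or its demo repository; all names and hyperparameters are assumptions) simulates pairwise comparisons from latent rewards and recovers them by gradient ascent on the BTL log-likelihood:

```python
import numpy as np

# Illustrative BTL fit (not from the paper): item i beats item j with
# probability sigmoid(r_i - r_j); we recover r from sampled comparisons
# by gradient ascent on the log-likelihood.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_btl(winners, losers, n_items, lr=1.0, steps=500):
    """MLE of BTL rewards from index arrays of winners and losers."""
    r = np.zeros(n_items)
    for _ in range(steps):
        p = sigmoid(r[winners] - r[losers])   # model prob. winner beats loser
        grad = np.zeros(n_items)
        np.add.at(grad, winners, 1.0 - p)     # d log-lik / d r_winner
        np.add.at(grad, losers, -(1.0 - p))   # d log-lik / d r_loser
        r += lr * grad / len(winners)
        r -= r.mean()                         # pin the additive-shift gauge
    return r

rng = np.random.default_rng(0)
true_r = np.array([1.0, 0.0, -1.0])
i = rng.integers(0, 3, size=3000)
j = (i + rng.integers(1, 3, size=3000)) % 3   # ensures j != i
i_wins = rng.random(3000) < sigmoid(true_r[i] - true_r[j])
winners = np.where(i_wins, i, j)
losers = np.where(i_wins, j, i)

r_hat = fit_btl(winners, losers, 3)           # recovers the ordering of true_r
```

With a few thousand comparisons the estimated rewards match the true ordering; the mean-centering step is needed because BTL identifies rewards only up to an additive constant.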
If this is right
- Reward models estimated under Bradley-Terry-Luce assumptions produce policies whose performance scales with the volume and quality of preference data collected via active learning.
- Direct preference optimization bypasses explicit reward modeling yet inherits the same statistical identifiability conditions as two-stage RLHF.
- Uncertainty-aware data collection reduces the number of human queries needed to reach a target alignment level.
- Extensions to verifiable rewards replace subjective preferences with objective correctness signals, altering the statistical estimation target from latent utilities to observable outcomes.
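The second bullet can be seen directly in the per-pair DPO objective, which is the BTL negative log-likelihood evaluated at an implicit reward. The sketch below is ours, with illustrative numbers, not the paper's code:

```python
import numpy as np

# Per-pair DPO loss sketch (illustrative): the BTL negative log-likelihood
# with the implicit reward beta * (log pi(y|x) - log pi_ref(y|x)).

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_* are policy log-probs of the preferred (w) / rejected (l)
    response; ref_logp_* are the frozen reference model's log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.log1p(np.exp(-margin))          # equals -log sigmoid(margin)

# At initialization (policy == reference) the margin is 0 and the loss is
# log 2; it falls as the policy shifts mass toward the preferred response.
loss_init = dpo_loss(-11.0, -11.0, -11.0, -11.0)
loss_later = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Because the same BTL likelihood sits inside the loss, identifiability issues in the reward differences carry over unchanged to the one-stage method.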
Where Pith is reading between the lines
- Statistical analysis of RLHF pipelines could guide the design of benchmarks that measure not only final performance but also the stability of learned rewards across different preference distributions.
- Inference-time algorithms may allow dynamic adjustment of policies without full retraining, provided uncertainty estimates remain reliable at deployment.
- Heterogeneous feedback across user groups suggests that mixture or hierarchical extensions to the Bradley-Terry-Luce model would be a natural next modeling step.
Load-bearing premise
That standard preference models such as Bradley-Terry-Luce and associated uncertainty methods are sufficient to capture the real variability and heterogeneity in human judgments.
What would settle it
A large-scale preference dataset in which the Bradley-Terry-Luce model yields systematically biased reward estimates that cannot be corrected by added noise terms or simple mixture extensions, leading to measurably worse policy alignment than alternative models.
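A toy version of such a test is easy to simulate: when two annotator groups hold opposing utilities, the pooled win rate sits near 0.5, so a single BTL fit reports near-indifference while each group is strongly decided. Everything below is our illustrative construction, not data from the paper:

```python
import numpy as np

# Toy check of the load-bearing premise: two annotator groups with opposing
# reward gaps on one response pair. A single pooled BTL fit hides the
# heterogeneity that a per-group fit exposes.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
gap_a, gap_b = 2.0, -2.0              # group A prefers item 0; group B item 1
n = 4000
group = rng.random(n) < 0.5           # which group each annotator belongs to
gaps = np.where(group, gap_a, gap_b)
prefers_0 = rng.random(n) < sigmoid(gaps)

# For a single pair, the pooled BTL MLE of the gap is the logit of the
# pooled win rate -- here close to 0, i.e. "no preference".
pooled_rate = prefers_0.mean()
pooled_gap = np.log(pooled_rate / (1 - pooled_rate))

# Per-group win rates reveal the two decisive subpopulations.
rate_a = prefers_0[group].mean()
rate_b = prefers_0[~group].mean()
```

A simple two-component mixture corrects this case, so the decisive evidence the section describes would need bias that survives such extensions.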
Original abstract
Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it relies on noisy, subjective, and often heterogeneous feedback to learn reward models and optimize policies. This survey provides a statistical perspective on RLHF, focusing primarily on the LLM alignment setting. We introduce the main components of RLHF, including supervised fine-tuning, reward modeling, and policy optimization, and relate them to familiar statistical ideas such as Bradley-Terry-Luce (BTL) model, latent utility estimation, active learning, experimental design, and uncertainty quantification. We review methods for learning reward functions from pairwise preference data and for optimizing policies through both two-stage RLHF pipelines and emerging one-stage approaches such as direct preference optimization. We further discuss recent extensions including reinforcement learning from AI feedback, inference-time algorithms, and reinforcement learning from verifiable rewards, as well as benchmark datasets, evaluation protocols, and open-source frameworks that support RLHF research. We conclude by highlighting open challenges in RLHF. An accompanying GitHub demo https://github.com/Pangpang-Liu/RLHF_demo illustrates key components of the RLHF pipeline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This survey provides a statistical perspective on RLHF for aligning large language models. It introduces the core components—supervised fine-tuning, reward modeling from pairwise preferences, and policy optimization—and maps them to established statistical frameworks including the Bradley-Terry-Luce model, latent utility estimation, active learning, experimental design, and uncertainty quantification. The paper reviews both two-stage RLHF pipelines and one-stage alternatives such as direct preference optimization, covers extensions including RL from AI feedback, inference-time methods, and RL from verifiable rewards, and discusses benchmarks, evaluation protocols, and open-source tools. It concludes by highlighting open challenges in the area, accompanied by a GitHub demonstration repository.
Significance. If the mappings and reviews hold, the paper offers a timely synthesis that connects practical RLHF techniques to statistical principles for handling noisy and heterogeneous feedback. This organization can help researchers identify connections to active learning and uncertainty quantification, while the explicit discussion of open challenges and the accompanying demo support accessibility and future work in the field.
Minor comments (2)
- The abstract provides the GitHub demo URL, but repeating the exact link in the conclusion or a dedicated resources section would improve reader convenience.
- In the review of reward modeling methods, ensure that the presentation of the BTL model includes a brief reminder of its key identifiability assumptions to aid readers less familiar with the statistical literature.
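The identifiability caveat in the second comment fits in a one-screen illustration (ours, not the paper's): BTL probabilities depend only on reward differences, so the likelihood is invariant to shifting every reward by a constant, and a normalization such as sum-to-zero or a pinned reference item is required.

```python
import numpy as np

# BTL identifiability in miniature: adding any constant to all rewards
# leaves every pairwise comparison probability unchanged.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

r = np.array([0.7, 0.0, -0.7])
shifted = r + 123.0                    # arbitrary additive shift, same model

p = sigmoid(r[:, None] - r[None, :])                 # all pairwise probs
p_shifted = sigmoid(shifted[:, None] - shifted[None, :])
```

Since `p` and `p_shifted` are identical, only reward differences are estimable from preference data alone.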
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and the recommendation to accept. The review accurately captures the paper's focus on connecting RLHF components to statistical models such as the Bradley-Terry-Luce framework, active learning, and uncertainty quantification, as well as its coverage of both established pipelines and recent extensions.
Circularity Check
No significant circularity: survey organizes cited literature without new derivations
Full rationale
This is a survey paper that introduces RLHF components and relates them to established statistical ideas (BTL model, active learning, uncertainty quantification) drawn from prior literature. It reviews methods for reward modeling and policy optimization, discusses extensions, and flags open challenges without presenting original theorems, proofs, fitted parameters, or predictions. No equations reduce to self-inputs by construction, no self-citations serve as load-bearing uniqueness claims, and no ansatzes or renamings are smuggled in. The central contribution is organizational and expository, remaining self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Human preferences can be modeled using the Bradley-Terry-Luce model for pairwise comparisons.
Forward citations
Cited by 3 Pith papers
- Variance-aware Reward Modeling with Anchor Guidance: Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...
- Perturbation is All You Need for Extrapolating Language Models: Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.
- Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback: DRRO for RLHF replaces worst-case value with worst-case regret in Wasserstein DRO, producing an exact water-filling solution under l1 ambiguity and a practical sampled-bonus algorithm that reduces proxy over-optimization.