pith. machine review for the scientific record.

arxiv: 2604.02507 · v1 · submitted 2026-04-02 · 📊 stat.ML · cs.LG

Recognition: no theorem link

Reinforcement Learning from Human Feedback: A Statistical Perspective

Chengchun Shi, Pangpang Liu, Will Wei Sun

Pith reviewed 2026-05-13 20:07 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords reinforcement learning from human feedback · reward modeling · Bradley-Terry-Luce model · LLM alignment · policy optimization · preference data · statistical estimation · direct preference optimization

The pith

RLHF treats noisy human preferences as comparisons of latent utilities, using them to estimate rewards and align large language model policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey examines reinforcement learning from human feedback through statistical methods, showing how pairwise preference data are used to learn reward functions and optimize policies for aligning large language models. It links core RLHF steps to classical tools including the Bradley-Terry-Luce model for preference probabilities, latent utility estimation, active learning for data collection, and uncertainty quantification for robust optimization. The review covers both conventional two-stage pipelines and direct one-stage methods such as direct preference optimization, plus extensions to AI feedback and verifiable rewards, while identifying open problems in handling heterogeneous and subjective feedback.

Core claim

The paper claims that RLHF is best understood as a statistical estimation problem in which human preferences supply noisy observations of an underlying reward function, typically modeled by the Bradley-Terry-Luce framework, and that both two-stage and direct policy optimization methods can be analyzed and improved using principles from experimental design and uncertainty quantification.

What carries the argument

The Bradley-Terry-Luce model, which treats observed pairwise preferences as probabilistic outcomes driven by latent reward differences, supplies the mechanism for turning preference data into estimated reward functions that then guide policy optimization.
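To make the mechanism concrete, here is a minimal sketch of BTL reward estimation on synthetic pairwise preferences. The linear reward r(x, y) = w · φ(x, y), the feature dimensions, and the gradient-ascent loop are illustrative assumptions, not the survey's implementation.

```python
# Minimal sketch of Bradley-Terry-Luce (BTL) reward estimation from pairwise
# preferences. Toy assumptions (not from the paper): each prompt-response pair
# is summarized by a feature vector phi, the reward is linear,
# r(x, y) = w . phi(x, y), and preferences follow the BTL likelihood
# P(winner > loser) = sigma(r_winner - r_loser).
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000
w_true = rng.normal(size=d)                      # latent "true" reward weights

phi_w = rng.normal(size=(n, d))                  # features of labeled winners
phi_l = rng.normal(size=(n, d))                  # features of labeled losers
logits = (phi_w - phi_l) @ w_true
keep = rng.random(n) < 1.0 / (1.0 + np.exp(-logits))
# Where the nominal loser would actually be preferred under BTL noise, swap
# the pair so that the labels are consistent with the generative model.
swap = ~keep
phi_w[swap], phi_l[swap] = phi_l[swap].copy(), phi_w[swap].copy()

def fit_btl(phi_w, phi_l, lr=0.5, iters=300):
    """Maximize the average log-likelihood log sigma(w . (phi_w - phi_l))."""
    diff = phi_w - phi_l
    w = np.zeros(diff.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))    # P(winner beats loser | w)
        w += lr * diff.T @ (1.0 - p) / len(diff) # gradient of the mean log-lik.
    return w

w_hat = fit_btl(phi_w, phi_l)
# Compare the direction of the fitted weights with the ground truth; the
# magnitude may not have fully converged after a fixed number of steps.
cos = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true))
print(f"cosine similarity between estimated and true reward weights: {cos:.3f}")
```

The estimated reward function would then score candidate responses during policy optimization; everything downstream inherits whatever error this estimation step leaves behind.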

If this is right

  • Reward models estimated under Bradley-Terry-Luce assumptions produce policies whose performance scales with the volume and quality of preference data collected via active learning.
  • Direct preference optimization bypasses explicit reward modeling yet inherits the same statistical identifiability conditions as two-stage RLHF (a minimal loss sketch follows after this list).
  • Uncertainty-aware data collection reduces the number of human queries needed to reach a target alignment level.
  • Extensions to verifiable rewards replace subjective preferences with objective correctness signals, altering the statistical estimation target from latent utilities to observable outcomes.
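
As a companion to the second bullet, a minimal sketch of the direct preference optimization loss, written in terms of policy and reference log-probabilities. The tensor values and the choice of beta below are illustrative stand-ins, not values from the paper.

```python
# Minimal sketch of the direct preference optimization (DPO) loss, assuming
# per-response log-probabilities that have already been summed over tokens.
# The tensors below stand in for a trainable policy and a frozen reference
# model; they are toy numbers, not outputs of any real model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(beta * [(log pi_w - log ref_w) - (log pi_l - log ref_l)]).

    The implicit reward is beta * log(pi / pi_ref), so the objective is the
    same BTL likelihood as in the two-stage pipeline, written in the policy.
    """
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy batch of 4 preference pairs (winner / loser log-probabilities).
policy_w = torch.tensor([-12.0, -9.5, -14.2, -11.0], requires_grad=True)
policy_l = torch.tensor([-13.1, -10.0, -13.8, -12.5], requires_grad=True)
ref_w = torch.tensor([-12.5, -9.8, -14.0, -11.3])
ref_l = torch.tensor([-12.9, -10.1, -14.1, -12.0])

loss = dpo_loss(policy_w, policy_l, ref_w, ref_l)
loss.backward()  # raises log pi on preferred responses, lowers it on rejected ones
print(f"DPO loss on the toy batch: {loss.item():.4f}")
```

Because the implicit reward reproduces the BTL likelihood, the identifiability conditions discussed for two-stage reward modeling carry over to the one-stage objective.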

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Statistical analysis of RLHF pipelines could guide the design of benchmarks that measure not only final performance but also the stability of learned rewards across different preference distributions.
  • Inference-time algorithms may allow dynamic adjustment of policies without full retraining, provided uncertainty estimates remain reliable at deployment.
  • Heterogeneous feedback across user groups suggests that mixture or hierarchical extensions to the Bradley-Terry-Luce model would be a natural next modeling step.
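
To illustrate the kind of extension the last bullet gestures at, here is a hypothetical two-component mixture-of-BTL log-likelihood. The feature-difference representation, the number of components, and the parameter names are assumptions for this sketch, not a proposal from the survey.

```python
# Hypothetical two-component mixture of BTL models for heterogeneous raters:
# annotator group k has its own reward weights w_k, and a preference pair is
# generated by group k with probability pi_k. This illustrates the extension
# suggested above; it is not a method described in the paper.
import numpy as np

def mixture_btl_loglik(diff, weights, mix):
    """log sum_k pi_k * sigma(diff @ w_k), summed over preference pairs.

    diff    : (n, d) feature differences phi(winner) - phi(loser)
    weights : (K, d) per-group reward weights w_k
    mix     : (K,)   mixing proportions pi_k, summing to one
    """
    logits = diff @ weights.T                 # (n, K) per-group reward margins
    probs = 1.0 / (1.0 + np.exp(-logits))     # per-group win probabilities
    return float(np.sum(np.log(probs @ mix)))

rng = np.random.default_rng(1)
diff = rng.normal(size=(100, 4))              # toy feature differences
weights = rng.normal(size=(2, 4))             # two annotator groups
print(mixture_btl_loglik(diff, weights, np.array([0.6, 0.4])))
```

Fitting such a mixture (for example by EM over the group assignments) would be one concrete way to test whether a single BTL reward suffices for a given preference dataset.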

Load-bearing premise

That standard preference models such as Bradley-Terry-Luce and associated uncertainty methods are sufficient to capture the real variability and heterogeneity in human judgments.

What would settle it

A large-scale preference dataset in which the Bradley-Terry-Luce model yields systematically biased reward estimates that cannot be corrected by added noise terms or simple mixture extensions, leading to measurably worse policy alignment than alternative models.

Figures

Figures reproduced from arXiv: 2604.02507 by Chengchun Shi, Pangpang Liu, Will Wei Sun.

Figure 1. RLHF pipeline for aligning large language models.
Original abstract

Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it relies on noisy, subjective, and often heterogeneous feedback to learn reward models and optimize policies. This survey provides a statistical perspective on RLHF, focusing primarily on the LLM alignment setting. We introduce the main components of RLHF, including supervised fine-tuning, reward modeling, and policy optimization, and relate them to familiar statistical ideas such as Bradley-Terry-Luce (BTL) model, latent utility estimation, active learning, experimental design, and uncertainty quantification. We review methods for learning reward functions from pairwise preference data and for optimizing policies through both two-stage RLHF pipelines and emerging one-stage approaches such as direct preference optimization. We further discuss recent extensions including reinforcement learning from AI feedback, inference-time algorithms, and reinforcement learning from verifiable rewards, as well as benchmark datasets, evaluation protocols, and open-source frameworks that support RLHF research. We conclude by highlighting open challenges in RLHF. An accompanying GitHub demo https://github.com/Pangpang-Liu/RLHF_demo illustrates key components of the RLHF pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. This survey provides a statistical perspective on RLHF for aligning large language models. It introduces the core components—supervised fine-tuning, reward modeling from pairwise preferences, and policy optimization—and maps them to established statistical frameworks including the Bradley-Terry-Luce model, latent utility estimation, active learning, experimental design, and uncertainty quantification. The paper reviews both two-stage RLHF pipelines and one-stage alternatives such as direct preference optimization, covers extensions including RL from AI feedback, inference-time methods, and RL from verifiable rewards, and discusses benchmarks, evaluation protocols, and open-source tools. It concludes by highlighting open challenges in the area, accompanied by a GitHub demonstration repository.

Significance. If the mappings and reviews hold, the paper offers a timely synthesis that connects practical RLHF techniques to statistical principles for handling noisy and heterogeneous feedback. This organization can help researchers identify connections to active learning and uncertainty quantification, while the explicit discussion of open challenges and the accompanying demo support accessibility and future work in the field.

minor comments (2)
  1. The abstract provides the GitHub demo URL, but repeating the exact link in the conclusion or a dedicated resources section would improve reader convenience.
  2. In the review of reward modeling methods, ensure that the presentation of the BTL model includes a brief reminder of its key identifiability assumptions to aid readers less familiar with the statistical literature.
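
For context, the caveat in question is a textbook property of BTL rather than something specific to this paper: the likelihood depends only on reward differences, so the reward is identified only up to an additive, prompt-dependent shift.

```latex
P\big(y_1 \succ y_2 \mid x\big)
  = \sigma\big(r(x, y_1) - r(x, y_2)\big)
  = \sigma\big([r(x, y_1) + c(x)] - [r(x, y_2) + c(x)]\big)
  \quad \text{for any prompt-dependent shift } c(x).
```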

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and the recommendation to accept. The review accurately captures the paper's focus on connecting RLHF components to statistical models such as the Bradley-Terry-Luce framework, active learning, and uncertainty quantification, as well as its coverage of both established pipelines and recent extensions.

Circularity Check

0 steps flagged

No significant circularity: survey organizes cited literature without new derivations

full rationale

This is a survey paper that introduces RLHF components and relates them to established statistical ideas (BTL model, active learning, uncertainty quantification) drawn from prior literature. It reviews methods for reward modeling and policy optimization, discusses extensions, and flags open challenges without presenting original theorems, proofs, fitted parameters, or predictions. No equations reduce to self-inputs by construction, no self-citations serve as load-bearing uniqueness claims, and no ansatzes or renamings are smuggled in. The central contribution is organizational and expository, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a survey and therefore relies on standard statistical models and assumptions already present in the reviewed literature without introducing new free parameters or invented entities.

axioms (1)
  • domain assumption: Human preferences can be modeled using the Bradley-Terry-Luce model for pairwise comparisons.
    The survey explicitly relates reward modeling in RLHF to the BTL model as a core statistical tool.

pith-pipeline@v0.9.0 · 5517 in / 1185 out tokens · 47692 ms · 2026-05-13T20:07:21.340630+00:00 · methodology

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Variance-aware Reward Modeling with Anchor Guidance

    stat.ML 2026-05 unverdicted novelty 7.0

    Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...

  2. Perturbation is All You Need for Extrapolating Language Models

    stat.ML 2026-05 unverdicted novelty 6.0

    Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.

  3. Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

    cs.LG 2026-04 unverdicted novelty 6.0

    DRRO for RLHF replaces worst-case value with worst-case regret in Wasserstein DRO, producing an exact water-filling solution under l1 ambiguity and a practical sampled-bonus algorithm that reduces proxy over-optimization.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 3 Pith papers · 12 internal anchors
