Reinforcement Learning from Human Feedback: A Statistical Perspective
Pith reviewed 2026-05-13 20:07 UTC · model grok-4.3
The pith
RLHF models noisy human preferences as observations of latent utilities, estimating reward functions that align large language model policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that RLHF is best understood as a statistical estimation problem in which human preferences supply noisy observations of an underlying reward function, typically modeled by the Bradley-Terry-Luce framework, and that both two-stage and direct policy optimization methods can be analyzed and improved using principles from experimental design and uncertainty quantification.
What carries the argument
The Bradley-Terry-Luce model, which treats observed pairwise preferences as probabilistic outcomes driven by latent reward differences, supplies the mechanism for turning preference data into estimated reward functions that then guide policy optimization.
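That mechanism can be made concrete in a few lines. The sketch below (our illustration, not code from the paper or its demo repository; all names and hyperparameters are assumptions) simulates pairwise comparisons from latent rewards and recovers them by gradient ascent on the BTL log-likelihood:

```python
import numpy as np

# Illustrative BTL fit (not from the paper): item i beats item j with
# probability sigmoid(r_i - r_j); we recover r from sampled comparisons
# by gradient ascent on the log-likelihood.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_btl(winners, losers, n_items, lr=1.0, steps=500):
    """MLE of BTL rewards from index arrays of winners and losers."""
    r = np.zeros(n_items)
    for _ in range(steps):
        p = sigmoid(r[winners] - r[losers])   # model prob. winner beats loser
        grad = np.zeros(n_items)
        np.add.at(grad, winners, 1.0 - p)     # d log-lik / d r_winner
        np.add.at(grad, losers, -(1.0 - p))   # d log-lik / d r_loser
        r += lr * grad / len(winners)
        r -= r.mean()                         # pin the additive-shift gauge
    return r

rng = np.random.default_rng(0)
true_r = np.array([1.0, 0.0, -1.0])
i = rng.integers(0, 3, size=3000)
j = (i + rng.integers(1, 3, size=3000)) % 3   # ensures j != i
i_wins = rng.random(3000) < sigmoid(true_r[i] - true_r[j])
winners = np.where(i_wins, i, j)
losers = np.where(i_wins, j, i)

r_hat = fit_btl(winners, losers, 3)           # recovers the ordering of true_r
```

With a few thousand comparisons the estimated rewards match the true ordering; the mean-centering step is needed because BTL identifies rewards only up to an additive constant.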
If this is right
- Reward models estimated under Bradley-Terry-Luce assumptions produce policies whose performance scales with the volume and quality of preference data collected via active learning.
- Direct preference optimization bypasses explicit reward modeling yet inherits the same statistical identifiability conditions as two-stage RLHF.
- Uncertainty-aware data collection reduces the number of human queries needed to reach a target alignment level.
- Extensions to verifiable rewards replace subjective preferences with objective correctness signals, altering the statistical estimation target from latent utilities to observable outcomes.
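The second bullet can be seen directly in the per-pair DPO objective, which is the BTL negative log-likelihood evaluated at an implicit reward. The sketch below is ours, with illustrative numbers, not the paper's code:

```python
import numpy as np

# Per-pair DPO loss sketch (illustrative): the BTL negative log-likelihood
# with the implicit reward beta * (log pi(y|x) - log pi_ref(y|x)).

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_* are policy log-probs of the preferred (w) / rejected (l)
    response; ref_logp_* are the frozen reference model's log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.log1p(np.exp(-margin))          # equals -log sigmoid(margin)

# At initialization (policy == reference) the margin is 0 and the loss is
# log 2; it falls as the policy shifts mass toward the preferred response.
loss_init = dpo_loss(-11.0, -11.0, -11.0, -11.0)
loss_later = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Because the same BTL likelihood sits inside the loss, identifiability issues in the reward differences carry over unchanged to the one-stage method.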
Where Pith is reading between the lines
- Statistical analysis of RLHF pipelines could guide the design of benchmarks that measure not only final performance but also the stability of learned rewards across different preference distributions.
- Inference-time algorithms may allow dynamic adjustment of policies without full retraining, provided uncertainty estimates remain reliable at deployment.
- Heterogeneous feedback across user groups suggests that mixture or hierarchical extensions to the Bradley-Terry-Luce model would be a natural next modeling step.
Load-bearing premise
That standard preference models such as Bradley-Terry-Luce and associated uncertainty methods are sufficient to capture the real variability and heterogeneity in human judgments.
What would settle it
A large-scale preference dataset in which the Bradley-Terry-Luce model yields systematically biased reward estimates that cannot be corrected by added noise terms or simple mixture extensions, leading to measurably worse policy alignment than alternative models.
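A toy version of such a test is easy to simulate: when two annotator groups hold opposing utilities, the pooled win rate sits near 0.5, so a single BTL fit reports near-indifference while each group is strongly decided. Everything below is our illustrative construction, not data from the paper:

```python
import numpy as np

# Toy check of the load-bearing premise: two annotator groups with opposing
# reward gaps on one response pair. A single pooled BTL fit hides the
# heterogeneity that a per-group fit exposes.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
gap_a, gap_b = 2.0, -2.0              # group A prefers item 0; group B item 1
n = 4000
group = rng.random(n) < 0.5           # which group each annotator belongs to
gaps = np.where(group, gap_a, gap_b)
prefers_0 = rng.random(n) < sigmoid(gaps)

# For a single pair, the pooled BTL MLE of the gap is the logit of the
# pooled win rate -- here close to 0, i.e. "no preference".
pooled_rate = prefers_0.mean()
pooled_gap = np.log(pooled_rate / (1 - pooled_rate))

# Per-group win rates reveal the two decisive subpopulations.
rate_a = prefers_0[group].mean()
rate_b = prefers_0[~group].mean()
```

A simple two-component mixture corrects this case, so the decisive evidence the section describes would need bias that survives such extensions.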
Original abstract
Reinforcement learning from human feedback (RLHF) has emerged as a central framework for aligning large language models (LLMs) with human preferences. Despite its practical success, RLHF raises fundamental statistical questions because it relies on noisy, subjective, and often heterogeneous feedback to learn reward models and optimize policies. This survey provides a statistical perspective on RLHF, focusing primarily on the LLM alignment setting. We introduce the main components of RLHF, including supervised fine-tuning, reward modeling, and policy optimization, and relate them to familiar statistical ideas such as Bradley-Terry-Luce (BTL) model, latent utility estimation, active learning, experimental design, and uncertainty quantification. We review methods for learning reward functions from pairwise preference data and for optimizing policies through both two-stage RLHF pipelines and emerging one-stage approaches such as direct preference optimization. We further discuss recent extensions including reinforcement learning from AI feedback, inference-time algorithms, and reinforcement learning from verifiable rewards, as well as benchmark datasets, evaluation protocols, and open-source frameworks that support RLHF research. We conclude by highlighting open challenges in RLHF. An accompanying GitHub demo https://github.com/Pangpang-Liu/RLHF_demo illustrates key components of the RLHF pipeline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This survey provides a statistical perspective on RLHF for aligning large language models. It introduces the core components—supervised fine-tuning, reward modeling from pairwise preferences, and policy optimization—and maps them to established statistical frameworks including the Bradley-Terry-Luce model, latent utility estimation, active learning, experimental design, and uncertainty quantification. The paper reviews both two-stage RLHF pipelines and one-stage alternatives such as direct preference optimization, covers extensions including RL from AI feedback, inference-time methods, and RL from verifiable rewards, and discusses benchmarks, evaluation protocols, and open-source tools. It concludes by highlighting open challenges in the area, accompanied by a GitHub demonstration repository.
Significance. If the mappings and reviews hold, the paper offers a timely synthesis that connects practical RLHF techniques to statistical principles for handling noisy and heterogeneous feedback. This organization can help researchers identify connections to active learning and uncertainty quantification, while the explicit discussion of open challenges and the accompanying demo support accessibility and future work in the field.
Minor comments (2)
- The abstract provides the GitHub demo URL, but repeating the exact link in the conclusion or a dedicated resources section would improve reader convenience.
- In the review of reward modeling methods, ensure that the presentation of the BTL model includes a brief reminder of its key identifiability assumptions to aid readers less familiar with the statistical literature.
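The identifiability caveat in the second comment fits in a one-screen illustration (ours, not the paper's): BTL probabilities depend only on reward differences, so the likelihood is invariant to shifting every reward by a constant, and a normalization such as sum-to-zero or a pinned reference item is required.

```python
import numpy as np

# BTL identifiability in miniature: adding any constant to all rewards
# leaves every pairwise comparison probability unchanged.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

r = np.array([0.7, 0.0, -0.7])
shifted = r + 123.0                    # arbitrary additive shift, same model

p = sigmoid(r[:, None] - r[None, :])                 # all pairwise probs
p_shifted = sigmoid(shifted[:, None] - shifted[None, :])
```

Since `p` and `p_shifted` are identical, only reward differences are estimable from preference data alone.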
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and the recommendation to accept. The review accurately captures the paper's focus on connecting RLHF components to statistical models such as the Bradley-Terry-Luce framework, active learning, and uncertainty quantification, as well as its coverage of both established pipelines and recent extensions.
Circularity Check
No significant circularity: survey organizes cited literature without new derivations
Full rationale
This is a survey paper that introduces RLHF components and relates them to established statistical ideas (BTL model, active learning, uncertainty quantification) drawn from prior literature. It reviews methods for reward modeling and policy optimization, discusses extensions, and flags open challenges without presenting original theorems, proofs, fitted parameters, or predictions. No equations reduce to self-inputs by construction, no self-citations serve as load-bearing uniqueness claims, and no ansatzes or renamings are smuggled in. The central contribution is organizational and expository, remaining self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Human preferences can be modeled using the Bradley-Terry-Luce model for pairwise comparisons.
Forward citations
Cited by 3 Pith papers
- Variance-aware Reward Modeling with Anchor Guidance: Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...
- Perturbation is All You Need for Extrapolating Language Models: Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.
- Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback: DRRO for RLHF replaces worst-case value with worst-case regret in Wasserstein DRO, producing an exact water-filling solution under l1 ambiguity and a practical sampled-bonus algorithm that reduces proxy over-optimization.