RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Pith reviewed 2026-05-18 00:41 UTC · model grok-4.3
The pith
Generative foundation models can be aligned by supervised fine-tuning on the samples a reward model ranks highest, in place of reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Reward rAnked FineTuning (RAFT) effectively aligns generative models by selecting high-quality samples based on a reward model and fine-tuning the model on these filtered samples, leading to improvements in both reward learning and other automated metrics for large language models and diffusion models.
What carries the argument
Reward rAnked FineTuning (RAFT), the process of ranking generated samples by reward score and performing supervised fine-tuning on the retained high-reward subset.
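A minimal sketch of one RAFT-style filtering round, assuming best-of-n selection per prompt. The names `raft_filter`, `generate`, and `reward`, the default values, and the toy generator and reward in the usage lines are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, List, Tuple


def raft_filter(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical sampler: (prompt, n) -> n responses
    reward: Callable[[str, str], float],        # hypothetical fixed reward model
    n_samples: int = 4,                         # assumed value; a free hyper-parameter
    keep: int = 1,                              # assumed top-k selection rule
) -> List[Tuple[str, str]]:
    """Return (prompt, response) pairs to use for supervised fine-tuning."""
    if not 1 <= keep <= n_samples:
        raise ValueError("keep must be between 1 and n_samples")
    selected = []
    for prompt in prompts:
        candidates = generate(prompt, n_samples)
        # Rank candidates by reward, highest first, and retain the top `keep`.
        ranked = sorted(candidates, key=lambda r: reward(prompt, r), reverse=True)
        selected.extend((prompt, r) for r in ranked[:keep])
    return selected


# Toy stand-ins so the sketch runs end to end (not a real model or reward).
def toy_generate(prompt: str, n: int) -> List[str]:
    return [f"{prompt} " + "good " * i for i in range(n)]


def toy_reward(prompt: str, response: str) -> float:
    return float(response.count("good"))


sft_data = raft_filter(["Review:"], toy_generate, toy_reward, n_samples=4, keep=1)
```

The retained pairs would then go to an ordinary supervised fine-tuning step, which is omitted here.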
If this is right
- Alignment of generative models becomes feasible without the instabilities of RL algorithms.
- Both language models and diffusion models can be aligned using the same filtering and fine-tuning procedure.
- Performance improves on reward-based metrics and other automated evaluations.
- Reward hacking may be reduced relative to full RLHF.
Where Pith is reading between the lines
- RAFT may require fewer computational resources than RLHF by avoiding online policy optimization.
- This method could be extended by iterating the process with updated reward models.
- Similar ranking and filtering ideas might apply to other alignment tasks beyond generative models.
Load-bearing premise
A fixed reward model can accurately distinguish desirable from undesirable samples without the model learning to game the reward function during fine-tuning.
What would settle it
If human evaluators rated RAFT-tuned outputs as no better than, or worse than, outputs from the original model or from RLHF-tuned versions on preference tasks, the central claim would be falsified.
Original abstract
Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently enhancing the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve the model performance in both reward learning and other automated metrics in both large language models and diffusion models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Reward rAnked FineTuning (RAFT) as a simpler alternative to RLHF for aligning generative foundation models. RAFT generates multiple samples per prompt, ranks them with a fixed reward model, retains only the highest-ranked samples, and performs standard supervised fine-tuning on the retained subset. The central claim is that this procedure yields stable improvements in both reward-model scores and other automated metrics for large language models and diffusion models while avoiding the instabilities and reward-hacking problems associated with RLHF.
Significance. If the empirical results are robust, RAFT would be a practically useful simplification of the alignment pipeline: it replaces RL optimization with ordinary SFT on filtered data, lowering computational cost and training instability. Demonstrating gains on both LLMs and diffusion models would further increase its relevance. The approach is not parameter-free (the number of samples generated per prompt remains a free hyper-parameter), but the core idea is straightforward and falsifiable.
major comments (3)
- [§3] §3 (Method description): the procedure for sample generation and selection is described only at a high level. No value or range is given for the number of samples drawn per prompt, nor is the precise selection rule (top-k, threshold on reward score, etc.) stated. Because the central claim rests on the quality of the filtered subset, these missing details prevent evaluation of whether the reported gains are reproducible or sensitive to the choice of k.
- [§4] §4 (Experiments): no head-to-head comparison against RLHF is presented, nor are there ablations that vary reward-model accuracy or inject controlled noise into the reward model. Without such controls it is impossible to substantiate the claim that RAFT is less prone to reward hacking than RLHF when the reward model is imperfect.
- [§5] §5 (Results): the reported improvements in reward scores and automated metrics are stated without effect sizes, number of independent runs, error bars, or statistical tests. The abstract itself supplies no quantitative numbers, making it impossible to judge whether the observed gains exceed what would be expected from simply increasing supervised data volume.
minor comments (2)
- [§3] The reward-model notation and the precise definition of the filtered training set could be written with a short equation or pseudocode block for clarity.
- [§2] A brief discussion of how RAFT relates to prior work on rejection sampling or best-of-n decoding would help situate the contribution.
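One way the definition requested in the first minor comment might be written, as a sketch under assumed notation. The symbols $\pi_\theta$, $r$, $n$, $k$, and $\mathcal{B}$ are introduced here for illustration and are not taken from the paper.

```latex
% Assumed notation: x is a prompt, \pi_\theta the current model, r a fixed reward model.
% Requires amsmath for \operatorname* and \text.
y_1, \dots, y_n \sim \pi_\theta(\cdot \mid x), \qquad
\mathcal{B}(x) = \operatorname*{arg\,top\text{-}k}_{y \in \{y_1, \dots, y_n\}} r(x, y),
\qquad
\mathcal{L}_{\mathrm{SFT}}(\theta) = - \sum_{x} \sum_{y \in \mathcal{B}(x)} \log \pi_\theta(y \mid x).
```

Here the retained set $\mathcal{B}(x)$ is treated as fixed training data when the loss is minimized, which is what distinguishes this recipe from a policy-gradient update.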
Simulated Author's Rebuttal
We appreciate the referee's constructive feedback on our manuscript. We address each major comment below and indicate revisions made to strengthen the paper.
Point-by-point responses
- Referee: §3 (Method description): the procedure for sample generation and selection is described only at a high level. No value or range is given for the number of samples drawn per prompt, nor is the precise selection rule (top-k, threshold on reward score, etc.) stated. Because the central claim rests on the quality of the filtered subset, these missing details prevent evaluation of whether the reported gains are reproducible or sensitive to the choice of k.
  Authors: We agree that the original method section provided only a high-level description. In the revised manuscript we now explicitly state that 4 samples are generated per prompt and the single highest-reward sample is retained for fine-tuning. We also report the range of sample counts (2–8) explored in ablations and include pseudocode for the full RAFT procedure to support reproducibility.
  Revision: yes
- Referee: §4 (Experiments): no head-to-head comparison against RLHF is presented, nor are there ablations that vary reward-model accuracy or inject controlled noise into the reward model. Without such controls it is impossible to substantiate the claim that RAFT is less prone to reward hacking than RLHF when the reward model is imperfect.
  Authors: We acknowledge the value of direct RLHF comparisons and controlled reward-model ablations. We have added an ablation that injects synthetic noise into reward scores to test robustness under imperfect rewards. A full-scale head-to-head RLHF run was omitted due to prohibitive compute cost; we instead reference published RLHF baselines and discuss this limitation explicitly in the revised text.
  Revision: partial
- Referee: §5 (Results): the reported improvements in reward scores and automated metrics are stated without effect sizes, number of independent runs, error bars, or statistical tests. The abstract itself supplies no quantitative numbers, making it impossible to judge whether the observed gains exceed what would be expected from simply increasing supervised data volume.
  Authors: We agree that statistical details were insufficient. The revised results section now reports means and standard deviations over three independent runs, includes effect sizes, and adds t-test p-values. The abstract has been updated with the key quantitative improvements observed on both LLM and diffusion tasks.
  Revision: yes
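A minimal sketch of the kind of noise-injection check mentioned in the response on §4, assuming Gaussian noise added to reward scores and top-1 selection per prompt. The function name `selection_agreement`, the noise model, and the toy reward values are hypothetical and do not describe the authors' experimental setup.

```python
import random
from typing import List


def selection_agreement(
    rewards_per_prompt: List[List[float]],  # clean reward scores, one list per prompt
    noise_std: float,                       # std of the Gaussian noise added to each score
    seed: int = 0,
) -> float:
    """Fraction of prompts whose top-1 sample is unchanged after adding noise."""
    rng = random.Random(seed)
    unchanged = 0
    for rewards in rewards_per_prompt:
        clean_best = max(range(len(rewards)), key=rewards.__getitem__)
        noisy = [r + rng.gauss(0.0, noise_std) for r in rewards]
        noisy_best = max(range(len(noisy)), key=noisy.__getitem__)
        unchanged += clean_best == noisy_best
    return unchanged / len(rewards_per_prompt)


# Toy clean scores for three prompts with four samples each (illustrative values only).
toy_rewards = [[0.1, 0.9, 0.4, 0.3], [0.5, 0.2, 0.8, 0.7], [0.6, 0.6, 0.1, 0.95]]
for std in (0.0, 0.1, 1.0):
    print(std, selection_agreement(toy_rewards, std))
```

With zero noise the agreement is 1.0 by construction; how quickly it falls as `noise_std` grows gives a rough picture of how sensitive the filtered training set is to reward-model error.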
Circularity Check
No significant circularity; RAFT uses an external reward model as input.
Full rationale
The paper describes RAFT as a practical procedure that takes a pre-existing reward model as an external input, generates samples from the base model, filters them by ranking with that fixed reward model, and then applies standard supervised fine-tuning to the retained high-reward subset. No equation or derivation reduces the claimed performance gains to a quantity defined by the same fitted reward model in a self-referential loop. Improvements are reported on separate automated metrics beyond the reward score itself, and the method is positioned as an alternative to RLHF without invoking self-citations, uniqueness theorems, or ansatzes that collapse back to the paper's own fitted values. The derivation chain remains self-contained as an empirical training recipe.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of samples generated per prompt
axioms (1)
- domain assumption: Reward model scores correlate with human preferences on the target distribution
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: Our studies show that RAFT can effectively improve the model performance in both reward learning and other automated metrics in both large language models and diffusion models.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
- Flow-GRPO: Training Flow Matching Models via Online RL
  Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
- Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
  PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
- Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation
  BLADE uses Bayesian list-wise alignment with dynamic estimation to create a self-evolving target that overcomes limitations of static references in LLM-based recommendation, yielding sustained gains in ranking and com...
- Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
  SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
- CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
  CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
- Response Time Enhances Alignment with Heterogeneous Preferences
  Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
- AlignCultura: Towards Culturally Aligned Large Language Models?
  Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
- Bias at the End of the Score
  Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.
- Improving Video Generation with Human Feedback
  A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
- Directly Fine-Tuning Diffusion Models on Differentiable Rewards
  DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.
- Reinforced Self-Training (ReST) for Language Modeling
  ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
- ReMedi: Reasoner for Medical Clinical Prediction
  ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
- Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
  Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
- Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation
  Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.
- A Survey of Reinforcement Learning for Large Reasoning Models
  A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.