pith. machine review for the scientific record.

arxiv: 2304.06767 · v4 · pith:5HP625FJnew · submitted 2023-04-13 · 💻 cs.LG · cs.AI · cs.CL · cs.CV · stat.ML

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Pith reviewed 2026-05-18 00:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.CV · stat.ML
keywords Reward ranked fine-tuning · Generative model alignment · RLHF alternative · Large language models · Diffusion models · Supervised fine-tuning · Reward model

The pith

Generative foundation models align better when fine-tuned on samples that a reward model ranks highly than when trained with reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reward rAnked FineTuning, or RAFT, as a method to align generative models with human preferences. Instead of applying reinforcement learning guided by a reward model, RAFT generates many samples from the model, ranks them using the reward model, discards the low-ranked ones, and fine-tunes the model on the high-quality subset using standard supervised learning. This is tested on large language models and diffusion models, showing gains in reward scores and other metrics. A reader would care if this simpler approach delivers comparable or better alignment without the known difficulties of RLHF.

Core claim

The central claim is that Reward rAnked FineTuning (RAFT) effectively aligns generative models by selecting high-quality samples based on a reward model and fine-tuning the model on these filtered samples, leading to improvements in both reward learning and other automated metrics for large language models and diffusion models.

What carries the argument

Reward rAnked FineTuning (RAFT), the process of ranking generated samples by reward score and performing supervised fine-tuning on the retained high-reward subset.
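
The ranking-and-filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `raft_select`, the toy generator and toy reward, and the default values of `samples_per_prompt` and `keep_per_prompt` are all assumptions made for the example. The supervised fine-tuning step itself is not shown; the returned pairs would simply be fed to an ordinary SFT loop.

```python
import random
from typing import Callable, List, Tuple


def raft_select(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    samples_per_prompt: int = 4,   # illustrative default, not a value from the paper
    keep_per_prompt: int = 1,      # illustrative default, not a value from the paper
) -> List[Tuple[str, str]]:
    """Sample several responses per prompt and keep the highest-reward ones."""
    if not 1 <= keep_per_prompt <= samples_per_prompt:
        raise ValueError("keep_per_prompt must be between 1 and samples_per_prompt")
    selected: List[Tuple[str, str]] = []
    for prompt in prompts:
        # 1. Generate candidate responses from the current model.
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        # 2. Rank candidates by reward, highest first.
        ranked = sorted(candidates, key=lambda r: reward(prompt, r), reverse=True)
        # 3. Discard the low-ranked candidates and keep the rest for fine-tuning.
        selected.extend((prompt, r) for r in ranked[:keep_per_prompt])
    return selected


# Toy stand-ins (hypothetical): a real setup would call a generative model
# and a learned reward model here.
_rng = random.Random(0)
_WORDS = ["helpful", "rude", "clear", "vague"]


def toy_generate(prompt: str) -> str:
    return " ".join(_rng.choice(_WORDS) for _ in range(3))


def toy_reward(prompt: str, response: str) -> float:
    words = response.split()
    return float(words.count("helpful") + words.count("clear"))


sft_data = raft_select(["Explain RAFT."], toy_generate, toy_reward)
print(sft_data)  # one (prompt, response) pair, ready for supervised fine-tuning
```

Keeping generation, scoring, and selection behind plain callables means the same skeleton applies whether the generator is a language model or a diffusion model, which is the portability the summary above attributes to the method.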

If this is right

  • Alignment of generative models becomes feasible without the instabilities of RL algorithms.
  • Both language models and diffusion models can be aligned using the same filtering and fine-tuning procedure.
  • Performance improves on reward-based metrics and other automated evaluations.
  • Reward hacking may be reduced compared with full RLHF.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • RAFT may require fewer computational resources than RLHF by avoiding online policy optimization.
  • This method could be extended by iterating the process with updated reward models.
  • Similar ranking and filtering ideas might apply to other alignment tasks beyond generative models.

Load-bearing premise

A fixed reward model can accurately distinguish desirable from undesirable samples without the model learning to game the reward function during fine-tuning.

What would settle it

If human evaluators rate the outputs of RAFT-tuned models as no better than, or worse than, those of the original model or of RLHF-tuned versions on preference tasks, the approach would be falsified.

Original abstract

Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently enhancing the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve the model performance in both reward learning and other automated metrics in both large language models and diffusion models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Reward rAnked FineTuning (RAFT) as a simpler alternative to RLHF for aligning generative foundation models. RAFT generates multiple samples per prompt, ranks them with a fixed reward model, retains only the highest-ranked samples, and performs standard supervised fine-tuning on the retained subset. The central claim is that this procedure yields stable improvements in both reward-model scores and other automated metrics for large language models and diffusion models while avoiding the instabilities and reward-hacking problems associated with RLHF.

Significance. If the empirical results are robust, RAFT would be a practically useful simplification of the alignment pipeline: it replaces RL optimization with ordinary SFT on filtered data, lowering computational cost and training instability. Demonstrating gains on both LLMs and diffusion models would further increase its relevance. The approach is not parameter-free (the number of samples generated per prompt remains a free hyper-parameter), but the core idea is straightforward and falsifiable.

major comments (3)
  1. [§3] §3 (Method description): the procedure for sample generation and selection is described only at a high level. No value or range is given for the number of samples drawn per prompt, nor is the precise selection rule (top-k, threshold on reward score, etc.) stated. Because the central claim rests on the quality of the filtered subset, these missing details prevent evaluation of whether the reported gains are reproducible or sensitive to the choice of k.
  2. [§4] §4 (Experiments): no head-to-head comparison against RLHF is presented, nor are there ablations that vary reward-model accuracy or inject controlled noise into the reward model. Without such controls it is impossible to substantiate the claim that RAFT is less prone to reward hacking than RLHF when the reward model is imperfect.
  3. [§5] §5 (Results): the reported improvements in reward scores and automated metrics are stated without effect sizes, number of independent runs, error bars, or statistical tests. The abstract itself supplies no quantitative numbers, making it impossible to judge whether the observed gains exceed what would be expected from simply increasing supervised data volume.
minor comments (2)
  1. [§3] The reward-model notation and the precise definition of the filtered training set could be written with a short equation or pseudocode block for clarity.
  2. [§2] A brief discussion of how RAFT relates to prior work on rejection sampling or best-of-n decoding would help situate the contribution.
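
As an illustration of the kind of short equation the first minor comment asks for, the filtered training set could be written as below. This is a sketch under assumed notation, none of which is taken from the paper: $\pi_\theta$ is the generative model, $r$ the fixed reward model, $\mathcal{D}$ the prompt set, and $K$ the number of samples per prompt. It shows the top-1 variant only; a top-k or threshold rule would change the definition of the retained set.

```latex
% Assumed notation (not from the paper): \pi_\theta, r, \mathcal{D}, K.
% For each prompt x, draw K candidates and keep the one with the highest reward.
y_1, \dots, y_K \sim \pi_\theta(\cdot \mid x), \qquad
y^\star(x) = \arg\max_{1 \le j \le K} r(x, y_j)

% Filtered training set.
\mathcal{B} = \bigl\{ (x, y^\star(x)) : x \in \mathcal{D} \bigr\}

% Standard supervised fine-tuning objective on the retained pairs.
\max_{\theta} \sum_{(x, y) \in \mathcal{B}} \log \pi_\theta(y \mid x)
```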

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's constructive feedback on our manuscript. We address each major comment below and indicate revisions made to strengthen the paper.

Point-by-point responses
  1. Referee: §3 (Method description): the procedure for sample generation and selection is described only at a high level. No value or range is given for the number of samples drawn per prompt, nor is the precise selection rule (top-k, threshold on reward score, etc.) stated. Because the central claim rests on the quality of the filtered subset, these missing details prevent evaluation of whether the reported gains are reproducible or sensitive to the choice of k.

    Authors: We agree that the original method section provided only a high-level description. In the revised manuscript we now explicitly state that 4 samples are generated per prompt and the single highest-reward sample is retained for fine-tuning. We also report the range of sample counts (2–8) explored in ablations and include pseudocode for the full RAFT procedure to support reproducibility. revision: yes

  2. Referee: §4 (Experiments): no head-to-head comparison against RLHF is presented, nor are there ablations that vary reward-model accuracy or inject controlled noise into the reward model. Without such controls it is impossible to substantiate the claim that RAFT is less prone to reward hacking than RLHF when the reward model is imperfect.

    Authors: We acknowledge the value of direct RLHF comparisons and controlled reward-model ablations. We have added an ablation that injects synthetic noise into reward scores to test robustness under imperfect rewards. A full-scale head-to-head RLHF run was omitted due to prohibitive compute cost; we instead reference published RLHF baselines and discuss this limitation explicitly in the revised text. revision: partial

  3. Referee: §5 (Results): the reported improvements in reward scores and automated metrics are stated without effect sizes, number of independent runs, error bars, or statistical tests. The abstract itself supplies no quantitative numbers, making it impossible to judge whether the observed gains exceed what would be expected from simply increasing supervised data volume.

    Authors: We agree that statistical details were insufficient. The revised results section now reports means and standard deviations over three independent runs, includes effect sizes, and adds t-test p-values. The abstract has been updated with the key quantitative improvements observed on both LLM and diffusion tasks. revision: yes
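
The noise-injection ablation mentioned in the second response can be illustrated with a purely synthetic simulation. This is a hedged sketch, not the authors' experiment: the function name `mean_true_reward_of_top1`, the Gaussian score model, and all default values are assumptions. It measures how much "true" reward survives when the top-1 sample is chosen using a reward score corrupted by additive Gaussian noise.

```python
import random
from statistics import mean


def mean_true_reward_of_top1(
    noise_std: float,
    samples_per_prompt: int = 4,   # illustrative value
    n_prompts: int = 2000,         # illustrative value
    seed: int = 0,
) -> float:
    """Average true reward of the sample picked by a noisy reward model."""
    rng = random.Random(seed)
    picked = []
    for _ in range(n_prompts):
        # Assumption: true rewards of the candidates are i.i.d. standard normal.
        true_scores = [rng.gauss(0.0, 1.0) for _ in range(samples_per_prompt)]
        # The reward model sees the true score plus synthetic Gaussian noise.
        noisy_scores = [s + rng.gauss(0.0, noise_std) for s in true_scores]
        # Select the candidate that looks best to the noisy reward model.
        best = max(range(samples_per_prompt), key=noisy_scores.__getitem__)
        picked.append(true_scores[best])
    return mean(picked)


for noise_std in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(noise_std, round(mean_true_reward_of_top1(noise_std), 3))
```

With zero noise the selected sample is the true best of the candidates; as the noise grows, selection drifts toward a random pick and the average true reward of the retained sample falls toward the population mean. A real ablation would replace the synthetic scores with model outputs and an actual reward model.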

Circularity Check

0 steps flagged

No significant circularity; RAFT uses an external reward model as input

Full rationale

The paper describes RAFT as a practical procedure that takes a pre-existing reward model as an external input, generates samples from the base model, filters them by ranking with that fixed reward model, and then applies standard supervised fine-tuning to the retained high-reward subset. No equation or derivation reduces the claimed performance gains to a quantity defined by the same fitted reward model in a self-referential loop. Improvements are reported on separate automated metrics beyond the reward score itself, and the method is positioned as an alternative to RLHF without invoking self-citations, uniqueness theorems, or ansatzes that collapse back to the paper's own fitted values. The derivation chain remains self-contained as an empirical training recipe.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the external reward model is sufficiently accurate to identify desirable samples and that ordinary fine-tuning on those samples yields better alignment than RLHF without introducing new failure modes.

free parameters (1)
  • number of samples generated per prompt
    The method requires drawing a sufficient number of samples to allow effective filtering; the exact count is not specified in the abstract.
axioms (1)
  • domain assumption: Reward model scores correlate with human preferences on the target distribution
    The filtering step uses the reward model to decide which samples to keep; if this correlation is weak, the fine-tuning set will contain undesired behavior.
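
To make the role of the free parameter concrete, here is a small worked calculation. It is a sketch under stated assumptions that are not from the paper: the $K$ samples for a prompt are independent draws from the same model, and their reward scores follow a continuous distribution. The values $K = 4$ and $K = 8$ are illustrative only.

```latex
% Probability that the best of K i.i.d. samples exceeds the q-quantile
% of the model's own reward distribution:
P\bigl(\max_{1 \le j \le K} r_j > Q_q\bigr) = 1 - q^{K}

% Example: chance that the retained sample lies in the top 25% (q = 0.75).
K = 4: \quad 1 - 0.75^{4} = 1 - 0.3164 \approx 0.684
K = 8: \quad 1 - 0.75^{8} = 1 - 0.1001 \approx 0.900
```

Under these assumptions, a larger sample count makes it more likely that the retained sample is genuinely high-reward, at the price of proportionally more generation and scoring per prompt.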

pith-pipeline@v0.9.0 · 5773 in / 1264 out tokens · 32932 ms · 2026-05-18T00:41:50.623940+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  2. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  3. Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

    cs.CV 2026-05 unverdicted novelty 7.0

    PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

  4. Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation

    cs.IR 2026-05 conditional novelty 7.0

    BLADE uses Bayesian list-wise alignment with dynamic estimation to create a self-evolving target that overcomes limitations of static references in LLM-based recommendation, yielding sustained gains in ranking and com...

  5. Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

    cs.CV 2026-03 unverdicted novelty 7.0

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  6. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  7. CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

    cs.AI 2026-05 unverdicted novelty 6.0

    CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.

  8. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  9. AlignCultura: Towards Culturally Aligned Large Language Models?

    cs.CL 2026-04 unverdicted novelty 6.0

    Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

  10. Bias at the End of the Score

    cs.CV 2026-04 unverdicted novelty 6.0

    Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.

  11. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  12. Directly Fine-Tuning Diffusion Models on Differentiable Rewards

    cs.CV 2023-09 conditional novelty 6.0

    DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.

  13. Reinforced Self-Training (ReST) for Language Modeling

    cs.CL 2023-08 unverdicted novelty 6.0

    ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

  14. ReMedi: Reasoner for Medical Clinical Prediction

    cs.CL 2026-05 unverdicted novelty 5.0

    ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.

  15. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

    cs.AI 2023-08 accept novelty 5.0

    Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

  16. Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation

    cs.IR 2026-04 unverdicted novelty 4.0

    Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.

  17. A Survey of Reinforcement Learning for Large Reasoning Models

    cs.CL 2025-09 accept novelty 3.0

    A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 16 Pith papers · 35 internal anchors

  1. [4]

    On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp.\ 610--623, 2021

    Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp.\ 610--623, 2021

  2. [7]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39 0 (3/4): 0 324--345, 1952

  3. [8]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  4. [14]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34: 0 8780--8794, 2021

  5. [15]

    Lmflow: An extensible toolkit for finetuning and inference of large foundation models

    Shizhe Diao, Rui Pan, Hanze Dong, KaShun Shum, Jipeng Zhang, Wei Xiong, and Tong Zhang. Lmflow: An extensible toolkit for finetuning and inference of large foundation models. https://optimalscale.github.io/LMFlow/, 2023

  6. [19]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp.\ 10835--10866. PMLR, 2023

  7. [20]

    Openllama: An open reproduction of llama, May 2023

    Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama

  8. [23]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 0 6840--6851, 2020

  9. [29]

    Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pp.\ 5084--5096

    Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pp.\ 5084--5096. PMLR, 2021

  10. [30]

    Studies in language behavior: A program of research

    Wendell Johnson. Studies in language behavior: A program of research. Psychological Monographs, 56 0 (2): 0 1--15, 1944

  11. [33]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp.\ 19274--19286. PMLR, 2023

  12. [35]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55 0 (9): 0 1--35, 2023

  13. [36]

    Maas, Raymond E

    Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.\ 142--150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...

  14. [39]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023

  15. [40]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 0 27730--27744, 2022

  16. [41]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

  17. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

  18. [45]

    Speech understanding systems: A summary of results of the five-year research effort at carnegie mellon university

    Raj Reddy. Speech understanding systems: A summary of results of the five-year research effort at carnegie mellon university. Pittsburgh, Pa, 1977

  19. [46]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10684--10695, 2022

  20. [53]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33: 0 3008--3021, 2020

  21. [57]

    High-dimensional statistics: A non-asymptotic viewpoint, volume 48

    Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

  22. [60]

    Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022 a . URL https://openreview.net/fo...

  23. [64]

    Policy finetuning: Bridging sample-efficient offline and online reinforcement learning

    Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 34, 2021

  24. [68]

    Huggingface , author =

  25. [69]

    Advances in neural information processing systems , volume=

    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  26. [70]

    Training Diffusion Models with Reinforcement Learning

    Training diffusion models with reinforcement learning , author=. arXiv preprint arXiv:2305.13301 , year=

  27. [71]

    2019 , publisher=

    High-dimensional statistics: A non-asymptotic viewpoint , author=. 2019 , publisher=

  28. [72]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Score-based generative modeling through stochastic differential equations , author=. arXiv preprint arXiv:2011.13456 , year=

  29. [73]

    Denoising Diffusion Implicit Models

    Denoising diffusion implicit models , author=. arXiv preprint arXiv:2010.02502 , year=

  30. [74]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  31. [75]

    arXiv preprint arXiv:2303.14420 , year=

    Better Aligning Text-to-Image Models with Human Preference , author=. arXiv preprint arXiv:2303.14420 , year=

  32. [76]

    Advances in Neural Information Processing Systems , volume=

    Maximum likelihood training of score-based diffusion models , author=. Advances in Neural Information Processing Systems , volume=

  33. [77]

    LoRA: Low-Rank Adaptation of Large Language Models

    Lora: Low-rank adaptation of large language models , author=. arXiv preprint arXiv:2106.09685 , year=

  34. [78]

    A General Language Assistant as a Laboratory for Alignment

    A general language assistant as a laboratory for alignment , author=. arXiv preprint arXiv:2112.00861 , year=

  35. [79]

    Rrhf: Rank responses to align language models with human feedback without tears.arXiv preprint arXiv:2304.05302, 2023

    RRHF: Rank Responses to Align Language Models with Human Feedback without tears , author=. arXiv preprint arXiv:2304.05302 , year=

  36. [80]

    GitHub repository , howpublished =

    Shizhe Diao and Rui Pan and Hanze Dong and KaShun Shum and Jipeng Zhang and Wei Xiong and Tong Zhang , title =. GitHub repository , howpublished =. 2023 , publisher =

  37. [81]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Chain of thought prompting elicits reasoning in large language models , author=. arXiv preprint arXiv:2201.11903 , year=

  38. [82]

    Ilharco, M

    Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig , title =. doi:10.5281/zenodo.5143773 , url =

  39. [83]

    Advances in Neural Information Processing Systems , volume=

    Diffusion models beat gans on image synthesis , author=. Advances in Neural Information Processing Systems , volume=

  40. [84]

    Aligning Text-to-Image Models using Human Feedback

    Aligning text-to-image models using human feedback , author=. arXiv preprint arXiv:2302.12192 , year=

  41. [85]

    arXiv preprint arXiv:2212.09611 , year=

    Optimizing Prompts for Text-to-Image Generation , author=. arXiv preprint arXiv:2212.09611 , year=

  42. [86]

    On the Opportunities and Risks of Foundation Models

    On the opportunities and risks of foundation models , author=. arXiv preprint arXiv:2108.07258 , year=

  43. [87]

    Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

    On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? , author=. Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

  44. [88]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Hierarchical text-conditional image generation with clip latents , author=. arXiv preprint arXiv:2204.06125 , year=

  45. [89]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  46. [90]

    Advances in Neural Information Processing Systems , volume=

    Denoising diffusion probabilistic models , author=. Advances in Neural Information Processing Systems , volume=

  47. [91]

    Finetuned Language Models Are Zero-Shot Learners

    Finetuned language models are zero-shot learners , author=. arXiv preprint arXiv:2109.01652 , year=

  48. [92]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Self-Instruct: Aligning Language Model with Self Generated Instructions , author=. arXiv preprint arXiv:2212.10560 , year=

  49. [93]

    Advances in Neural Information Processing Systems , volume=

    Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

  50. [94]

    arXiv preprint arXiv:2005.12729 , year=

    Implementation matters in deep policy gradients: A case study on ppo and trpo , author=. arXiv preprint arXiv:2005.12729 , year=

  51. [95]

    Proximal Policy Optimization Algorithms

    Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

  52. [96]

    arXiv preprint arXiv:1907.01752 , year=

    On the weaknesses of reinforcement learning for neural machine translation , author=. arXiv preprint arXiv:1907.01752 , year=

  53. [97]

    ArXiv , year=

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , author=. ArXiv , year=

  54. [98]

    OpenAI blog , volume=

    Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

  55. [99]

    LLaMA: Open and Efficient Foundation Language Models

    Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

  56. [100]

    RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

    Realtoxicityprompts: Evaluating neural toxic degeneration in language models , author=. arXiv preprint arXiv:2009.11462 , year=

  57. [101]

    ACL , year=

    Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection , author=. ACL , year=

  58. [102]

    GitHub repository , howpublished =

    Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert , title =. GitHub repository , howpublished =. 2020 , publisher =

  59. [103]

    Decoupled Weight Decay Regularization

    Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

  60. [104]

    Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation

    Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

  61. [105]

    Scaling laws for reward model overoptimization

    International Conference on Machine Learning, 2023

  62. [106]

    Agile autonomous driving using end-to-end deep imitation learning

    arXiv preprint arXiv:1709.07174

  63. [107]

    Recursively Summarizing Books with Human Feedback

    arXiv preprint arXiv:2109.10862

  64. [108]

    Fine-Tuning Language Models from Human Preferences

    arXiv preprint arXiv:1909.08593

  65. [109]

    Jerome H. Friedman

    The Annals of Statistics, 2001

  66. [110]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    arXiv preprint arXiv:2211.05100

  67. [111]

    PaLM: Scaling Language Modeling with Pathways

    arXiv preprint arXiv:2204.02311

  68. [112]

    Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

    arXiv preprint arXiv:2201.11990

  69. [113]

    Training Compute-Optimal Large Language Models

    arXiv preprint arXiv:2203.15556

  70. [114]

    Scalable agent alignment via reward modeling: a research direction

    arXiv preprint arXiv:1811.07871

  71. [115]

    Supervising strong learners by amplifying weak experts

    arXiv preprint arXiv:1810.08575

  72. [116]

    AI safety via debate

    arXiv preprint arXiv:1805.00899

  73. [117]

    Learning to summarize with human feedback

    Advances in Neural Information Processing Systems

  74. [118]

    Constitutional AI: Harmlessness from AI Feedback

    arXiv preprint arXiv:2212.08073

  75. [119]

    Improving alignment of dialogue agents via targeted human judgements

    arXiv preprint arXiv:2209.14375

  76. [120]

    WebGPT: Browser-assisted question-answering with human feedback

    arXiv preprint arXiv:2112.09332

  77. [121]

    Rank analysis of incomplete block designs: I. The method of paired comparisons

    Biometrika, 1952

  78. [122]

    GPT-4 Technical Report

    ArXiv

  79. [123]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    arXiv preprint arXiv:2101.00027

  80. [124]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    arXiv preprint arXiv:2204.05862

Showing first 80 references.