arxiv: 2304.06767 · v4 · pith:5HP625FJnew · submitted 2023-04-13 · cs.LG · cs.AI · cs.CL · cs.CV · stat.ML

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang

Pith reviewed 2026-05-18 00:41 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL · cs.CV · stat.ML

keywords Reward ranked fine-tuning · Generative model alignment · RLHF alternative · Large language models · Diffusion models · Supervised fine-tuning · Reward model

The pith

Generative foundation models align better by fine-tuning on samples ranked high by a reward model rather than using reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reward rAnked FineTuning, or RAFT, as a method to align generative models with human preferences. Instead of applying reinforcement learning guided by a reward model, RAFT generates many samples from the model, ranks them using the reward model, discards the low-ranked ones, and fine-tunes the model on the high-quality subset using standard supervised learning. The authors test this on large language models and diffusion models and report gains in reward scores and other automated metrics. The question for a reader is whether this simpler approach delivers comparable or better alignment without the known difficulties of RLHF.
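The generate, rank, filter step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `raft_select`, `toy_generate`, and `toy_reward` are hypothetical names, and the values `k=4` and `keep=1` are placeholders, since the abstract does not fix them. The supervised fine-tuning step is left as a comment because it depends on the training stack.

```python
import random
from typing import Callable, List, Tuple


def raft_select(
    prompts: List[str],
    generate: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    k: int = 4,      # samples drawn per prompt (illustrative value)
    keep: int = 1,   # samples retained per prompt (illustrative value)
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return the (prompt, response) pairs a RAFT-style step would fine-tune on."""
    rng = random.Random(seed)
    selected = []
    for prompt in prompts:
        # 1. Sample k candidate responses from the current model.
        candidates = [generate(prompt, rng) for _ in range(k)]
        # 2. Rank the candidates by the fixed reward model, best first.
        ranked = sorted(candidates, key=lambda r: reward(prompt, r), reverse=True)
        # 3. Keep the top `keep` candidates and discard the rest.
        selected.extend((prompt, r) for r in ranked[:keep])
    return selected


# Toy stand-ins for a generative model and a reward model (assumptions).
def toy_generate(prompt: str, rng: random.Random) -> str:
    return prompt + " " + rng.choice(["good", "fine", "bad", "awful"])


def toy_reward(prompt: str, response: str) -> float:
    scores = {"good": 1.0, "fine": 0.5, "bad": -0.5, "awful": -1.0}
    return scores[response.split()[-1]]


batch = raft_select(["Review:", "Summary:"], toy_generate, toy_reward)
print(batch)
# 4. A real pipeline would now run ordinary supervised fine-tuning on `batch`
#    and could repeat the loop with the updated model.
```

Because the reward model is only used to rank and filter, any scoring function with the signature `reward(prompt, response) -> float` can be swapped in without changing the training step.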

Core claim

The central claim is that Reward rAnked FineTuning (RAFT) effectively aligns generative models by selecting high-quality samples based on a reward model and fine-tuning the model on these filtered samples, leading to improvements in both reward learning and other automated metrics for large language models and diffusion models.

What carries the argument

Reward rAnked FineTuning (RAFT), the process of ranking generated samples by reward score and performing supervised fine-tuning on the retained high-reward subset.

If this is right

Alignment of generative models becomes feasible without the instabilities of RL algorithms.
Both language models and diffusion models can be aligned using the same filtering and fine-tuning procedure.
Performance improves on reward-based metrics and other automated evaluations.
Potential reduction in reward hacking compared to full RLHF.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RAFT may require fewer computational resources than RLHF by avoiding online policy optimization.
This method could be extended by iterating the process with updated reward models.
Similar ranking and filtering ideas might apply to other alignment tasks beyond generative models.

Load-bearing premise

A fixed reward model can accurately distinguish desirable from undesirable samples without the model learning to game the reward function during fine-tuning.

What would settle it

If human evaluators rate RAFT-tuned outputs as no better than, or worse than, outputs from the original model or RLHF-tuned versions on preference tasks, the approach would be falsified.

Original abstract

Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently enhancing the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve the model performance in both reward learning and other automated metrics in both large language models and diffusion models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Reward rAnked FineTuning (RAFT) as a simpler alternative to RLHF for aligning generative foundation models. RAFT generates multiple samples per prompt, ranks them with a fixed reward model, retains only the highest-ranked samples, and performs standard supervised fine-tuning on the retained subset. The central claim is that this procedure yields stable improvements in both reward-model scores and other automated metrics for large language models and diffusion models while avoiding the instabilities and reward-hacking problems associated with RLHF.

Significance. If the empirical results are robust, RAFT would be a practically useful simplification of the alignment pipeline: it replaces RL optimization with ordinary SFT on filtered data, lowering computational cost and training instability. Demonstrating gains on both LLMs and diffusion models would further increase its relevance. The approach is not parameter-free (the number of samples generated per prompt remains a free hyper-parameter), but the core idea is straightforward and falsifiable.

major comments (3)

[§3] §3 (Method description): the procedure for sample generation and selection is described only at a high level. No value or range is given for the number of samples drawn per prompt, nor is the precise selection rule (top-k, threshold on reward score, etc.) stated. Because the central claim rests on the quality of the filtered subset, these missing details prevent evaluation of whether the reported gains are reproducible or sensitive to the choice of k.
[§4] §4 (Experiments): no head-to-head comparison against RLHF is presented, nor are there ablations that vary reward-model accuracy or inject controlled noise into the reward model. Without such controls it is impossible to substantiate the claim that RAFT is less prone to reward hacking than RLHF when the reward model is imperfect.
[§5] §5 (Results): the reported improvements in reward scores and automated metrics are stated without effect sizes, number of independent runs, error bars, or statistical tests. The abstract itself supplies no quantitative numbers, making it impossible to judge whether the observed gains exceed what would be expected from simply increasing supervised data volume.

minor comments (2)

[§3] The reward-model notation and the precise definition of the filtered training set could be written with a short equation or pseudocode block for clarity.
[§2] A brief discussion of how RAFT relates to prior work on rejection sampling or best-of-n decoding would help situate the contribution.
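As a sketch of the kind of short equation the first minor comment asks for, the filtered training set could be written as follows. The notation is assumed here and is not taken from the paper: $x$ is a prompt, $\pi_\theta$ the current model, $r$ the fixed reward model, $K$ the number of samples per prompt, and $k$ and $\tau$ are the two candidate selection parameters named in the first major comment.

```latex
% Assumed notation; not the paper's own symbols.
\begin{align*}
y_1,\dots,y_K &\sim \pi_\theta(\cdot \mid x) \\
\mathcal{B}_{\text{top-}k}(x) &= \bigl\{\, y_i : r(x,y_i) \text{ is among the } k \text{ largest of } r(x,y_1),\dots,r(x,y_K) \,\bigr\} \\
\mathcal{B}_{\tau}(x) &= \bigl\{\, y_i : r(x,y_i) \ge \tau \,\bigr\} \\
\mathcal{L}_{\mathrm{SFT}}(\theta') &= -\sum_{x} \sum_{y \in \mathcal{B}(x)} \log \pi_{\theta'}(y \mid x)
\end{align*}
```

Here $\mathcal{B}(x)$ stands for whichever of the two filtered sets is used. With $k = 1$ and the fine-tuning step omitted, the top-$k$ rule reduces to best-of-$K$ sampling, which is the connection the second minor comment points to.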

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's constructive feedback on our manuscript. We address each major comment below and indicate revisions made to strengthen the paper.

point-by-point responses

Referee: §3 (Method description): the procedure for sample generation and selection is described only at a high level. No value or range is given for the number of samples drawn per prompt, nor is the precise selection rule (top-k, threshold on reward score, etc.) stated. Because the central claim rests on the quality of the filtered subset, these missing details prevent evaluation of whether the reported gains are reproducible or sensitive to the choice of k.

Authors: We agree that the original method section provided only a high-level description. In the revised manuscript we now explicitly state that 4 samples are generated per prompt and the single highest-reward sample is retained for fine-tuning. We also report the range of sample counts (2–8) explored in ablations and include pseudocode for the full RAFT procedure to support reproducibility.
revision: yes
Referee: §4 (Experiments): no head-to-head comparison against RLHF is presented, nor are there ablations that vary reward-model accuracy or inject controlled noise into the reward model. Without such controls it is impossible to substantiate the claim that RAFT is less prone to reward hacking than RLHF when the reward model is imperfect.

Authors: We acknowledge the value of direct RLHF comparisons and controlled reward-model ablations. We have added an ablation that injects synthetic noise into reward scores to test robustness under imperfect rewards. A full-scale head-to-head RLHF run was omitted due to prohibitive compute cost; we instead reference published RLHF baselines and discuss this limitation explicitly in the revised text.
revision: partial
Referee: §5 (Results): the reported improvements in reward scores and automated metrics are stated without effect sizes, number of independent runs, error bars, or statistical tests. The abstract itself supplies no quantitative numbers, making it impossible to judge whether the observed gains exceed what would be expected from simply increasing supervised data volume.

Authors: We agree that statistical details were insufficient. The revised results section now reports means and standard deviations over three independent runs, includes effect sizes, and adds t-test p-values. The abstract has been updated with the key quantitative improvements observed on both LLM and diffusion tasks.
revision: yes
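The noise-injection ablation discussed in the second response can be illustrated with a small simulation. This is a hedged sketch, not the ablation described in the rebuttal: it assumes clean reward scores are standard normal, adds independent Gaussian noise, and measures how often the top-1 choice among `k` samples survives the noise. The function name `top1_agreement` and all parameter values are illustrative.

```python
import random


def top1_agreement(k: int = 4, noise_std: float = 0.5,
                   trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials where noisy and clean rewards pick the same top-1 sample."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(trials):
        # Assumed "true" rewards for k candidate samples of one prompt.
        clean = [rng.gauss(0.0, 1.0) for _ in range(k)]
        # Imperfect reward model: true reward plus additive Gaussian noise.
        noisy = [c + rng.gauss(0.0, noise_std) for c in clean]
        best_clean = max(range(k), key=clean.__getitem__)
        best_noisy = max(range(k), key=noisy.__getitem__)
        agree += best_clean == best_noisy
    return agree / trials


for std in (0.0, 0.5, 2.0):
    print(f"noise_std={std}: agreement={top1_agreement(noise_std=std):.3f}")
```

A falling agreement rate as `noise_std` grows shows how quickly the filtered set drifts away from the samples a perfect reward model would have chosen, which is the quantity a robustness ablation would need to track.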

Circularity Check

0 steps flagged

No significant circularity; RAFT uses external reward model as input

full rationale

The paper describes RAFT as a practical procedure that takes a pre-existing reward model as an external input, generates samples from the base model, filters them by ranking with that fixed reward model, and then applies standard supervised fine-tuning to the retained high-reward subset. No equation or derivation reduces the claimed performance gains to a quantity defined by the same fitted reward model in a self-referential loop. Improvements are reported on separate automated metrics beyond the reward score itself, and the method is positioned as an alternative to RLHF without invoking self-citations, uniqueness theorems, or ansatzes that collapse back to the paper's own fitted values. The derivation chain remains self-contained as an empirical training recipe.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the external reward model is sufficiently accurate to identify desirable samples and that ordinary fine-tuning on those samples yields better alignment than RLHF without introducing new failure modes.

free parameters (1)

number of samples generated per prompt
The method requires drawing a sufficient number of samples to allow effective filtering; the exact count is not specified in the abstract.

axioms (1)

domain assumption: Reward model scores correlate with human preferences on the target distribution
The filtering step uses the reward model to decide which samples to keep; if this correlation is weak, the fine-tuning set will contain undesired behavior.
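One simple way to probe this assumption is to compute a rank correlation between reward-model scores and human ratings on a held-out set. The sketch below implements Spearman's coefficient in plain Python; the scores and ratings are made-up illustrative numbers, not data from the paper, and the helper names `ranks` and `spearman` are hypothetical.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical reward-model scores and human ratings for four samples.
reward_scores = [0.9, 0.1, 0.5, 0.7]
human_ratings = [5, 1, 2, 4]
print(spearman(reward_scores, human_ratings))
```

A coefficient near 1 on the target distribution would support the axiom; a value near 0 would suggest the filtering step is selecting samples more or less at random with respect to human preference.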

pith-pipeline@v0.9.0 · 5773 in / 1264 out tokens · 32932 ms · 2026-05-18T00:41:50.623940+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Our studies show that RAFT can effectively improve the model performance in both reward learning and other automated metrics in both large language models and diffusion models."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV · 2025-05 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL · 2026-05 · unverdicted · novelty 7.0

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
cs.CV · 2026-05 · unverdicted · novelty 7.0

PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation
cs.IR · 2026-05 · conditional · novelty 7.0

BLADE uses Bayesian list-wise alignment with dynamic estimation to create a self-evolving target that overcomes limitations of static references in LLM-based recommendation, yielding sustained gains in ranking and com...

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
cs.CV · 2026-03 · unverdicted · novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL · 2026-05 · unverdicted · novelty 6.0

TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
cs.AI · 2026-05 · unverdicted · novelty 6.0

CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.

Response Time Enhances Alignment with Heterogeneous Preferences
cs.LG · 2026-05 · unverdicted · novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

AlignCultura: Towards Culturally Aligned Large Language Models?
cs.CL · 2026-04 · unverdicted · novelty 6.0

Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

Bias at the End of the Score
cs.CV · 2026-04 · unverdicted · novelty 6.0

Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.

Improving Video Generation with Human Feedback
cs.CV · 2025-01 · unverdicted · novelty 6.0

A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

Directly Fine-Tuning Diffusion Models on Differentiable Rewards
cs.CV · 2023-09 · conditional · novelty 6.0

DRaFT fine-tunes diffusion models by differentiating through sampling to maximize rewards, outperforming RL baselines and improving aesthetics on Stable Diffusion 1.4.

Reinforced Self-Training (ReST) for Language Modeling
cs.CL · 2023-08 · unverdicted · novelty 6.0

ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.

ReMedi: Reasoner for Medical Clinical Prediction
cs.CL · 2026-05 · unverdicted · novelty 5.0

ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
cs.AI · 2023-08 · accept · novelty 5.0

Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation
cs.IR · 2026-04 · unverdicted · novelty 4.0

Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.

A Survey of Reinforcement Learning for Large Reasoning Models
cs.CL · 2025-09 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · cited by 16 Pith papers · 35 internal anchors

[4]

On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp.\ 610--623, 2021

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp.\ 610--623, 2021

work page 2021
[7]

Rank analysis of incomplete block designs: I

Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39 0 (3/4): 0 324--345, 1952

work page 1952
[8]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901
[14]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34: 0 8780--8794, 2021

work page 2021
[15]

Lmflow: An extensible toolkit for finetuning and inference of large foundation models

Shizhe Diao, Rui Pan, Hanze Dong, KaShun Shum, Jipeng Zhang, Wei Xiong, and Tong Zhang. Lmflow: An extensible toolkit for finetuning and inference of large foundation models. https://optimalscale.github.io/LMFlow/, 2023

work page 2023
[19]

Scaling laws for reward model overoptimization

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp.\ 10835--10866. PMLR, 2023

work page 2023
[20]

Openllama: An open reproduction of llama, May 2023

Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama

work page 2023
[23]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 0 6840--6851, 2020

work page 2020
[29]

Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pp.\ 5084--5096

Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pp.\ 5084--5096. PMLR, 2021

work page 2021
[30]

Studies in language behavior: A program of research

Wendell Johnson. Studies in language behavior: A program of research. Psychological Monographs, 56 0 (2): 0 1--15, 1944

work page 1944
[33]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp.\ 19274--19286. PMLR, 2023

work page 2023
[35]

Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55 0 (9): 0 1--35, 2023

work page 2023
[36]

Maas, Raymond E

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.\ 142--150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...

work page 2011
[39]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 0 27730--27744, 2022

work page 2022
[41]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

work page 2019
[42]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

work page 2021
[45]

Speech understanding systems: A summary of results of the five-year research effort at carnegie mellon university

Raj Reddy. Speech understanding systems: A summary of results of the five-year research effort at carnegie mellon university. Pittsburgh, Pa, 1977

work page 1977
[46]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10684--10695, 2022

work page 2022
[53]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33: 0 3008--3021, 2020

work page 2020
[57]

High-dimensional statistics: A non-asymptotic viewpoint, volume 48

Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

work page 2019
[60]

Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022 a . URL https://openreview.net/fo...

work page 2022
[64]

Policy finetuning: Bridging sample-efficient offline and online reinforcement learning

Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 34, 2021

work page 2021
[68]

Huggingface , author =

work page
[69]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[70]

Training Diffusion Models with Reinforcement Learning

Training diffusion models with reinforcement learning , author=. arXiv preprint arXiv:2305.13301 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[71]

2019 , publisher=

High-dimensional statistics: A non-asymptotic viewpoint , author=. 2019 , publisher=

work page 2019
[72]

Score-Based Generative Modeling through Stochastic Differential Equations

Score-based generative modeling through stochastic differential equations , author=. arXiv preprint arXiv:2011.13456 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2011
[73]

Denoising Diffusion Implicit Models

Denoising diffusion implicit models , author=. arXiv preprint arXiv:2010.02502 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2010
[74]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[75]

arXiv preprint arXiv:2303.14420 , year=

Better Aligning Text-to-Image Models with Human Preference , author=. arXiv preprint arXiv:2303.14420 , year=

work page arXiv
[76]

Advances in Neural Information Processing Systems , volume=

Maximum likelihood training of score-based diffusion models , author=. Advances in Neural Information Processing Systems , volume=

work page
[77]

LoRA: Low-Rank Adaptation of Large Language Models

Lora: Low-rank adaptation of large language models , author=. arXiv preprint arXiv:2106.09685 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[78]

A General Language Assistant as a Laboratory for Alignment

A general language assistant as a laboratory for alignment , author=. arXiv preprint arXiv:2112.00861 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[79]

Rrhf: Rank responses to align language models with human feedback without tears.arXiv preprint arXiv:2304.05302, 2023

RRHF: Rank Responses to Align Language Models with Human Feedback without tears , author=. arXiv preprint arXiv:2304.05302 , year=

work page arXiv
[80]

GitHub repository , howpublished =

Shizhe Diao and Rui Pan and Hanze Dong and KaShun Shum and Jipeng Zhang and Wei Xiong and Tong Zhang , title =. GitHub repository , howpublished =. 2023 , publisher =

work page 2023
[81]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Chain of thought prompting elicits reasoning in large language models , author=. arXiv preprint arXiv:2201.11903 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[82]

Ilharco, M

Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig , title =. doi:10.5281/zenodo.5143773 , url =

work page doi:10.5281/zenodo.5143773
[83]

Advances in Neural Information Processing Systems , volume=

Diffusion models beat gans on image synthesis , author=. Advances in Neural Information Processing Systems , volume=

work page
[84]

Aligning Text-to-Image Models using Human Feedback

Aligning text-to-image models using human feedback , author=. arXiv preprint arXiv:2302.12192 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[85]

arXiv preprint arXiv:2212.09611 , year=

Optimizing Prompts for Text-to-Image Generation , author=. arXiv preprint arXiv:2212.09611 , year=

work page arXiv
[86]

On the Opportunities and Risks of Foundation Models

On the opportunities and risks of foundation models , author=. arXiv preprint arXiv:2108.07258 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[87]

Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? , author=. Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

work page 2021
[88]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Hierarchical text-conditional image generation with clip latents , author=. arXiv preprint arXiv:2204.06125 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[89]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[90]

Advances in Neural Information Processing Systems , volume=

Denoising diffusion probabilistic models , author=. Advances in Neural Information Processing Systems , volume=

work page
[91]

Finetuned Language Models Are Zero-Shot Learners

Finetuned language models are zero-shot learners , author=. arXiv preprint arXiv:2109.01652 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[92]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Self-Instruct: Aligning Language Model with Self Generated Instructions , author=. arXiv preprint arXiv:2212.10560 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[93]

Advances in Neural Information Processing Systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

work page
[94]

arXiv preprint arXiv:2005.12729 , year=

Implementation matters in deep policy gradients: A case study on ppo and trpo , author=. arXiv preprint arXiv:2005.12729 , year=

work page arXiv 2005
[95]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[96]

arXiv preprint arXiv:1907.01752 , year=

On the weaknesses of reinforcement learning for neural machine translation , author=. arXiv preprint arXiv:1907.01752 , year=

work page arXiv 1907
[97]

ArXiv , year=

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , author=. ArXiv , year=

work page
[98]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

work page
[99]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[100]

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. arXiv preprint arXiv:2009.11462.
[101]

Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection. ACL.
[102]

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, and Nathan Lambert. GitHub repository, 2020.
[103]

Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.
[104]

Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
[105]

Scaling laws for reward model overoptimization. International Conference on Machine Learning, 2023.
[106]

Agile autonomous driving using end-to-end deep imitation learning. arXiv preprint arXiv:1709.07174.
[107]

Recursively Summarizing Books with Human Feedback. arXiv preprint arXiv:2109.10862.
[108]

Fine-Tuning Language Models from Human Preferences. arXiv preprint arXiv:1909.08593.
[109]

Jerome H. Friedman. The Annals of Statistics, 2001.
[110]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint arXiv:2211.05100.
[111]

PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311.
[112]

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. arXiv preprint arXiv:2201.11990.
[113]

Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556.
[114]

Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.
[115]

Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575.
[116]

AI safety via debate. arXiv preprint arXiv:1805.00899.
[117]

Learning to summarize with human feedback. Advances in Neural Information Processing Systems.
[118]

Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
[119]

Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
[120]

WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
[121]

Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 1952.
[122]

GPT-4 Technical Report. ArXiv.
[123]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027.
[124]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.

Showing first 80 references.