pith. machine review for the scientific record.

arxiv: 2309.00267 · v3 · submitted 2023-09-01 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 21:29 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords RLAIF · RLHF · AI feedback · reward model · language model alignment · summarization · dialogue generation · self-improvement

The pith

Reinforcement learning from AI feedback matches human feedback performance for aligning large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that training reward models on preferences labeled by an off-the-shelf large language model produces results comparable to training them on human preferences. This holds for summarization, helpful dialogue generation, and harmless dialogue generation. The approach reduces the need for expensive human labeling, which currently limits how far alignment techniques can scale. The authors further introduce direct-RLAIF, which skips reward-model training entirely by pulling rewards straight from the language model during reinforcement learning, and achieves better results than the standard RLAIF setup.
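The labeling step described above can be made concrete. Below is a minimal sketch of collecting AI preference labels from an off-the-shelf LLM to build reward-model training data; the `llm_complete` helper, the prompt template, and the parsing are assumptions for illustration, not the authors' exact pipeline.

```python
# Hedged sketch of AI preference labeling for RLAIF-style reward-model data.
# `llm_complete` is a hypothetical helper returning the labeler LLM's text output;
# the prompt template and parsing are illustrative, not the paper's exact setup.

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an off-the-shelf LLM via any completion API."""
    raise NotImplementedError

LABEL_PROMPT = (
    "You are rating two summaries of the same text.\n"
    "Text: {context}\n\nSummary 1: {a}\n\nSummary 2: {b}\n\n"
    "Which summary is better? Answer with '1' or '2'."
)

def ai_preference_label(context: str, response_a: str, response_b: str) -> int:
    """Return 0 if the labeler prefers response_a, 1 if it prefers response_b."""
    answer = llm_complete(LABEL_PROMPT.format(context=context, a=response_a, b=response_b))
    return 0 if answer.strip().startswith("1") else 1

def build_preference_dataset(examples):
    """examples: iterable of (context, response_a, response_b) triples."""
    dataset = []
    for context, a, b in examples:
        label = ai_preference_label(context, a, b)
        dataset.append({
            "context": context,
            "chosen": (a, b)[label],      # preferred response
            "rejected": (b, a)[label],    # dispreferred response
        })
    return dataset
```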

Core claim

Across summarization, helpful dialogue, and harmless dialogue tasks, RLAIF achieves comparable performance to RLHF. RLAIF can also outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy or the exact same checkpoint. Direct-RLAIF obtains rewards directly from an off-the-shelf LLM during RL without a separate reward model and outperforms canonical RLAIF.

What carries the argument

Reward model trained on AI-generated preferences that substitutes for human labels in the standard RLHF pipeline, plus direct-RLAIF that uses LLM judgments as immediate rewards.
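Direct-RLAIF, as described here, replaces the learned reward model with live judgments from the LLM during RL. A minimal sketch of how such a reward could be obtained and normalized inside a policy-gradient loop follows; `llm_score`, the 1-10 rating scale, and the normalization are assumptions for illustration, not the paper's exact prompt or scaling.

```python
# Hedged sketch: using an off-the-shelf LLM as the reward signal during RL (d-RLAIF style).
# `llm_score` is a hypothetical judge call; the 1-10 scale and normalization are common
# choices, not necessarily what the authors used.

def llm_score(context: str, response: str) -> float:
    """Placeholder: ask the judge LLM for a 1-10 quality rating and parse it."""
    raise NotImplementedError

def direct_rlaif_reward(context: str, response: str) -> float:
    """Map the judge's 1-10 rating to a roughly zero-centered reward in [-1, 1]."""
    rating = llm_score(context, response)   # e.g. 7.0
    return (rating - 5.5) / 4.5             # 1 -> -1.0, 10 -> +1.0

# Inside an RL loop (policy sampling and the PPO/REINFORCE update are elided):
#   for prompt in batch:
#       response = policy.sample(prompt)
#       reward = direct_rlaif_reward(prompt, response)
#       ... add a KL penalty to the initial policy and apply the RL update ...
```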

If this is right

  • Alignment of large models can proceed with far lower human annotation budgets.
  • Models can improve using feedback from models of the same size or the same checkpoint.
  • The RL pipeline can be simplified by removing the reward-model training stage.
  • Iterative self-alignment becomes practical without repeated human data collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fully automated alignment loops could become feasible, reducing human oversight in iterative training.
  • The same substitution might work in other preference-learning settings such as robotics or code generation.
  • If the quality gap closes further, human feedback could shift from primary data source to occasional validation set.

Load-bearing premise

An off-the-shelf large language model can generate preference labels that are high-quality enough to replace human judgments when training the reward model.

What would settle it

A controlled human evaluation in which users consistently prefer responses from RLHF-trained models over RLAIF-trained models on the same tasks by a clear margin.

read the original abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that RLAIF—training a reward model on preferences labeled by an off-the-shelf LLM—achieves performance comparable to RLHF across summarization, helpful dialogue, and harmless dialogue tasks. It further shows that RLAIF can exceed a supervised fine-tuned baseline even when the AI labeler is the same size or identical checkpoint as the policy, and introduces direct-RLAIF (d-RLAIF), which bypasses reward-model training by querying the LLM for rewards during PPO and reports superior results to canonical RLAIF.

Significance. If the empirical parity and d-RLAIF gains hold under rigorous controls, the work is significant because it directly addresses the data-scalability bottleneck of RLHF. Demonstrating that AI feedback can substitute for human preferences on both helpfulness and harmlessness, plus the self-improvement result with same-size labelers, would materially lower the cost of alignment and enable larger-scale iterative training.

major comments (3)
  1. [§4.2] §4.2 (harmless dialogue results): the claim of RLAIF–RLHF parity rests on the unverified assumption that the off-the-shelf LLM’s preference judgments are high-fidelity substitutes for human judgments on safety; no quantitative agreement rate, bias analysis, or error breakdown between AI and human labels is provided, so the observed parity could reflect shared model artifacts rather than true alignment (a sketch of such an agreement audit follows this list).
  2. [§5.1] §5.1 (d-RLAIF description): bypassing the reward model by feeding LLM scores directly into PPO introduces non-stationary and potentially high-variance rewards; the manuscript reports superior performance but contains no ablation on query frequency, temperature, or reward normalization, leaving open whether gains arise from better signal or from optimization artifacts.
  3. [Table 3] Table 3 (human evaluation scores): margins between RLAIF and RLHF are small on helpfulness; without reported standard errors, number of annotators, or statistical significance tests, the “comparable performance” conclusion is not yet statistically supported.
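The agreement audit requested in major comment 1 is straightforward to compute on a doubly labeled subset: raw AI-human agreement plus Cohen's kappa to correct for chance. A minimal sketch under the assumption of binary preference labels on shared items; the variable names are hypothetical.

```python
# Hedged sketch of an AI-vs-human label agreement audit (not from the paper).
# `ai_labels` and `human_labels` are hypothetical parallel lists of 0/1 preferences
# on the same (prompt, response-pair) items.

from collections import Counter

def agreement_and_kappa(ai_labels, human_labels):
    assert len(ai_labels) == len(human_labels) and ai_labels
    n = len(ai_labels)
    observed = sum(a == h for a, h in zip(ai_labels, human_labels)) / n
    # Chance agreement from each labeler's marginal label frequencies.
    ai_counts, human_counts = Counter(ai_labels), Counter(human_labels)
    expected = sum((ai_counts[c] / n) * (human_counts[c] / n) for c in (0, 1))
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa
```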
minor comments (2)
  1. [§3.1] §3.1: the preference-loss equation is described in prose but would benefit from an explicit mathematical statement to clarify the exact training objective used for the AI-labeled reward model (the standard pairwise form is sketched after this list).
  2. [Figure 4] Figure 4: training curves for d-RLAIF lack error bands or multiple seeds, making stability comparisons with canonical RLAIF difficult to assess.
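For reference on minor comment 1, the pairwise preference loss conventionally used to train RLHF-style reward models is the Bradley-Terry cross-entropy below; whether the paper's AI-labeled variant uses exactly this form (e.g., hard versus soft labels) is an assumption here, not a quote from the manuscript.

```latex
% Conventional pairwise preference loss for a reward model r_\phi (assumed form).
% x: prompt; y_w: preferred response; y_l: rejected response; \sigma: logistic function.
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]
```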

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (harmless dialogue results): the claim of RLAIF–RLHF parity rests on the unverified assumption that the off-the-shelf LLM’s preference judgments are high-fidelity substitutes for human judgments on safety; no quantitative agreement rate, bias analysis, or error breakdown between AI and human labels is provided, so the observed parity could reflect shared model artifacts rather than true alignment.

    Authors: We agree that a direct quantitative comparison between the LLM labeler and human judgments on safety preferences would provide stronger support for the claim. Although the primary evidence for parity comes from downstream human evaluations of the trained policies (which are independent of the label source), we will add an analysis of agreement rates, bias, and error types between AI and human labels on the harmless dialogue preference data in the revised manuscript. revision: yes

  2. Referee: [§5.1] §5.1 (d-RLAIF description): bypassing the reward model by feeding LLM scores directly into PPO introduces non-stationary and potentially high-variance rewards; the manuscript reports superior performance but contains no ablation on query frequency, temperature, or reward normalization, leaving open whether gains arise from better signal or from optimization artifacts.

    Authors: We acknowledge that the current manuscript lacks ablations on these design choices for d-RLAIF. In the revision we will add experiments varying query frequency, temperature, and reward normalization to demonstrate that the reported gains are robust and not attributable to optimization artifacts. revision: yes

  3. Referee: [Table 3] Table 3 (human evaluation scores): margins between RLAIF and RLHF are small on helpfulness; without reported standard errors, number of annotators, or statistical significance tests, the “comparable performance” conclusion is not yet statistically supported.

    Authors: We agree that statistical details are necessary to support the comparability claims. We will revise Table 3 to report the number of annotators per comparison, standard errors, and results of statistical significance tests (e.g., bootstrap or paired tests) between RLAIF and RLHF conditions. revision: yes
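The kind of test the authors commit to in response 3 can be pictured with a short paired bootstrap over per-comparison win indicators; the data layout (one 0/1 entry per human comparison, ties dropped) is an assumption for illustration, not the paper's evaluation protocol.

```python
# Hedged sketch of a paired bootstrap test for "RLAIF vs. RLHF win rate != 50%".
# `wins` is a hypothetical list of per-comparison indicators: 1 if annotators preferred
# the RLAIF response, 0 if they preferred the RLHF response (ties dropped for simplicity).

import random

def bootstrap_win_rate(wins, n_resamples=10_000, seed=0):
    rng = random.Random(seed)
    n = len(wins)
    estimates = []
    for _ in range(n_resamples):
        sample = [wins[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(sample) / n)
    estimates.sort()
    point = sum(wins) / n
    lo = estimates[int(0.025 * n_resamples)]
    hi = estimates[int(0.975 * n_resamples)]
    return point, (lo, hi)   # parity is plausible if the 95% CI contains 0.5
```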

Circularity Check

0 steps flagged

No circularity: empirical performance comparisons only

full rationale

The paper is an empirical study that trains reward models on LLM-generated preferences and reports direct experimental outcomes (win rates, human evaluations) for RLAIF versus RLHF on summarization and dialogue tasks. No equations, derivations, or predictions are claimed; results are obtained by running PPO with the respective reward signals and measuring against held-out human preferences. The central assumption (LLM preferences as viable substitutes) is tested rather than derived, and all comparisons are to external baselines, leaving the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on the empirical validity of AI feedback substituting for human feedback, with no new theoretical axioms or invented entities.

axioms (1)
  • domain assumption LLM-generated preferences can approximate human preferences sufficiently for alignment.
    This is the core assumption tested in the experiments.

pith-pipeline@v0.9.0 · 5541 in / 1099 out tokens · 33369 ms · 2026-05-15T21:29:18.481786+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  2. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  3. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  4. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  5. BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

    cs.CV 2026-05 unverdicted novelty 6.0

    BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on...

  6. Common-agency Games for Multi-Objective Test-Time Alignment

    cs.GT 2026-05 unverdicted novelty 6.0

    CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.

  7. WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 6.0

    WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much lar...

  8. GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

    cs.CL 2026-04 unverdicted novelty 6.0

    GRASP improves multimodal sarcasm target identification by anchoring visual regions in grounded chain-of-thought reasoning and using dual-stage optimization on a new balanced dataset.

  9. Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

    cs.OS 2026-04 unverdicted novelty 6.0

    Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTF...

  10. HybridFlow: A Flexible and Efficient RLHF Framework

    cs.LG 2024-09 unverdicted novelty 6.0

    HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

  11. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    cs.LG 2024-01 unverdicted novelty 6.0

    SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...

  12. Probably Approximately Consensus: On the Learning Theory of Finding Common Ground

    cs.LG 2026-04 unverdicted novelty 5.0

    Models consensus as a PAC-learnable interval in embedded 1D opinion space via ERM that maximizes expected agreement over an issue distribution.

  13. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

  14. OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 5.0

    OOM-RL aligns multi-agent LLM systems for software engineering by using real financial market losses as an un-hackable negative gradient, resulting in a mature-phase annualized Sharpe ratio of 2.06 via a strict test-d...

  15. Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications

    cs.CL 2026-05 unverdicted novelty 4.0

    RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.

  16. ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

    cs.SE 2026-05 unverdicted novelty 4.0

    ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.

  17. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

  18. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 18 Pith papers · 14 internal anchors

  1. [3] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback, 2022.
  2. [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
  3. [6] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
  4. [8] Dong, H., Xiong, W., Goyal, D., Zhang, Y., Chow, W., Pan, R., Diao, S., Zhang, J., Shum, K., and Zhang, T. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023. https://openreview.net/forum?id=m7p5O7zblY
  5. [9] Ethayarajh, K., Choi, Y., and Swayamdipta, S. Understanding dataset difficulty with V-usable information. In Proceedings of the 39th International Conference on Machine Learning, PMLR 162:5988-6008, 2022.
  6. [10] Everitt, T. and Hutter, M. Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence: 9th International Conference (AGI 2016), pp. 12-22. Springer, 2016.
  7. [15] Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pp. 2160-2169. PMLR, 2019.
  8. [18] Google. AI Platform Data Labeling Service pricing. https://cloud.google.com/ai-platform/data-labeling/pricing#labeling_costs, 2023. Accessed 2023-09-28.
  9. [19] Google, R. A., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. PaLM 2 technical report, 2023.
  10. [20] Howard, R. A. Dynamic Programming and Markov Processes. John Wiley, 1960.
  11. [22] Jaques, N., Gu, S., Bahdanau, D., Hernández-Lobato, J. M., Turner, R. E., and Eck, D. Sequence Tutor: Conservative fine-tuning of sequence generation models with KL-control. In International Conference on Machine Learning, pp. 1645-1654. PMLR, 2017.
  12. [24] Kendall, M. G. and Smith, B. B. The problem of m rankings. The Annals of Mathematical Statistics, 10(3):275-287, 1939. doi:10.1214/aoms/1177732186.
  13. [25] Kwon, M., Xie, S. M., Bullard, K., and Sadigh, D. Reward design with language models. In The Eleventh International Conference on Learning Representations, 2022.
  14. [29] Manyika, J. An overview of Bard: an early experiment with generative AI. https://ai.google/static/documents/google-about-bard.pdf, 2023. Accessed 2023-08-23.
  15. [30] Meng, Y., Michalski, M., Huang, J., Zhang, Y., Abdelzaher, T., and Han, J. Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, pp. 24457-24477. PMLR, 2023.
  16. [31] Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048-11064, 2022.
  17. [33] OpenAI. GPT-4 technical report, 2023a.
  18. [34] OpenAI. OpenAI pricing. https://openai.com/pricing, 2023b. Accessed 2023-09-28.
  19. [35] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.
  20. [37] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  21. [41] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008-3021, 2020.
  22. [42] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
  23. [46] Wang, S., Liu, Y., Xu, Y., Zhu, C., and Zeng, M. Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 4195-4205, 2021a.
  24. [47] Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022b.
  25. [49] Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
  26. [50] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.
  27. [51] Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.
  28. [52] Wu, L., Tian, F., Qin, T., Lai, J., and Liu, T.-Y. A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3612-3621, 2018.
  29. [53] Wu, Y. and Hu, B. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, p. 5602, 2018.
  30. [55] Yang, K., Klein, D., Celikyilmaz, A., Peng, N., and Tian, Y. RLCD: Reinforcement learning from contrast distillation for language model alignment, 2023.
  31. [57] Aho, A. V. and Ullman, J. D., 1972.
  32. [58] Publications Manual, 1983.
  33. [59] Chandra, A. K., Kozen, D. C., and Stockmeyer, L. J., 1981. doi:10.1145/322234.322243.
  34. [60] Andrew, G. and Gao, J. Scalable training of …
  35. [61] Gusfield, D., 1997.
  36. [62] Rasooli, M. S. and Tetreault, J. R. Computing Research Repository, 2015.
  37. [63] Ando, R. K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research.
  38. [64] Learning to summarize with human feedback. Advances in Neural Information Processing Systems.
  39. [65] Constitutional AI: Harmlessness from AI feedback, 2022.
  40. [66] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
  41. [67] Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, arXiv:1804.04235, 2018.
  42. [68] Mnih, V., et al. Asynchronous methods for deep reinforcement learning. arXiv:1602.01783, 2016.
  43. [69] Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
  44. [70] PaLM 2 technical report, 2023.
  45. [71] Fan, A., Lewis, M., and Dauphin, Y. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018. doi:10.18653/v1/P18-1082.
  46. [72] LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
  47. [73] PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  48. [74] Language models are few-shot learners. Advances in Neural Information Processing Systems.
  49. [75] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  50. [76] Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  51. [77] WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
  52. [78] Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
  53. [79] Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
  54. [80] Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv preprint arXiv:2307.16039.
  55. [81] Reward learning for efficient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894.
  56. [82] A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  57. [83] Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.
  58. [84] GPT-4 technical report, 2023.
  59. [85] Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  60. [86] An overview of Bard: an early experiment with generative AI, 2023.
  61. [87] OpenAI pricing, 2023.
  62. [88] AI Platform Data Labeling Service pricing, 2023.
  63. [89] Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021.
  64. [90] ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.
  65. [91] Ding, B., Qin, C., Liu, L., Chia, Y. K., Li, B., Joty, S., and Bing, L. Is GPT-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. doi:10.18653/v1/2023.acl-long.626.
  66. [92] RLCD: Reinforcement learning from contrast distillation for language model alignment, 2023.
  67. [93] Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence: 9th International Conference (AGI 2016), 2016.
  68. [94] Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
  69. [95] A natural policy gradient. Advances in Neural Information Processing Systems.
  70. [96] Factually consistent summarization via reinforcement learning with textual entailment feedback. arXiv preprint arXiv:2306.00186.
  71. [97] Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  72. [98] Reward design with language models. In The Eleventh International Conference on Learning Representations.
  73. [99] Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483.
  74. [100] News summarization and evaluation in the era of GPT-3. arXiv preprint arXiv:2209.12356.
  75. [101] Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
  76. [102] Scaling laws for reward model overoptimization. In International Conference on Machine Learning, 2023.
  77. [103] Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, 2023.
  78. [104] Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
  79. [105] Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021. doi:10.18653/v1/2021.findings-acl.84.
  80. [106] Towards zero-label language learning. arXiv preprint arXiv:2109.09193.

Showing first 80 references.