pith. machine review for the scientific record.

arxiv: 2309.00267 · v3 · submitted 2023-09-01 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 21:29 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords RLAIF · RLHF · AI feedback · reward model · language model alignment · summarization · dialogue generation · self-improvement

The pith

Reinforcement learning from AI feedback matches human feedback performance for aligning large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that training reward models on preferences labeled by an off-the-shelf large language model produces results comparable to training them on human preferences. This holds for summarization, helpful dialogue generation, and harmless dialogue generation. The approach reduces the need for expensive human labeling, which currently limits how far alignment techniques can scale. The authors further introduce direct-RLAIF, which skips reward-model training entirely by pulling rewards straight from the language model during reinforcement learning, and achieves better results than the standard RLAIF setup.
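The labeling step described above can be made concrete. Below is a minimal sketch of collecting AI preference labels from an off-the-shelf LLM to build reward-model training data; the `llm_complete` helper, the prompt template, and the parsing are assumptions for illustration, not the authors' exact pipeline.

```python
# Hedged sketch of AI preference labeling for RLAIF-style reward-model data.
# `llm_complete` is a hypothetical helper returning the labeler LLM's text output;
# the prompt template and parsing are illustrative, not the paper's exact setup.

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an off-the-shelf LLM via any completion API."""
    raise NotImplementedError

LABEL_PROMPT = (
    "You are rating two summaries of the same text.\n"
    "Text: {context}\n\nSummary 1: {a}\n\nSummary 2: {b}\n\n"
    "Which summary is better? Answer with '1' or '2'."
)

def ai_preference_label(context: str, response_a: str, response_b: str) -> int:
    """Return 0 if the labeler prefers response_a, 1 if it prefers response_b."""
    answer = llm_complete(LABEL_PROMPT.format(context=context, a=response_a, b=response_b))
    return 0 if answer.strip().startswith("1") else 1

def build_preference_dataset(examples):
    """examples: iterable of (context, response_a, response_b) triples."""
    dataset = []
    for context, a, b in examples:
        label = ai_preference_label(context, a, b)
        dataset.append({
            "context": context,
            "chosen": (a, b)[label],      # preferred response
            "rejected": (b, a)[label],    # dispreferred response
        })
    return dataset
```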

Core claim

Across summarization, helpful dialogue, and harmless dialogue tasks, RLAIF achieves comparable performance to RLHF. RLAIF can also outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy or the exact same checkpoint. Direct-RLAIF obtains rewards directly from an off-the-shelf LLM during RL without a separate reward model and outperforms canonical RLAIF.

What carries the argument

Reward model trained on AI-generated preferences that substitutes for human labels in the standard RLHF pipeline, plus direct-RLAIF that uses LLM judgments as immediate rewards.
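Direct-RLAIF, as described here, replaces the learned reward model with live judgments from the LLM during RL. A minimal sketch of how such a reward could be obtained and normalized inside a policy-gradient loop follows; `llm_score`, the 1-10 rating scale, and the normalization are assumptions for illustration, not the paper's exact prompt or scaling.

```python
# Hedged sketch: using an off-the-shelf LLM as the reward signal during RL (d-RLAIF style).
# `llm_score` is a hypothetical judge call; the 1-10 scale and normalization are common
# choices, not necessarily what the authors used.

def llm_score(context: str, response: str) -> float:
    """Placeholder: ask the judge LLM for a 1-10 quality rating and parse it."""
    raise NotImplementedError

def direct_rlaif_reward(context: str, response: str) -> float:
    """Map the judge's 1-10 rating to a roughly zero-centered reward in [-1, 1]."""
    rating = llm_score(context, response)   # e.g. 7.0
    return (rating - 5.5) / 4.5             # 1 -> -1.0, 10 -> +1.0

# Inside an RL loop (policy sampling and the PPO/REINFORCE update are elided):
#   for prompt in batch:
#       response = policy.sample(prompt)
#       reward = direct_rlaif_reward(prompt, response)
#       ... add a KL penalty to the initial policy and apply the RL update ...
```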

If this is right

  • Alignment of large models can proceed with far lower human annotation budgets.
  • Models can improve using feedback from models of the same size or the same checkpoint.
  • The RL pipeline can be simplified by removing the reward-model training stage.
  • Iterative self-alignment becomes practical without repeated human data collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Fully automated alignment loops could become feasible, reducing human oversight in iterative training.
  • The same substitution might work in other preference-learning settings such as robotics or code generation.
  • If the quality gap closes further, human feedback could shift from primary data source to occasional validation set.

Load-bearing premise

An off-the-shelf large language model can generate preference labels that are high-quality enough to replace human judgments when training the reward model.

What would settle it

A controlled human evaluation in which users consistently prefer responses from RLHF-trained models over RLAIF-trained models on the same tasks by a clear margin.

read the original abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that RLAIF—training a reward model on preferences labeled by an off-the-shelf LLM—achieves performance comparable to RLHF across summarization, helpful dialogue, and harmless dialogue tasks. It further shows that RLAIF can exceed a supervised fine-tuned baseline even when the AI labeler is the same size or identical checkpoint as the policy, and introduces direct-RLAIF (d-RLAIF), which bypasses reward-model training by querying the LLM for rewards during PPO and reports superior results to canonical RLAIF.

Significance. If the empirical parity and d-RLAIF gains hold under rigorous controls, the work is significant because it directly addresses the data-scalability bottleneck of RLHF. Demonstrating that AI feedback can substitute for human preferences on both helpfulness and harmlessness, plus the self-improvement result with same-size labelers, would materially lower the cost of alignment and enable larger-scale iterative training.

major comments (3)
  1. [§4.2] §4.2 (harmless dialogue results): the claim of RLAIF–RLHF parity rests on the unverified assumption that the off-the-shelf LLM’s preference judgments are high-fidelity substitutes for human judgments on safety; no quantitative agreement rate, bias analysis, or error breakdown between AI and human labels is provided, so the observed parity could reflect shared model artifacts rather than true alignment (a sketch of such an agreement audit follows this list).
  2. [§5.1] §5.1 (d-RLAIF description): bypassing the reward model by feeding LLM scores directly into PPO introduces non-stationary and potentially high-variance rewards; the manuscript reports superior performance but contains no ablation on query frequency, temperature, or reward normalization, leaving open whether gains arise from better signal or from optimization artifacts.
  3. [Table 3] Table 3 (human evaluation scores): margins between RLAIF and RLHF are small on helpfulness; without reported standard errors, number of annotators, or statistical significance tests, the “comparable performance” conclusion is not yet statistically supported.
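The agreement audit requested in major comment 1 is straightforward to compute on a doubly labeled subset: raw AI-human agreement plus Cohen's kappa to correct for chance. A minimal sketch under the assumption of binary preference labels on shared items; the variable names are hypothetical.

```python
# Hedged sketch of an AI-vs-human label agreement audit (not from the paper).
# `ai_labels` and `human_labels` are hypothetical parallel lists of 0/1 preferences
# on the same (prompt, response-pair) items.

from collections import Counter

def agreement_and_kappa(ai_labels, human_labels):
    assert len(ai_labels) == len(human_labels) and ai_labels
    n = len(ai_labels)
    observed = sum(a == h for a, h in zip(ai_labels, human_labels)) / n
    # Chance agreement from each labeler's marginal label frequencies.
    ai_counts, human_counts = Counter(ai_labels), Counter(human_labels)
    expected = sum((ai_counts[c] / n) * (human_counts[c] / n) for c in (0, 1))
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa
```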
minor comments (2)
  1. [§3.1] §3.1: the preference-loss equation is described in prose but would benefit from an explicit mathematical statement to clarify the exact training objective used for the AI-labeled reward model (the standard pairwise form is sketched after this list).
  2. [Figure 4] Figure 4: training curves for d-RLAIF lack error bands or multiple seeds, making stability comparisons with canonical RLAIF difficult to assess.
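For reference on minor comment 1, the pairwise preference loss conventionally used to train RLHF-style reward models is the Bradley-Terry cross-entropy below; whether the paper's AI-labeled variant uses exactly this form (e.g., hard versus soft labels) is an assumption here, not a quote from the manuscript.

```latex
% Conventional pairwise preference loss for a reward model r_\phi (assumed form).
% x: prompt; y_w: preferred response; y_l: rejected response; \sigma: logistic function.
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]
```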

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (harmless dialogue results): the claim of RLAIF–RLHF parity rests on the unverified assumption that the off-the-shelf LLM’s preference judgments are high-fidelity substitutes for human judgments on safety; no quantitative agreement rate, bias analysis, or error breakdown between AI and human labels is provided, so the observed parity could reflect shared model artifacts rather than true alignment.

    Authors: We agree that a direct quantitative comparison between the LLM labeler and human judgments on safety preferences would provide stronger support for the claim. Although the primary evidence for parity comes from downstream human evaluations of the trained policies (which are independent of the label source), we will add an analysis of agreement rates, bias, and error types between AI and human labels on the harmless dialogue preference data in the revised manuscript. revision: yes

  2. Referee: [§5.1] §5.1 (d-RLAIF description): bypassing the reward model by feeding LLM scores directly into PPO introduces non-stationary and potentially high-variance rewards; the manuscript reports superior performance but contains no ablation on query frequency, temperature, or reward normalization, leaving open whether gains arise from better signal or from optimization artifacts.

    Authors: We acknowledge that the current manuscript lacks ablations on these design choices for d-RLAIF. In the revision we will add experiments varying query frequency, temperature, and reward normalization to demonstrate that the reported gains are robust and not attributable to optimization artifacts. revision: yes

  3. Referee: [Table 3] Table 3 (human evaluation scores): margins between RLAIF and RLHF are small on helpfulness; without reported standard errors, number of annotators, or statistical significance tests, the “comparable performance” conclusion is not yet statistically supported.

    Authors: We agree that statistical details are necessary to support the comparability claims. We will revise Table 3 to report the number of annotators per comparison, standard errors, and results of statistical significance tests (e.g., bootstrap or paired tests) between RLAIF and RLHF conditions. revision: yes
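The kind of test the authors commit to in response 3 can be pictured with a short paired bootstrap over per-comparison win indicators; the data layout (one 0/1 entry per human comparison, ties dropped) is an assumption for illustration, not the paper's evaluation protocol.

```python
# Hedged sketch of a paired bootstrap test for "RLAIF vs. RLHF win rate != 50%".
# `wins` is a hypothetical list of per-comparison indicators: 1 if annotators preferred
# the RLAIF response, 0 if they preferred the RLHF response (ties dropped for simplicity).

import random

def bootstrap_win_rate(wins, n_resamples=10_000, seed=0):
    rng = random.Random(seed)
    n = len(wins)
    estimates = []
    for _ in range(n_resamples):
        sample = [wins[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(sample) / n)
    estimates.sort()
    point = sum(wins) / n
    lo = estimates[int(0.025 * n_resamples)]
    hi = estimates[int(0.975 * n_resamples)]
    return point, (lo, hi)   # parity is plausible if the 95% CI contains 0.5
```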

Circularity Check

0 steps flagged

No circularity: empirical performance comparisons only

full rationale

The paper is an empirical study that trains reward models on LLM-generated preferences and reports direct experimental outcomes (win rates, human evaluations) for RLAIF versus RLHF on summarization and dialogue tasks. No equations, derivations, or predictions are claimed; results are obtained by running PPO with the respective reward signals and measuring against held-out human preferences. The central assumption (LLM preferences as viable substitutes) is tested rather than derived, and all comparisons are to external baselines, leaving the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on the empirical validity of AI feedback substituting for human feedback, with no new theoretical axioms or invented entities.

axioms (1)
  • domain assumption LLM-generated preferences can approximate human preferences sufficiently for alignment.
    This is the core assumption tested in the experiments.

pith-pipeline@v0.9.0 · 5541 in / 1099 out tokens · 33369 ms · 2026-05-15T21:29:18.481786+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  2. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  3. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  4. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  5. BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

    cs.CV 2026-05 unverdicted novelty 6.0

    BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on...

  6. Common-agency Games for Multi-Objective Test-Time Alignment

    cs.GT 2026-05 unverdicted novelty 6.0

    CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.

  7. WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 6.0

    WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much lar...

  8. GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification

    cs.CL 2026-04 unverdicted novelty 6.0

    GRASP improves multimodal sarcasm target identification by anchoring visual regions in grounded chain-of-thought reasoning and using dual-stage optimization on a new balanced dataset.

  9. Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

    cs.OS 2026-04 unverdicted novelty 6.0

    Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTF...

  10. HybridFlow: A Flexible and Efficient RLHF Framework

    cs.LG 2024-09 unverdicted novelty 6.0

    HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

  11. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    cs.LG 2024-01 unverdicted novelty 6.0

    SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...

  12. Probably Approximately Consensus: On the Learning Theory of Finding Common Ground

    cs.LG 2026-04 unverdicted novelty 5.0

    Models consensus as a PAC-learnable interval in embedded 1D opinion space via ERM that maximizes expected agreement over an issue distribution.

  13. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

  14. OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 5.0

    OOM-RL aligns multi-agent LLM systems for software engineering by using real financial market losses as an un-hackable negative gradient, resulting in a mature-phase annualized Sharpe ratio of 2.06 via a strict test-d...

  15. Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications

    cs.CL 2026-05 unverdicted novelty 4.0

    RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.

  16. ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

    cs.SE 2026-05 unverdicted novelty 4.0

    ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.

  17. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

  18. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 18 Pith papers · 14 internal anchors

  1. [3] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback, 2022.
  2. [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
  3. [6] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
  4. [8] Dong, H., Xiong, W., Goyal, D., Zhang, Y., Chow, W., Pan, R., Diao, S., Zhang, J., Shum, K., and Zhang, T. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023. https://openreview.net/forum?id=m7p5O7zblY
  5. [9] Ethayarajh, K., Choi, Y., and Swayamdipta, S. Understanding dataset difficulty with V-usable information. In Proceedings of the 39th International Conference on Machine Learning, PMLR 162:5988-6008, 2022.
  6. [10] Everitt, T. and Hutter, M. Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence: 9th International Conference (AGI 2016), pp. 12-22. Springer, 2016.
  7. [15] Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pp. 2160-2169. PMLR, 2019.
  8. [18] Google. AI Platform Data Labeling Service pricing. https://cloud.google.com/ai-platform/data-labeling/pricing#labeling_costs, 2023. Accessed 2023-09-28.
  9. [19] Google, R. A., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. PaLM 2 technical report, 2023.
  10. [20] Howard, R. A. Dynamic Programming and Markov Processes. John Wiley, 1960.
  11. [22] Jaques, N., Gu, S., Bahdanau, D., Hernández-Lobato, J. M., Turner, R. E., and Eck, D. Sequence Tutor: Conservative fine-tuning of sequence generation models with KL-control. In International Conference on Machine Learning, pp. 1645-1654. PMLR, 2017.
  12. [24] Kendall, M. G. and Smith, B. B. The problem of m rankings. The Annals of Mathematical Statistics, 10(3):275-287, 1939. doi:10.1214/aoms/1177732186.
  13. [25] Kwon, M., Xie, S. M., Bullard, K., and Sadigh, D. Reward design with language models. In The Eleventh International Conference on Learning Representations, 2022.
  14. [29] Manyika, J. An overview of Bard: an early experiment with generative AI. https://ai.google/static/documents/google-about-bard.pdf, 2023. Accessed 2023-08-23.
  15. [30] Meng, Y., Michalski, M., Huang, J., Zhang, Y., Abdelzaher, T., and Han, J. Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, pp. 24457-24477. PMLR, 2023.
  16. [31] Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048-11064, 2022.
  17. [33] OpenAI. GPT-4 technical report, 2023a.
  18. [34] OpenAI. OpenAI pricing. https://openai.com/pricing, 2023b. Accessed 2023-09-28.
  19. [35] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.
  20. [37] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  21. [41] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008-3021, 2020.
  22. [42] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
  23. [46] Wang, S., Liu, Y., Xu, Y., Zhu, C., and Zeng, M. Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 4195-4205, 2021a.
  24. [47] Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022b.
  25. [49] Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
  26. [50] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.
  27. [51] Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.
  28. [52] Wu, L., Tian, F., Qin, T., Lai, J., and Liu, T.-Y. A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3612-3621, 2018.
  29. [53] Wu, Y. and Hu, B. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, p. 5602, 2018.
  30. [55] Yang, K., Klein, D., Celikyilmaz, A., Peng, N., and Tian, Y. RLCD: Reinforcement learning from contrast distillation for language model alignment, 2023.
  31. [57] Aho, A. V. and Ullman, J. D., 1972.
  32. [58] Publications Manual, 1983.
  33. [59] Chandra, A. K., Kozen, D. C., and Stockmeyer, L. J., 1981. doi:10.1145/322234.322243.
  34. [60] Andrew, G. and Gao, J. Scalable training of …
  35. [61] Gusfield, D., 1997.
  36. [62] Rasooli, M. S. and Tetreault, J. R. Computing Research Repository, 2015.
  37. [63] Ando, R. K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research.
  38. [64] Learning to summarize with human feedback. Advances in Neural Information Processing Systems.
  39. [65] Constitutional AI: Harmlessness from AI feedback, 2022.
  40. [66] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
  41. [67] Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, arXiv:1804.04235, 2018.
  42. [68] Mnih, V., et al. Asynchronous methods for deep reinforcement learning. arXiv:1602.01783, 2016.
  43. [69] Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
  44. [70] PaLM 2 technical report, 2023.
  45. [71] Fan, A., Lewis, M., and Dauphin, Y. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018. doi:10.18653/v1/P18-1082.
  46. [72] LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
  47. [73] PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  48. [74] Language models are few-shot learners. Advances in Neural Information Processing Systems.
  49. [75] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  50. [76] Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  51. [77] WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
  52. [78] Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
  53. [79] Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
  54. [80] Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv preprint arXiv:2307.16039.
  55. [81] Reward learning for efficient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894.
  56. [82] A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  57. [83] Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.
  58. [84] GPT-4 technical report, 2023.
  59. [85] Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  60. [86] An overview of Bard: an early experiment with generative AI, 2023.
  61. [87] OpenAI pricing, 2023.
  62. [88] AI Platform Data Labeling Service pricing, 2023.
  63. [89] Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021.
  64. [90] ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.
  65. [91] Ding, B., Qin, C., Liu, L., Chia, Y. K., Li, B., Joty, S., and Bing, L. Is GPT-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. doi:10.18653/v1/2023.acl-long.626.
  66. [92] RLCD: Reinforcement learning from contrast distillation for language model alignment, 2023.
  67. [93] Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence: 9th International Conference (AGI 2016), 2016.
  68. [94] Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
  69. [95] A natural policy gradient. Advances in Neural Information Processing Systems.
  70. [96] Factually consistent summarization via reinforcement learning with textual entailment feedback. arXiv preprint arXiv:2306.00186.
  71. [97] Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  72. [98] Reward design with language models. In The Eleventh International Conference on Learning Representations.
  73. [99] Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483.
  74. [100] News summarization and evaluation in the era of GPT-3. arXiv preprint arXiv:2209.12356.
  75. [101] Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
  76. [102] Scaling laws for reward model overoptimization. In International Conference on Machine Learning, 2023.
  77. [103] Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, 2023.
  78. [104] Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
  79. [105] Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021. doi:10.18653/v1/2021.findings-acl.84.
  80. [106] Towards zero-label language learning. arXiv preprint arXiv:2109.09193.

Showing first 80 references.