RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Pith reviewed 2026-05-15 21:29 UTC · model grok-4.3
The pith
Reinforcement learning from AI feedback matches human feedback performance for aligning large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across summarization, helpful dialogue, and harmless dialogue tasks, RLAIF achieves performance comparable to RLHF. RLAIF can also outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Direct-RLAIF obtains rewards directly from an off-the-shelf LLM during RL, without a separate reward model, and outperforms canonical RLAIF.
What carries the argument
Reward model trained on AI-generated preferences that substitutes for human labels in the standard RLHF pipeline, plus direct-RLAIF that uses LLM judgments as immediate rewards.
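To make the two reward pathways concrete, here is a minimal sketch; the callables `rm_score` and `llm_judge_score` and the 1-to-10 rating scale are illustrative assumptions, not the paper's implementation.

```python
# Schematic contrast between canonical RLAIF and direct-RLAIF (d-RLAIF).
# `rm_score` stands in for a reward model trained offline on AI-labeled
# preference pairs; `llm_judge_score` stands in for an off-the-shelf LLM
# asked to rate a response. Both are hypothetical, not the paper's code.
from typing import Callable

def canonical_rlaif_reward(prompt: str, response: str,
                           rm_score: Callable[[str, str], float]) -> float:
    """Canonical RLAIF: the reward comes from a reward model that was
    trained on preference pairs labeled by an AI labeler."""
    return rm_score(prompt, response)

def d_rlaif_reward(prompt: str, response: str,
                   llm_judge_score: Callable[[str, str], int]) -> float:
    """d-RLAIF: skip reward-model training and query an off-the-shelf LLM
    for a rating during RL, mapping it to a scalar reward."""
    rating = llm_judge_score(prompt, response)  # assume an integer in [1, 10]
    return (rating - 1) / 9.0                   # map to [0, 1]
```

In both cases the policy is optimized with the same RL machinery; only the source of the scalar reward differs, which is what lets d-RLAIF drop the reward-model training stage.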
If this is right
- Alignment of large models can proceed with far lower human annotation budgets.
- Models can improve using feedback from models of the same size or the same checkpoint.
- The RL pipeline can be simplified by removing the reward-model training stage.
- Iterative self-alignment becomes practical without repeated human data collection.
Where Pith is reading between the lines
- Fully automated alignment loops could become feasible, reducing human oversight in iterative training.
- The same substitution might work in other preference-learning settings such as robotics or code generation.
- If the quality gap closes further, human feedback could shift from primary data source to occasional validation set.
Load-bearing premise
An off-the-shelf large language model can generate preference labels that are high-quality enough to replace human judgments when training the reward model.
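A minimal sketch of what that premise asks of the labeler, assuming a generic text-completion callable `complete`; the prompt wording is an illustrative stand-in for the paper's templates, and a real labeler would also need to address position bias (for example by querying both response orderings).

```python
# Hypothetical AI preference labeling: show the labeler LLM a context and two
# candidate responses and ask for a preference. `complete` is a stand-in for
# any text-completion API, not a real library call.
from typing import Callable

def ai_preference_label(context: str, response_a: str, response_b: str,
                        complete: Callable[[str], str]) -> int:
    """Return 0 if the labeler prefers response A, 1 if it prefers B."""
    prompt = (
        "You will see a context and two candidate responses.\n\n"
        f"Context:\n{context}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is better? Answer with a single letter, A or B."
    )
    answer = complete(prompt).strip().upper()
    return 0 if answer.startswith("A") else 1
```

In canonical RLAIF the reward model is then trained on pairs labeled this way, exactly as it would be on human labels; if the premise fails, the error propagates into the reward model and the policy.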
What would settle it
A controlled human evaluation in which users consistently prefer responses from RLHF-trained models over RLAIF-trained models on the same tasks by a clear margin.
Original abstract
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RLAIF—training a reward model on preferences labeled by an off-the-shelf LLM—achieves performance comparable to RLHF across summarization, helpful dialogue, and harmless dialogue tasks. It further shows that RLAIF can exceed a supervised fine-tuned baseline even when the AI labeler is the same size or identical checkpoint as the policy, and introduces direct-RLAIF (d-RLAIF), which bypasses reward-model training by querying the LLM for rewards during PPO and reports superior results to canonical RLAIF.
Significance. If the empirical parity and d-RLAIF gains hold under rigorous controls, the work is significant because it directly addresses the data-scalability bottleneck of RLHF. Demonstrating that AI feedback can substitute for human preferences on both helpfulness and harmlessness, plus the self-improvement result with same-size labelers, would materially lower the cost of alignment and enable larger-scale iterative training.
major comments (3)
- [§4.2] §4.2 (harmless dialogue results): the claim of RLAIF–RLHF parity rests on the unverified assumption that the off-the-shelf LLM’s preference judgments are high-fidelity substitutes for human judgments on safety; no quantitative agreement rate, bias analysis, or error breakdown between AI and human labels is provided, so the observed parity could reflect shared model artifacts rather than true alignment. A minimal sketch of such an agreement analysis appears after this list.
- [§5.1] §5.1 (d-RLAIF description): bypassing the reward model by feeding LLM scores directly into PPO introduces non-stationary and potentially high-variance rewards; the manuscript reports superior performance but contains no ablation on query frequency, temperature, or reward normalization, leaving open whether gains arise from better signal or from optimization artifacts.
- [Table 3] Table 3 (human evaluation scores): margins between RLAIF and RLHF are small on helpfulness; without reported standard errors, number of annotators, or statistical significance tests, the “comparable performance” conclusion is not yet statistically supported.
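To make the first major comment concrete, an agreement analysis between AI and human labels on a shared set of preference pairs could be as simple as the sketch below; the binary label encoding and function names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical agreement analysis between AI-generated and human preference
# labels on the same pairs (0 = prefer first response, 1 = prefer second).
from collections import Counter

def agreement_and_kappa(ai_labels: list[int], human_labels: list[int]) -> tuple[float, float]:
    """Return (raw agreement rate, Cohen's kappa) for two binary label lists."""
    assert len(ai_labels) == len(human_labels) and ai_labels
    n = len(ai_labels)
    observed = sum(a == h for a, h in zip(ai_labels, human_labels)) / n
    # Chance agreement under independent marginals.
    ai_counts = Counter(ai_labels)
    human_counts = Counter(human_labels)
    expected = sum((ai_counts[c] / n) * (human_counts[c] / n) for c in (0, 1))
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa
```

Low chance-corrected agreement, or systematic divergence on safety-critical pairs, would support the worry that downstream parity reflects shared artifacts rather than faithful preference modeling.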
minor comments (2)
- [§3.1] §3.1: the preference-loss equation is described in prose but would benefit from an explicit mathematical statement to clarify the exact training objective used for the AI-labeled reward model. A schematic form of the standard pairwise objective is given after this list.
- [Figure 4] Figure 4: training curves for d-RLAIF lack error bands or multiple seeds, making stability comparisons with canonical RLAIF difficult to assess.
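Responding to the first minor comment: in most RLHF-style pipelines (e.g., Stiennon et al. [41]; Ouyang et al. [35]) the reward model is trained with the Bradley-Terry cross-entropy below. The paper may differ in details such as soft AI labels derived from the labeler's token probabilities, so read this as a schematic statement rather than a transcription of the paper's objective.

```latex
% Schematic reward-model loss: r_phi is the reward model, x the prompt,
% y_w the preferred and y_l the dispreferred response, sigma the logistic function.
\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]
```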
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
-
Referee: [§4.2] §4.2 (harmless dialogue results): the claim of RLAIF–RLHF parity rests on the unverified assumption that the off-the-shelf LLM’s preference judgments are high-fidelity substitutes for human judgments on safety; no quantitative agreement rate, bias analysis, or error breakdown between AI and human labels is provided, so the observed parity could reflect shared model artifacts rather than true alignment.
Authors: We agree that a direct quantitative comparison between the LLM labeler and human judgments on safety preferences would provide stronger support for the claim. Although the primary evidence for parity comes from downstream human evaluations of the trained policies (which are independent of the label source), we will add an analysis of agreement rates, bias, and error types between AI and human labels on the harmless dialogue preference data in the revised manuscript. revision: yes
-
Referee: [§5.1] §5.1 (d-RLAIF description): bypassing the reward model by feeding LLM scores directly into PPO introduces non-stationary and potentially high-variance rewards; the manuscript reports superior performance but contains no ablation on query frequency, temperature, or reward normalization, leaving open whether gains arise from better signal or from optimization artifacts.
Authors: We acknowledge that the current manuscript lacks ablations on these design choices for d-RLAIF. In the revision we will add experiments varying query frequency, temperature, and reward normalization to demonstrate that the reported gains are robust and not attributable to optimization artifacts. revision: yes
-
Referee: [Table 3] Table 3 (human evaluation scores): margins between RLAIF and RLHF are small on helpfulness; without reported standard errors, number of annotators, or statistical significance tests, the “comparable performance” conclusion is not yet statistically supported.
Authors: We agree that statistical details are necessary to support the comparability claims. We will revise Table 3 to report the number of annotators per comparison, standard errors, and results of statistical significance tests (e.g., bootstrap or paired tests) between RLAIF and RLHF conditions. revision: yes
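As one concrete instance of the tests the response mentions, a paired bootstrap over matched head-to-head outcomes could look like the sketch below; the data layout (one binary outcome per prompt, 1 if annotators preferred the RLAIF response) and all names are illustrative assumptions.

```python
# Hypothetical paired bootstrap over matched win/loss outcomes for RLAIF vs.
# RLHF on the same prompts. wins[i] = 1 if annotators preferred the RLAIF
# response for prompt i, 0 if they preferred the RLHF response.
import random

def bootstrap_win_rate_ci(wins: list[int], n_boot: int = 10_000,
                          alpha: float = 0.05, seed: int = 0) -> tuple[float, float, float]:
    """Return the observed win rate and a (1 - alpha) bootstrap confidence interval."""
    rng = random.Random(seed)
    n = len(wins)
    observed = sum(wins) / n
    resampled = sorted(
        sum(rng.choices(wins, k=n)) / n for _ in range(n_boot)
    )
    lo = resampled[int((alpha / 2) * n_boot)]
    hi = resampled[int((1 - alpha / 2) * n_boot) - 1]
    return observed, lo, hi
```

A confidence interval concentrated around 0.5 would back the parity claim, while an interval that excludes 0.5 would indicate a real gap in one direction.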
Circularity Check
No circularity: empirical performance comparisons only
Full rationale
The paper is an empirical study: it trains reward models on LLM-generated preferences and reports direct experimental outcomes (win rates, human evaluations) for RLAIF versus RLHF on summarization and dialogue tasks. No theoretical derivation is invoked to justify the claims; results come from running PPO with the respective reward signals and evaluating against held-out human preferences. The central assumption (that LLM preferences are viable substitutes) is tested rather than presupposed, and comparisons are made against external baselines, so the conclusion is not fed back in as a premise.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-generated preferences approximate human preferences closely enough to serve as a training signal for alignment.
Forward citations
Cited by 18 Pith papers
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
-
Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on...
-
Common-agency Games for Multi-Objective Test-Time Alignment
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
-
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning
WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much lar...
-
GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification
GRASP improves multimodal sarcasm target identification by anchoring visual regions in grounded chain-of-thought reasoning and using dual-stage optimization on a new balanced dataset.
-
Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate
Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTF...
-
HybridFlow: A Flexible and Efficient RLHF Framework
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
-
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...
-
Probably Approximately Consensus: On the Learning Theory of Finding Common Ground
Models consensus as a PAC-learnable interval in embedded 1D opinion space via ERM that maximizes expected agreement over an issue distribution.
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
-
OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems
OOM-RL aligns multi-agent LLM systems for software engineering by using real financial market losses as an un-hackable negative gradient, resulting in a mature-phase annualized Sharpe ratio of 2.06 via a strict test-d...
-
Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications
RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.
-
ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Reference graph
Works this paper leans on
- [3] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., et al., 2022.
- [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
- [6] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
- [8] Dong, H., Xiong, W., Goyal, D., Zhang, Y., Chow, W., Pan, R., Diao, S., Zhang, J., Shum, K., and Zhang, T. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=m7p5O7zblY.
- [9] Ethayarajh, K., Choi, Y., and Swayamdipta, S. Understanding dataset difficulty with V-usable information. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5988-6008. PMLR, 2022.
- [10] Everitt, T. and Hutter, M. Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence: 9th International Conference, AGI 2016, New York, NY, USA, July 16-19, 2016, Proceedings 9, pp. 12-22. Springer, 2016.
- [15] Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pp. 2160-2169. PMLR, 2019.
- [18] Google. AI Platform Data Labeling Service pricing. https://cloud.google.com/ai-platform/data-labeling/pricing#labeling_costs, 2023. Accessed 2023-09-28.
- [19] Google, R. A., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., Chu, E., Clark, J. H., Shafey, L. E., Huang, Y., Meier-Hellstern, K., Mishra, G., Moreira, E., Omernick, M., Robinson, K., Ruder, S., Tay, Y., Xiao, K., Xu, Y., Zhang, Y., Abrego, G. H., Ahn, J., Austin, J., Barham, P., Botha, J., et al., 2023.
- [20] Howard, R. A. Dynamic Programming and Markov Processes. John Wiley, 1960.
- [22] Jaques, N., Gu, S., Bahdanau, D., Hernández-Lobato, J. M., Turner, R. E., and Eck, D. Sequence Tutor: Conservative fine-tuning of sequence generation models with KL-control. In International Conference on Machine Learning, pp. 1645-1654. PMLR, 2017.
- [24] Kendall, M. G. and Smith, B. B. The Problem of m Rankings. The Annals of Mathematical Statistics, 10(3):275-287, 1939. doi:10.1214/aoms/1177732186. URL https://doi.org/10.1214/aoms/1177732186.
- [25] Kwon, M., Xie, S. M., Bullard, K., and Sadigh, D. Reward design with language models. In The Eleventh International Conference on Learning Representations, 2022.
- [29] Manyika, J. An overview of Bard: an early experiment with generative AI. https://ai.google/static/documents/google-about-bard.pdf, 2023. Accessed 2023-08-23.
- [30] Meng, Y., Michalski, M., Huang, J., Zhang, Y., Abdelzaher, T., and Han, J. Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, pp. 24457-24477. PMLR, 2023.
- [31] Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048-11064, 2022.
- [33]
- [34] OpenAI. OpenAI pricing. https://openai.com/pricing, 2023b. Accessed 2023-09-28.
- [35] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.
- [37] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
- [41] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008-3021, 2020.
- [42] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
- [46] Wang, S., Liu, Y., Xu, Y., Zhu, C., and Zeng, M. Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 4195-4205, 2021a.
- [47] Wang, X., Wei, J., Schuurmans, D., Le, Q. V., Chi, E. H., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022b.
- [49] Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
- [50] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.
- [51] Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.
- [52] Wu, L., Tian, F., Qin, T., Lai, J., and Liu, T.-Y. A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3612-3621, 2018.
- [53]
- [55] Yang, K., Klein, D., Celikyilmaz, A., Peng, N., and Tian, Y. RLCD: Reinforcement learning from contrast distillation for language model alignment, 2023.
- [57]
- [58] Publications Manual, 1983.
- [59] Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer, 1981. doi:10.1145/322234.322243.
- [60]
- [61] Dan Gusfield, 1997.
- [62] Mohammad Sadegh Rasooli and Joel R. Tetreault. Computing Research Repository, 2015.
- [63] Ando, Rie Kubota and Zhang, Tong. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.
- [64] Learning to summarize with human feedback. Advances in Neural Information Processing Systems.
- [65] Constitutional AI: Harmlessness from AI Feedback, 2022.
- [66] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
- [67] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. CoRR, arXiv:1804.04235, 2018.
- [68] Volodymyr Mnih et al. Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783, 2016.
- [69] Self-Consistency Improves Chain of Thought Reasoning in Language Models. The Eleventh International Conference on Learning Representations.
- [70]
- [71] Fan, Angela, Lewis, Mike, and Dauphin, Yann. Hierarchical Neural Story Generation. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018. doi:10.18653/v1/P18-1082.
- [72] LaMDA: Language Models for Dialog Applications. arXiv preprint arXiv:2201.08239.
- [73] PaLM: Scaling Language Modeling with Pathways. arXiv preprint arXiv:2204.02311.
- [74] Language models are few-shot learners. Advances in Neural Information Processing Systems.
- [75] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
- [76] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.
- [77] WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- [78] Learning to extract coherent summary via deep reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence.
- [79] Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
- [80] Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2307.16039.
- [81] Reward learning for efficient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894.
- [82] A Study of Reinforcement Learning for Neural Machine Translation. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
- [83] Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.
- [84]
- [85] Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144.
- [86] An overview of Bard: an early experiment with generative AI, 2023.
- [87]
- [88]
- [89] Want To Reduce Labeling Cost? GPT-3 Can Help. Findings of the Association for Computational Linguistics: EMNLP 2021.
- [90] ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.
- [91] Ding, Bosheng, Qin, Chengwei, Liu, Linlin, Chia, Yew Ken, Li, Boyang, Joty, Shafiq, and Bing, Lidong. Is GPT-3 a Good Data Annotator? Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. doi:10.18653/v1/2023.acl-long.626.
- [92] RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment, 2023.
- [93] Avoiding wireheading with value reinforcement learning. Artificial General Intelligence: 9th International Conference, AGI 2016, New York, NY, USA, July 16-19, 2016, Proceedings 9, 2016.
- [94] Concrete Problems in AI Safety. arXiv preprint arXiv:1606.06565.
- [95] A natural policy gradient. Advances in Neural Information Processing Systems.
- [96] Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback. arXiv preprint arXiv:2306.00186.
- [97] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
- [98] Reward Design with Language Models. The Eleventh International Conference on Learning Representations.
- [99] Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions. arXiv preprint arXiv:2308.11483.
- [100] News summarization and evaluation in the era of GPT-3. arXiv preprint arXiv:2209.12356.
- [101] Fine-Tuning Language Models from Human Preferences. arXiv preprint arXiv:1909.08593.
- [102] Scaling laws for reward model overoptimization. International Conference on Machine Learning, 2023.
- [103] Tuning language models as training data generators for augmentation-enhanced few-shot learning. International Conference on Machine Learning, 2023.
- [104] Self-Refine: Iterative Refinement with Self-Feedback. arXiv preprint arXiv:2303.17651.
- [105] Feng, Steven Y., Gangal, Varun, Wei, Jason, Chandar, Sarath, Vosoughi, Soroush, Mitamura, Teruko, and Hovy, Eduard. A Survey of Data Augmentation Approaches for NLP. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021. doi:10.18653/v1/2021.findings-acl.84.
- [106] Towards zero-label language learning. arXiv preprint arXiv:2109.09193.
discussion (0)