Recognition: 2 theorem links · Lean Theorem
Self-Rewarding Language Models
Pith reviewed 2026-05-13 11:57 UTC · model grok-4.3
The pith
Language models can train themselves by using their own judgments to generate rewards for iterative improvement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.
What carries the argument
LLM-as-a-Judge prompting that lets the model generate its own reward signals to drive Iterative DPO updates.
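To make the mechanism concrete, here is a minimal sketch of one self-rewarding iteration. It assumes three hypothetical helpers not defined in the paper: generate (samples a response from the model), judge_score (applies an LLM-as-a-Judge prompt and returns a numeric score), and dpo_update (runs DPO on the collected preference pairs). The best-versus-worst pairing rule follows the general recipe described in the core claim, not the authors' exact code.

```python
def self_rewarding_iteration(model, prompts, generate, judge_score, dpo_update,
                             n_samples=4):
    """One self-rewarding iteration: the model generates its own candidate
    responses, judges them with an LLM-as-a-Judge prompt, turns the scores
    into preference pairs, and is updated with DPO on those pairs.
    `generate`, `judge_score`, and `dpo_update` are hypothetical helpers."""
    preference_pairs = []
    for prompt in prompts:
        # 1. Sample several candidate responses from the current model.
        candidates = [generate(model, prompt) for _ in range(n_samples)]
        # 2. Score each candidate with the same model acting as judge
        #    (e.g. via a 0-5 additive-rubric prompt).
        scored = sorted(((judge_score(model, prompt, c), c) for c in candidates),
                        key=lambda sc: sc[0], reverse=True)
        best_score, best = scored[0]
        worst_score, worst = scored[-1]
        # 3. Keep only prompts where the judge actually separates candidates.
        if best_score > worst_score:
            preference_pairs.append((prompt, best, worst))
    # 4. Run DPO on the self-generated preference pairs to get the next model.
    return dpo_update(model, preference_pairs)
```

Repeating this loop, each time generating and judging with the newest model, produces the iteration sequence whose third step is the model evaluated on AlpacaEval 2.0.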
If this is right
- Instruction-following performance rises with each self-rewarding iteration.
- The model's ability to act as a judge improves alongside its generation ability.
- The final model exceeds the AlpacaEval 2.0 results of Claude 2, Gemini Pro, and GPT-4 0613.
- Models can continue improving both generation and evaluation without new human preference data.
Where Pith is reading between the lines
- Self-rewarding could let models scale training signals beyond current human data limits.
- The same loop might apply to domains such as math reasoning or code generation.
- Repeated iterations could eventually produce models whose judgments exceed typical human consistency.
Load-bearing premise
The model's self-generated judgments must be reliable enough to produce genuine capability gains instead of reinforcing its own errors or biases.
What would settle it
Replicate the three-iteration process on Llama 2 70B and check whether AlpacaEval 2.0 scores keep rising across iterations, or instead stall or drop while judgment-bias metrics rise.
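A sketch of how that check could be scripted, assuming two hypothetical evaluation hooks: alpaca_eval_win_rate (wrapping the AlpacaEval 2.0 evaluator) and judgment_bias (any chosen bias metric, e.g. a self-preference rate); neither is provided by the paper.

```python
def settle_it(models, alpaca_eval_win_rate, judgment_bias):
    """Track benchmark score and judgment bias across iterations, e.g.
    models = [("M1", m1), ("M2", m2), ("M3", m3)]. The self-rewarding claim
    is supported if win rate rises iteration over iteration; it is challenged
    if win rate stalls or drops while the bias metric climbs."""
    history = [(name, alpaca_eval_win_rate(m), judgment_bias(m)) for name, m in models]
    for (prev_name, prev_win, prev_bias), (name, win, bias) in zip(history, history[1:]):
        verdict = "supports" if win > prev_win else "challenges"
        print(f"{prev_name}->{name}: win {prev_win:.1f}->{win:.1f}, "
              f"bias {prev_bias:.3f}->{bias:.3f} ({verdict} the claim)")
    return history
```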
read the original abstract
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Self-Rewarding Language Models in which the LLM itself, via LLM-as-a-Judge prompting, generates its own rewards for Iterative DPO training. Starting from Llama 2 70B, three iterations yield a model that outperforms Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard while also improving its own reward-generation quality.
Significance. If the self-reward signal is reliable, the approach could enable continual autonomous improvement without human feedback bottlenecks, a potentially important direction for scalable alignment. The public-benchmark results and explicit iteration protocol provide a reproducible empirical foundation.
major comments (3)
- [§4 (Experiments)] No correlation is reported between the self-generated rewards and either human preference annotations or scores from a held-out stronger judge on the training distribution. This validation is load-bearing for the claim that iterative DPO produces genuine capability gains rather than amplification of the base model's judgment biases.
- [§3 (Method)] The exact LLM-as-a-Judge prompt template is not reproduced, and no ablations or bias controls (self-preference, length, format) are presented for the reward-generation step, leaving the training signal's robustness unexamined.
- [Table 2 / §4.2] The AlpacaEval 2.0 gains lack reported statistical significance, standard errors, or variance across runs, weakening the strength of the outperformance claims relative to the listed baselines.
minor comments (2)
- [Abstract] The phrase 'many existing systems' could list the specific models outperformed for immediate clarity.
- [§5 (Discussion)] The limitations paragraph could explicitly address the risk of self-reinforcing judgment errors.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major point below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [§4 (Experiments)] No correlation is reported between the self-generated rewards and either human preference annotations or scores from a held-out stronger judge on the training distribution. This validation is load-bearing for the claim that iterative DPO produces genuine capability gains rather than amplification of the base model's judgment biases.
Authors: We agree that explicit correlation analysis would strengthen the validation. In the revised manuscript we will add a new subsection reporting Pearson and Spearman correlations between the self-reward scores and a held-out GPT-4 judge on a 500-example subset of the training distribution. Human preference annotations for the precise training prompts are not available, which we will note as a limitation; however, the consistent gains on held-out benchmarks (AlpacaEval 2.0, MT-Bench) provide supporting evidence that the improvements are not solely bias amplification. revision: yes
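A sketch of the correlation analysis the authors propose, assuming the paired score lists have already been collected for the same (prompt, response) pairs; the scipy calls are one standard way to compute the two coefficients, not the authors' pipeline.

```python
from scipy.stats import pearsonr, spearmanr

def reward_agreement(self_rewards, external_judge_scores):
    """Correlate the model's self-assigned rewards with scores from a
    held-out external judge on the same (prompt, response) pairs."""
    r, r_p = pearsonr(self_rewards, external_judge_scores)
    rho, rho_p = spearmanr(self_rewards, external_judge_scores)
    return {"pearson": r, "pearson_p": r_p, "spearman": rho, "spearman_p": rho_p}

# Toy example with five scored responses:
print(reward_agreement([4, 2, 5, 3, 1], [4.5, 2.5, 4.0, 3.5, 1.0]))
```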
-
Referee: [§3 (Method)] The exact LLM-as-a-Judge prompt template is not reproduced, and no ablations or bias controls (self-preference, length, format) are presented for the reward-generation step, leaving the training signal's robustness unexamined.
Authors: We will reproduce the complete LLM-as-a-Judge prompt template verbatim in the appendix of the revised paper. For bias controls, we performed internal checks during development showing negligible length and format bias; we will add a short paragraph and one supplementary table summarizing these checks (self-preference was not observed to be significant). Full ablations of every bias type would require additional compute, so we treat this as a partial revision. revision: partial
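Two cheap probes of the kind this response alludes to, assuming a list of scored records with the response text, the judge's score, and an 'author' tag marking whether the judging model wrote the response itself; the definitions below are illustrative, not the authors' internal checks.

```python
from scipy.stats import spearmanr
from statistics import mean

def length_bias(records):
    """Rank correlation between judge score and response length; a large
    positive value suggests the judge rewards verbosity rather than quality."""
    lengths = [len(r["response"].split()) for r in records]
    scores = [r["score"] for r in records]
    rho, _ = spearmanr(lengths, scores)
    return rho

def self_preference(records):
    """Mean score gap between responses the judging model wrote itself and
    responses written by another model on the same prompt set."""
    own = [r["score"] for r in records if r["author"] == "self"]
    other = [r["score"] for r in records if r["author"] == "other"]
    return mean(own) - mean(other)
```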
-
Referee: [Table 2 / §4.2] The AlpacaEval 2.0 gains lack reported statistical significance, standard errors, or variance across runs, weakening the strength of the outperformance claims relative to the listed baselines.
Authors: We acknowledge that error bars would be desirable. However, the prohibitive cost of repeating full 70B DPO training runs multiple times makes multi-seed statistics infeasible for this study. In the revision we will add an explicit limitations paragraph noting this constraint and the community norm of single-run reporting for large-scale LLM training, while also reporting prompt-level variance from the AlpacaEval 2.0 evaluator itself. revision: partial
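A sketch of the promised prompt-level variance reporting, assuming per-prompt win indicators (1 if the model's response was preferred over the baseline) are already available from the AlpacaEval 2.0 evaluator; the bootstrap is a generic resampling estimate, not the leaderboard's own procedure.

```python
import random
from statistics import mean, stdev

def win_rate_with_bootstrap_se(wins, n_boot=1000, seed=0):
    """Win rate plus a bootstrap standard error over prompt-level outcomes.
    `wins` is a list of 0/1 indicators, one per evaluation prompt."""
    rng = random.Random(seed)
    point = mean(wins)
    resampled = [mean(rng.choices(wins, k=len(wins))) for _ in range(n_boot)]
    return point, stdev(resampled)

# Toy example: an 805-prompt evaluation set with 160 preferred responses.
wins = [1] * 160 + [0] * 645
rate, se = win_rate_with_bootstrap_se(wins)
print(f"win rate {100 * rate:.1f}% +/- {100 * se:.1f}%")
```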
Circularity Check
No circularity: empirical gains measured on external benchmark
full rationale
The paper presents an iterative self-rewarding training procedure (LLM-as-a-Judge prompting followed by DPO) whose effectiveness is evaluated by direct comparison of the resulting model against independent external systems on the public AlpacaEval 2.0 leaderboard. No equations, fitted parameters, or self-citations are invoked that would reduce the reported performance gains to a definitional or tautological identity with the training inputs. The self-rewarding loop is a training algorithm whose outputs are tested against held-out benchmarks and other models rather than being presupposed by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of iterations
- LLM-as-Judge prompt template
axioms (1)
- domain assumption: DPO training produces reliable improvements when given preference pairs or reward signals
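Since the ledger's lone axiom concerns DPO itself, a minimal reference sketch of the standard DPO objective (Rafailov et al.) may help; it takes summed per-response log-probabilities under the policy and the frozen reference model, and is not taken from this paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy to prefer the chosen response over
    the rejected one more strongly than the frozen reference model does.
    Each argument is a tensor of summed log-probabilities, one per pair."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with three preference pairs:
loss = dpo_loss(torch.tensor([-12.0, -8.5, -20.0]), torch.tensor([-15.0, -9.0, -19.0]),
                torch.tensor([-13.0, -8.0, -21.0]), torch.tensor([-14.0, -9.5, -18.0]))
```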
Forward citations
Cited by 29 Pith papers
-
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
-
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST uses on-policy RL to train code models on ranking their own sampled solutions by sandbox execution correctness, improving judgment NDCG, pass@1, and Best-of-4 accuracy while showing that SFT on the same data does...
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.
-
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
-
Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies
TACC algorithm for adaptive multi-fidelity bandits with improving proxies achieves instance-dependent regret by replacing logarithmic high-fidelity pulls with bounded low-fidelity continuation for intermediate arms.
-
Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
Iterative LLM-driven search over reward functions, screened via GRPO on GSM8K, raises F1 from 0.609 baseline to 0.795 with ensembles on Llama-3.2-3B.
-
Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
Iterative search over reward functions with ranked feedback in GRPO training improves LLM math reasoning, achieving F1 of 0.795 on GSM8K versus 0.609 for baseline.
-
IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning
IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...
-
Neural Garbage Collection: Learning to Forget while Learning to Reason
Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
-
TextGrad: Automatic "Differentiation" via Text
TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.
-
KTO: Model Alignment as Prospect Theoretic Optimization
KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
-
Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation
MISE proves that hindsight self-evaluation rewards equal minimizing mutual information plus KL divergence to a proxy policy, and experiments show 7B LLMs reaching GPT-4o-level results on validation tasks.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...
-
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
-
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...
-
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
-
ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.
-
PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs
PoliLegalLM, trained with continued pretraining, progressive SFT, and preference RL on a legal corpus, outperforms similar-scale models on LawBench, LexEval, and a real-world PoliLegal dataset while staying competitiv...
-
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
LlamaFactory provides a unified no-code framework for efficient fine-tuning of 100+ LLMs via an integrated web UI and has been released on GitHub.
-
Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO
Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.