pith. machine review for the scientific record.

arxiv: 2401.10020 · v3 · submitted 2024-01-18 · 💻 cs.CL · cs.AI

Self-Rewarding Language Models

Pith reviewed 2026-05-13 11:57 UTC · model grok-4.3

keywords: self-rewarding language models · LLM-as-a-Judge · iterative DPO · instruction following · preference optimization · AlpacaEval · self-improvement

The pith

Language models can train themselves by using their own judgments to generate rewards for iterative improvement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates training language models where the model itself acts as the source of rewards instead of relying on separate human-labeled data. This self-rewarding process uses the model to score its own outputs via prompting, then applies those scores in direct preference optimization over multiple rounds. The approach is tested by starting with Llama 2 70B and running three iterations, during which both the model's instruction-following quality and its ability to judge responses improve together. The final model reaches higher scores on the AlpacaEval 2.0 leaderboard than several established systems. This setup suggests models could keep advancing their capabilities without new external feedback.
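One round of the loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate` and `judge` are hypothetical callables standing in for the model's sampling and LLM-as-a-Judge scoring, `n_candidates=4` is an illustrative default, and pairing the highest-scored candidate against the lowest-scored one is an assumption that this page does not state.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: the same model would sit behind both callables.
Generate = Callable[[str], str]        # prompt -> one sampled response
Judge = Callable[[str, str], float]    # (prompt, response) -> self-assigned score


def build_preference_pair(
    prompt: str, generate: Generate, judge: Judge, n_candidates: int = 4
) -> Optional[Tuple[str, str]]:
    """Sample candidates, score them with the judge, return (chosen, rejected)."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    scored = sorted(((judge(prompt, c), c) for c in candidates), key=lambda t: t[0])
    low_score, rejected = scored[0]
    high_score, chosen = scored[-1]
    if high_score == low_score:
        # All candidates tie, so there is no preference signal (assumed policy: skip).
        return None
    return chosen, rejected


def self_rewarding_round(
    prompts: List[str], generate: Generate, judge: Judge, n_candidates: int = 4
) -> List[Tuple[str, str, str]]:
    """Collect (prompt, chosen, rejected) rows to feed one DPO update."""
    rows = []
    for prompt in prompts:
        pair = build_preference_pair(prompt, generate, judge, n_candidates)
        if pair is not None:
            rows.append((prompt, pair[0], pair[1]))
    return rows
```

In an iterative setup, the rows returned by `self_rewarding_round` would train the next model, which then replaces both `generate` and `judge` for the following round.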

Core claim

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.

What carries the argument

LLM-as-a-Judge prompting that lets the model generate its own reward signals to drive Iterative DPO updates.
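This page does not reproduce the training objective, so the block below is the standard DPO loss for a single preference pair, given here only to make the update step concrete. The argument names and `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed token log-probabilities of each response under the policy and a frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log sigmoid(margin)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.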

If this is right

  • Instruction-following performance rises with each self-rewarding iteration.
  • The model's ability to act as a judge improves alongside its generation ability.
  • The final model exceeds the AlpacaEval 2.0 results of Claude 2, Gemini Pro, and GPT-4 0613.
  • Models can continue improving both generation and evaluation without new human preference data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Self-rewarding could let models scale training signals beyond current human data limits.
  • The same loop might apply to domains such as math reasoning or code generation.
  • Repeated iterations could eventually produce models whose judgments exceed typical human consistency.

Load-bearing premise

The model's self-generated judgments must be reliable enough to produce genuine capability gains instead of reinforcing its own errors or biases.

What would settle it

Run the three-iteration process on Llama 2 70B and measure whether AlpacaEval 2.0 scores stay flat or drop while judgment bias metrics rise.

Original abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Self-Rewarding Language Models in which the LLM itself, via LLM-as-a-Judge prompting, generates its own rewards for Iterative DPO training. Starting from Llama 2 70B, three iterations yield a model that outperforms Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard while also improving its own reward-generation quality.

Significance. If the self-reward signal is reliable, the approach could enable continual autonomous improvement without human feedback bottlenecks, a potentially important direction for scalable alignment. The public-benchmark results and explicit iteration protocol provide a reproducible empirical foundation.

major comments (3)
  1. [§4 (Experiments)] No correlation is reported between the self-generated rewards and either human preference annotations or scores from a held-out stronger judge on the training distribution. This validation is load-bearing for the claim that iterative DPO produces genuine capability gains rather than amplification of the base model's judgment biases.
  2. [§3 (Method)] The exact LLM-as-a-Judge prompt template is not reproduced, and no ablations or bias controls (self-preference, length, format) are presented for the reward-generation step, leaving the training signal's robustness unexamined.
  3. [Table 2 / §4.2] The AlpacaEval 2.0 gains lack reported statistical significance, standard errors, or variance across runs, weakening the strength of the outperformance claims relative to the listed baselines.
minor comments (2)
  1. [Abstract] The phrase 'many existing systems' could list the specific models outperformed for immediate clarity.
  2. [§5 (Discussion)] The limitations paragraph could explicitly address the risk of self-reinforcing judgment errors.
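The validation requested in major comment 1 amounts to correlating two score lists over the same responses. A minimal sketch, assuming one self-assigned score and one external-judge score per response; the function names are hypothetical and nothing here comes from the paper's code.

```python
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Tie handling matters here because judge scores on a coarse scale produce many equal values, which is why `average_ranks` assigns shared ranks instead of arbitrary ones.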

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below, indicating planned revisions where appropriate.

point-by-point responses
  1. Referee: [§4 (Experiments)] No correlation is reported between the self-generated rewards and either human preference annotations or scores from a held-out stronger judge on the training distribution. This validation is load-bearing for the claim that iterative DPO produces genuine capability gains rather than amplification of the base model's judgment biases.

    Authors: We agree that explicit correlation analysis would strengthen the validation. In the revised manuscript we will add a new subsection reporting Pearson and Spearman correlations between the self-reward scores and a held-out GPT-4 judge on a 500-example subset of the training distribution. Human preference annotations for the precise training prompts are not available, which we will note as a limitation; however, the consistent gains on held-out benchmarks (AlpacaEval 2.0, MT-Bench) provide supporting evidence that the improvements are not solely bias amplification. revision: yes

  2. Referee: [§3 (Method)] The exact LLM-as-a-Judge prompt template is not reproduced, and no ablations or bias controls (self-preference, length, format) are presented for the reward-generation step, leaving the training signal's robustness unexamined.

    Authors: We will reproduce the complete LLM-as-a-Judge prompt template verbatim in the appendix of the revised paper. For bias controls, we performed internal checks during development showing negligible length and format bias; we will add a short paragraph and one supplementary table summarizing these checks (self-preference was not observed to be significant). Full ablations of every bias type would require additional compute, so we treat this as a partial revision. revision: partial

  3. Referee: [Table 2 / §4.2] The AlpacaEval 2.0 gains lack reported statistical significance, standard errors, or variance across runs, weakening the strength of the outperformance claims relative to the listed baselines.

    Authors: We acknowledge that error bars would be desirable. However, the prohibitive cost of repeating full 70B DPO training runs multiple times makes multi-seed statistics infeasible for this study. In the revision we will add an explicit limitations paragraph noting this constraint and the community norm of single-run reporting for large-scale LLM training, while also reporting prompt-level variance from the AlpacaEval 2.0 evaluator itself. revision: partial
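The prompt-level variance mentioned in the third response can be estimated without retraining by bootstrapping over evaluation prompts. A minimal sketch, assuming each prompt yields a binary outcome (1 if the evaluator preferred the model over the baseline, 0 otherwise); the function name and defaults are hypothetical. It captures only sampling noise over prompts, not variance across training runs or seeds.

```python
import random
from typing import Sequence, Tuple


def bootstrap_win_rate(
    outcomes: Sequence[int], n_resamples: int = 2000, seed: int = 0
) -> Tuple[float, float]:
    """Return (win rate, bootstrap standard error) from per-prompt 0/1 outcomes."""
    if not outcomes or n_resamples < 2:
        raise ValueError("need at least one outcome and two resamples")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(outcomes)
    point = sum(outcomes) / n
    rates = []
    for _ in range(n_resamples):
        # Resample prompts with replacement and recompute the win rate.
        sample = rng.choices(outcomes, k=n)
        rates.append(sum(sample) / n)
    mean = sum(rates) / n_resamples
    variance = sum((r - mean) ** 2 for r in rates) / (n_resamples - 1)
    return point, variance ** 0.5
```

Two systems whose win rates differ by less than a couple of these standard errors would be hard to separate on prompt-level evidence alone.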

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external benchmark

The paper presents an iterative self-rewarding training procedure (LLM-as-a-Judge prompting followed by DPO) whose effectiveness is evaluated by direct comparison of the resulting model against independent external systems on the public AlpacaEval 2.0 leaderboard. No equations, fitted parameters, or self-citations are invoked that would reduce the reported performance gains to a definitional or tautological identity with the training inputs. The self-rewarding loop is a training algorithm whose outputs are tested against held-out benchmarks and other models rather than being presupposed by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard DPO assumptions from prior literature plus the novel self-rewarding loop. Free parameters include the number of iterations (set to 3) and the exact judge prompt template. No new physical or mathematical entities are postulated.

free parameters (2)
  • number of iterations
    Set to three to reach the reported leaderboard performance; value chosen after experimentation.
  • LLM-as-Judge prompt template
    Specific wording used to elicit self-rewards is a design choice that affects training signal quality.
axioms (1)
  • domain assumption: DPO training produces reliable improvements when given preference pairs or reward signals
    Invoked implicitly when using self-generated rewards to drive updates; relies on prior DPO results.

Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

    cs.LG 2026-05 conditional novelty 7.0

    FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

  2. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  3. Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.

  4. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

    cs.LG 2026-05 conditional novelty 7.0

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  5. Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

    cs.LG 2026-05 unverdicted novelty 7.0

    DuST uses on-policy RL to train code models on ranking their own sampled solutions by sandbox execution correctness, improving judgment NDCG, pass@1, and Best-of-4 accuracy while showing that SFT on the same data does...

  6. Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

    cs.LG 2026-05 conditional novelty 7.0

    DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.

  7. CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.

  8. Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies

    cs.LG 2026-05 unverdicted novelty 7.0

    TACC algorithm for adaptive multi-fidelity bandits with improving proxies achieves instance-dependent regret by replacing logarithmic high-fidelity pulls with bounded low-fidelity continuation for intermediate arms.

  9. Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning

    cs.CL 2026-05 accept novelty 7.0

    Iterative LLM-driven search over reward functions, screened via GRPO on GSM8K, raises F1 from 0.609 baseline to 0.795 with ensembles on Llama-3.2-3B.

  10. Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    Iterative search over reward functions with ranked feedback in GRPO training improves LLM math reasoning, achieving F1 of 0.795 on GSM8K versus 0.609 for baseline.

  11. IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...

  12. Neural Garbage Collection: Learning to Forget while Learning to Reason

    cs.LG 2026-04 conditional novelty 7.0

    Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.

  13. TextGrad: Automatic "Differentiation" via Text

    cs.CL 2024-06 unverdicted novelty 7.0

    TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.

  14. KTO: Model Alignment as Prospect Theoretic Optimization

    cs.LG 2024-02 conditional novelty 7.0

    KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

  15. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  16. SEIF: Self-Evolving Reinforcement Learning for Instruction Following

    cs.CL 2026-05 conditional novelty 6.0

    SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

  17. Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

    cs.CL 2026-04 unverdicted novelty 6.0

    MISE proves that hindsight self-evaluation rewards equal minimizing mutual information plus KL divergence to a proxy policy, and experiments show 7B LLMs reaching GPT-4o-level results on validation tasks.

  18. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  19. Can LLMs Learn to Reason Robustly under Noisy Supervision?

    cs.LG 2026-04 conditional novelty 6.0

    Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...

  20. AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

    cs.AI 2026-03 conditional novelty 6.0

    AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downs...

  21. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    cs.CL 2024-06 unverdicted novelty 6.0

    FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.

  22. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    cs.LG 2024-01 unverdicted novelty 6.0

    SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...

  23. StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    cs.CL 2026-05 unverdicted novelty 5.0

    StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

  24. ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

    cs.SE 2026-05 unverdicted novelty 4.0

    ARIS is a three-layer open-source system that uses cross-model adversarial collaboration plus claim-auditing pipelines to make LLM-driven research workflows more reliable.

  25. PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs

    cs.CL 2026-04 unverdicted novelty 4.0

    PoliLegalLM, trained with continued pretraining, progressive SFT, and preference RL on a legal corpus, outperforms similar-scale models on LawBench, LexEval, and a real-world PoliLegal dataset while staying competitiv...

  26. MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

    cs.CY 2026-04 unverdicted novelty 4.0

    MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.

  27. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

  28. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    cs.CL 2024-03 unverdicted novelty 4.0

    LlamaFactory provides a unified no-code framework for efficient fine-tuning of 100+ LLMs via an integrated web UI and has been released on GitHub.

  29. Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

    cs.CL 2026-04 unverdicted novelty 3.0

    Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.

  30. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 27 Pith papers · 21 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems , volume=

    Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

  2. [2]

    Think you have solved question answering?

    Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind , journal=. Think you have solved question answering?

  3. [4]

    2019 , journal =

    Natural Questions: a Benchmark for Question Answering Research , author =. 2019 , journal =

  4. [5]

    EMNLP , year=

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. EMNLP , year=

  5. [6]

    9th International Conference on Learning Representations,

    Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt , title =. 9th International Conference on Learning Representations,. 2021 , url =

  6. [7]

    Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

    Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi , title =. Thirty-Fourth AAAI Conference on Artificial Intelligence , year =

  7. [10]

    ROUGE : A Package for Automatic Evaluation of Summaries

    Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

  8. [11]

    2023 , howpublished =

    Anthropic , title =. 2023 , howpublished =

  9. [12]

    arXiv preprint arXiv:1511.06709 , year=

    Improving neural machine translation models with monolingual data , author=. arXiv preprint arXiv:1511.06709 , year=

  10. [13]

    arXiv preprint arXiv:1906.06442 , year=

    Tagged back-translation , author=. arXiv preprint arXiv:1906.06442 , year=

  11. [16]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Qlora: Efficient finetuning of quantized llms , author=. arXiv preprint arXiv:2305.14314 , year=

  12. [17]

    arXiv preprint arXiv:2304.08460 , year=

    Longform: Optimizing instruction tuning for long text generation with corpus extraction , author=. arXiv preprint arXiv:2304.08460 , year=

  13. [18]

    Advances in Neural Information Processing Systems , volume=

    Process for adapting language models to society (palms) with values-targeted datasets , author=. Advances in Neural Information Processing Systems , volume=

  14. [19]

    arXiv preprint arXiv:2305.11206 , year=

    Lima: Less is more for alignment , author=. arXiv preprint arXiv:2305.11206 , year=

  15. [20]

    arXiv preprint arXiv:2305.15717 , year =

    The false promise of imitating proprietary llms , author=. arXiv preprint arXiv:2305.15717 , year=

  16. [21]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    Wizardlm: Empowering large language models to follow complex instructions , author=. arXiv preprint arXiv:2304.12244 , year=

  17. [22]

    and Stoica, Ion and Xing, Eric P

    Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

  18. [23]

    Hashimoto , title =

    Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

  19. [24]

    Instruction Tuning with GPT-4

    Instruction tuning with gpt-4 , author=. arXiv preprint arXiv:2304.03277 , year=

  20. [25]

    Opt-iml: Scaling language model instruction meta learning through the lens of generalization.arXiv preprint arXiv:2212.12017, 2022

    Opt-iml: Scaling language model instruction meta learning through the lens of generalization , author=. arXiv preprint arXiv:2212.12017 , year=

  21. [26]

    arXiv e-prints , pages=

    Scaling Instruction-Finetuned Language Models , author=. arXiv e-prints , pages=

  22. [27]

    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning , author=

  23. [28]

    Finetuned Language Models Are Zero-Shot Learners

    Finetuned language models are zero-shot learners , author=. arXiv preprint arXiv:2109.01652 , year=

  24. [29]

    Hashimoto , title =

    Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

  25. [30]

    Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

    ClueWeb22: 10 billion web documents with rich information , author=. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

  26. [31]

    Zhang, Xuanyu and Yang, Qing , journal=

  27. [32]

    2023 , eprint=

    Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision , author=. 2023 , eprint=

  28. [33]

    2024 , url=

    Chen, Lichang and Li, Shiyang and Yan, Jun and Wang, Hai and Gunaratna, Kalpa and Yadav, Vikas and Tang, Zheng and Srinivasan, Vijay and Zhou, Tianyi and Huang, Heng and others , booktitle=. 2024 , url=

  29. [34]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Multitask prompted training enables zero-shot task generalization , author=. arXiv preprint arXiv:2110.08207 , year=

  30. [35]

    Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R

    Cross-task generalization via natural language crowdsourcing instructions , author=. arXiv preprint arXiv:2104.08773 , year=

  31. [36]

    arXiv preprint arXiv:2204.07705 , year=

    Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks , author=. arXiv preprint arXiv:2204.07705 , year=

  32. [38]

    Constitutional

    Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and others , journal=. Constitutional

  33. [39]

    Self-critiquing models for assisting human evaluators

    Self-critiquing models for assisting human evaluators , author=. arXiv preprint arXiv:2206.05802 , year=

  34. [40]

    Self-Refine: Iterative Refinement with Self-Feedback

    Self-refine: Iterative refinement with self-feedback , author=. arXiv preprint arXiv:2303.17651 , year=

  35. [41]

    Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme , year=

  36. [42]

    2023 , month =

    Wang, Guan and Cheng, Sijie and Yu, Qiying and Liu, Changling , doi =. 2023 , month =

  37. [43]

    Enhancing chat language models by scaling high-quality instructional conversations

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations , author=. arXiv preprint arXiv:2305.14233 , year=

  38. [45]

    The Curious Case of Neural Text Degeneration

    The curious case of neural text degeneration , author=. arXiv preprint arXiv:1904.09751 , year=

  39. [46]

    arXiv e-prints , pages=

    The Capacity for Moral Self-Correction in Large Language Models , author=. arXiv e-prints , pages=

  40. [47]

    arXiv preprint arXiv:2010.00133 , year=

    CrowS-pairs: A challenge dataset for measuring social biases in masked language models , author=. arXiv preprint arXiv:2010.00133 , year=

  41. [48]

    arXiv preprint arXiv:2306.04751 , year=

    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources , author=. arXiv preprint arXiv:2306.04751 , year=

  42. [50]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Piqa: Reasoning about physical commonsense in natural language , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  43. [52]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Can a suit of armor conduct electricity? a new dataset for open book question answering , author=. arXiv preprint arXiv:1809.02789 , year=

  44. [53]

    Measuring Massive Multitask Language Understanding

    Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

  45. [54]

    2023 , url =

    Xinyang Geng and Arnav Gudibande and Hao Liu and Eric Wallace and Pieter Abbeel and Sergey Levine and Dawn Song , title =. 2023 , url =

  46. [55]

    Thirty-seventh Conference on Neural Information Processing Systems , year=

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

  47. [57]

    The Twelfth International Conference on Learning Representations , year=

    Self-alignment with instruction backtranslation , author=. The Twelfth International Conference on Learning Representations , year=

  48. [59]

    Gonzalez and Ion Stoica , booktitle=

    Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica , booktitle=. Judging. 2023 , url=

  49. [60]

    Achiam, Josh and Adler, Steven and Agarwal, Sandhini and Ahmad, Lama and Akkaya, Ilge and Aleman, Florencia Leoni and Almeida, Diogo and Altenschmidt, Janko and Altman, Sam and Anadkat, Shyamal and others , journal=

  50. [61]

    Visualizing data using

    Van der Maaten, Laurens and Hinton, Geoffrey , journal=. Visualizing data using

  51. [63]

    Lee, Harrison and Phatale, Samrat and Mansoor, Hassan and Lu, Kellie and Mesnard, Thomas and Bishop, Colton and Carbune, Victor and Rastogi, Abhinav , journal=

  52. [64]

    Advances in Neural Information Processing Systems , volume=

    Learning to summarize with human feedback , author=. Advances in Neural Information Processing Systems , volume=

  53. [72]

    Zhao, Yao and Joshi, Rishabh and Liu, Tianqi and Khalman, Misha and Saleh, Mohammad and Liu, Peter J , journal=

  54. [73]

    2023 , url=

    Hongyi Yuan and Zheng Yuan and Chuanqi Tan and Wei Wang and Songfang Huang and Fei Huang , booktitle=. 2023 , url=

  55. [75]

    Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

    Benchmarking Foundation Models with Language-Model-as-an-Examiner , author=. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

  56. [78]

    LLaMA: Open and Efficient Foundation Language Models

    LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  57. [80]

    A unified architecture for natural language processing: Deep neural networks with multitask learning

    A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning

  58. [81]

    Multitask learning

    Multitask learning. Machine Learning, 1997

  59. [82]

    Language models are unsupervised multitask learners

    Language models are unsupervised multitask learners. OpenAI blog

  60. [83]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  61. [84]

    The CRINGE loss: Learning what language not to model

    Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. The CRINGE loss: Learning what language not to model. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8854--8874, Toronto, Canada...

  62. [85]

    Claude 2

    Anthropic. Claude 2. https://www.anthropic.com/index/claude-2, 2023

  63. [86]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a

  64. [87]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022b

  65. [88]

    Benchmarking foundation models with language-model-as-an-examiner

    Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. Benchmarking foundation models with language-model-as-an-examiner. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/for...

  66. [89]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

  67. [90]

    AlpaGasus: Training a better alpaca with fewer data

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. AlpaGasus: Training a better alpaca with fewer data. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=FdVXgSJhvz

  68. [91]

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024b

  69. [92]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  70. [93]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  71. [94]

    A unified architecture for natural language processing: Deep neural networks with multitask learning

    Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160--167, 2008

  72. [95]

    Alpacafarm: A simulation framework for methods that learn from human feedback

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023

  73. [96]

    The Devil Is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

    Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André Martins, Graham Neubig, Ankush Garg, Jonathan Clark, Markus Freitag, and Orhan Firat. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz, editors, Proceedings of...

  74. [97]

    Reinforced Self-Training (ReST) for Language Modeling

    Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023

  75. [98]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

  76. [99]

    Unnatural instructions: Tuning language models with (almost) no human labor

    Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. Unnatural instructions: Tuning language models with (almost) no human labor. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409--14428, Toronto, Canada, July 202...

  77. [100]

    Prometheus: Inducing fine-grained evaluation capability in language models

    Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491, 2023

  78. [101]

    OpenAssistant conversations--democratizing large language model alignment

    Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. OpenAssistant conversations--democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023

  79. [102]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Transac...

  80. [103]

    RLAIF: Scaling reinforcement learning from human feedback with AI feedback

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267, 2023

Showing first 80 references.