pith. machine review for the scientific record.

arxiv: 2310.16944 · v1 · submitted 2023-10-25 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

Zephyr: Direct Distillation of LM Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 10:10 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: language model alignment · direct preference optimization · AI feedback · distillation · chat models · 7B parameters · MT-Bench

The pith

Direct distillation of alignment via AI preference data produces a 7B chat model that beats larger RLHF models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors show how to align a small language model to user intent by using preference rankings from a larger teacher model instead of human feedback. They apply distilled direct preference optimization after initial supervised fine-tuning to create Zephyr-7B. This process takes only a few hours and needs no human annotations. The resulting model sets new standards for 7B parameter chat systems and outperforms the leading open 70B RLHF model on MT-Bench. Sympathetic readers would care because it suggests alignment can be scaled efficiently without costly human data collection.
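The three-stage recipe described above (supervised distillation, AI-feedback ranking by a teacher, then preference optimization on the ranked pairs) can be sketched as a minimal dataflow. This is a hypothetical illustration, not the authors' code: every function name and body here is a placeholder stub, and the length-based ranking stands in for real teacher scores.

```python
# Hypothetical sketch of the Zephyr-style alignment pipeline.
# Stage names follow the paper; every function body is a stub.

def dsft(base_model, instruction_dataset):
    """Stage 1: distilled supervised fine-tuning on teacher-generated dialogues."""
    return {"weights": base_model, "stage": "dSFT"}

def rank_with_teacher(prompts, candidate_outputs):
    """Stage 2: a teacher model scores candidate completions, yielding
    (prompt, chosen, rejected) preference triples -- the AI Feedback data."""
    pairs = []
    for prompt, outputs in zip(prompts, candidate_outputs):
        ranked = sorted(outputs, key=len, reverse=True)  # stand-in for teacher scores
        pairs.append((prompt, ranked[0], ranked[-1]))
    return pairs

def ddpo(sft_model, preference_pairs):
    """Stage 3: distilled DPO -- optimize directly on the ranked pairs,
    with no sampling from the policy during fine-tuning."""
    return {"weights": sft_model["weights"], "stage": "dDPO",
            "num_pairs": len(preference_pairs)}

# Wiring the stages together:
model = dsft("mistral-7b", ["instruction data"])
pairs = rank_with_teacher(["How do I sort a list?"], [["long answer", "ok"]])
zephyr = ddpo(model, pairs)
```

The point of the sketch is the absence of any human-labeling step: the only supervision signals are teacher-generated dialogues and teacher-ranked pairs.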

Core claim

Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models and surpasses Llama2-Chat-70B on MT-Bench.

What carries the argument

distilled direct preference optimization (dDPO), which optimizes the policy directly on preference pairs ranked by a teacher model without requiring online sampling.
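dDPO inherits the standard DPO objective: for a preference pair, it increases the policy's log-probability ratio on the chosen response relative to a frozen reference model, scaled by a temperature β. A minimal scalar sketch, assuming per-sequence log-probabilities have already been computed elsewhere (no claim about the authors' exact implementation):

```python
import math

def dpo_pair_loss(pi_logp_chosen, pi_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin of log-ratios).
    Inputs are sequence log-probs under the policy (pi) and the frozen
    reference (ref); no on-policy sampling is required, matching dDPO."""
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    # numerically stable: -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

At zero margin the loss is log 2 ≈ 0.693, and it decreases monotonically as the policy favors the chosen response over the rejected one relative to the reference.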

If this is right

  • Alignment of language models becomes possible without collecting human preference data.
  • Smaller models can achieve competitive or superior chat performance compared to much larger models trained with RLHF.
  • The training process is efficient, completing in just a few hours on standard hardware.
  • Base models can be quickly adapted to chat capabilities using only AI-generated rankings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method may allow alignment techniques to be applied iteratively with multiple teacher models to reduce bias.
  • Similar distillation approaches could extend to other alignment objectives beyond chat, such as reasoning or safety.
  • If teacher models improve, the quality of distilled alignment could scale without additional human effort.

Load-bearing premise

The preference rankings from the teacher model are sufficiently similar to what humans would provide for the same outputs.
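That premise is directly measurable: given the same output pairs labeled by both the teacher model and human annotators, pairwise agreement reduces to a single fraction. A hypothetical check, with illustrative labels (0/1 is the index of the preferred response in each pair):

```python
def pairwise_agreement(ai_prefs, human_prefs):
    """Fraction of preference pairs on which the AI ranker and human
    annotators pick the same winner."""
    if len(ai_prefs) != len(human_prefs):
        raise ValueError("label lists must be aligned pair-for-pair")
    matches = sum(a == h for a, h in zip(ai_prefs, human_prefs))
    return matches / len(ai_prefs)

# Illustrative: teacher agrees with humans on 3 of 4 pairs
rate = pairwise_agreement([0, 1, 1, 0], [0, 1, 0, 0])
```

A low value here would undermine the load-bearing premise even if benchmark scores remain high, since the distilled preferences would then encode the teacher's idiosyncrasies rather than human intent.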

What would settle it

Independent human evaluations on MT-Bench prompts comparing Zephyr-7B against human-feedback RLHF models such as Llama2-Chat-70B: if human raters score Zephyr-7B lower while the GPT-4 judge scores it higher, the reported gains would reflect evaluator matching rather than genuine alignment.

read the original abstract

We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims to produce Zephyr-7B, a 7B LM with improved intent alignment by applying distilled direct preference optimization (dDPO) to AI Feedback (AIF) preference rankings from a teacher model on UltraFeedback. It reports SOTA chat benchmark performance for 7B models, surpassing Llama2-Chat-70B on MT-Bench, with no human annotation needed, and releases code, models, and data.

Significance. If the results hold, this work demonstrates an efficient distillation method for alignment using only AIF data, achieving strong benchmark performance without RLHF or human feedback. The public release of all artifacts is a key strength for reproducibility.

major comments (1)
  1. [Abstract] The claim of surpassing Llama2-Chat-70B on MT-Bench is load-bearing for the SOTA assertion, but without reported human-AIF correlation or controls for teacher bias in the evaluation (MT-Bench uses GPT-4 as judge), it is unclear if the gains reflect genuine alignment or evaluator matching.
minor comments (2)
  1. Provide more explicit details on the dDPO implementation, including the exact loss formulation and how it differs from standard DPO.
  2. [Abstract] Specify the size and identity of the teacher model used for AIF.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for identifying a potential source of bias in our primary evaluation. We address the concern directly below and propose a targeted revision to improve clarity without altering the reported results.

read point-by-point responses
  1. Referee: [Abstract] The claim of surpassing Llama2-Chat-70B on MT-Bench is load-bearing for the SOTA assertion, but without reported human-AIF correlation or controls for teacher bias in the evaluation (MT-Bench uses GPT-4 as judge), it is unclear if the gains reflect genuine alignment or evaluator matching.

    Authors: We acknowledge that both the UltraFeedback preference data and the MT-Bench judgments rely on GPT-4, which could in principle favor models whose outputs align with GPT-4's stylistic preferences. To address this, we note that the performance lift of Zephyr-7B over its dSFT baseline (which uses identical data but no preference optimization) is observed consistently across MT-Bench, AlpacaEval, and the Open LLM Leaderboard, where the latter two employ different judging protocols. This differential improvement suggests that dDPO contributes alignment gains beyond simple teacher-style matching. We did not collect new human-AIF correlation statistics in this work, as the manuscript focuses on the efficiency of the distillation pipeline rather than re-validating AIF; existing literature (e.g., on UltraFeedback) already reports moderate-to-high correlation for similar setups. In the revised version we will (i) qualify the abstract claim to specify the MT-Bench protocol, (ii) add a short paragraph in the evaluation section discussing the shared GPT-4 source and the cross-benchmark consistency, and (iii) include a limitations note on the absence of fresh human correlation data. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline uses external teacher and independent benchmarks

full rationale

The paper's core claim is an empirical result: training Zephyr-7B via dSFT followed by dDPO on AIF rankings from an external teacher model, then measuring performance on public chat benchmarks (MT-Bench, etc.). No derivation step reduces a prediction to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation whose validity is internal to the paper. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes that smuggle in the target result. Minor self-citations to prior alignment work exist but are not load-bearing for the reported performance numbers.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that AI-generated preference rankings can substitute for human ones and on standard supervised optimization assumptions; no new entities are postulated and the only free parameters are ordinary training hyperparameters.

free parameters (1)
  • learning rate and batch size
    Standard hyperparameters chosen during the few-hour training run; their specific values are not required for the high-level claim but affect final numbers.
axioms (1)
  • domain assumption: AI Feedback preference rankings correlate sufficiently with human intent to produce aligned behavior
    Invoked when the authors replace human annotation with teacher-model rankings to generate the dDPO dataset.

pith-pipeline@v0.9.0 · 5544 in / 1323 out tokens · 39173 ms · 2026-05-16T10:10:11.233482+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  2. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  3. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  4. IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...

  5. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

    cs.CL 2024-06 unverdicted novelty 7.0

    Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...

  6. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

  7. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  8. Leveraging RAG for Training-Free Alignment of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...

  9. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  10. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  11. DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

    cs.CL 2026-04 unverdicted novelty 6.0

    DART raises difference-awareness accuracy from 39% to 68.8% on benchmarks while cutting harm-drift cases by 72.6% and improving real-world appropriate responses from 39.8% to 77.5%.

  12. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  13. RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!

    cs.IR 2023-12 conditional novelty 6.0

    RankZephyr is a new open-source LLM that closes the effectiveness gap with GPT-4 for zero-shot listwise reranking while showing robustness to input ordering and document count.

  14. Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents

    cs.CR 2026-04 unverdicted novelty 5.0

    Targeted prompting and system interventions enable local LLMs such as Llama 3.1 70B to exploit 83% of tested Linux privilege escalation vulnerabilities.

  15. Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

    cs.CL 2026-04 unverdicted novelty 5.0

    Continual pre-training on a German medical corpus lets 7B models close much of the performance gap with 24B general models on medical benchmarks, though merging introduces some language mixing and verbosity.

  16. Calibrating Model-Based Evaluation Metrics for Summarization

    cs.CL 2026-04 unverdicted novelty 5.0

    A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

  17. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

    cs.CV 2026-04 unverdicted novelty 5.0

    OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

  18. MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

    cs.CY 2026-04 unverdicted novelty 4.0

    MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.

  19. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 18 Pith papers · 13 internal anchors

  1. [1]

    Proximal Policy Optimization Algorithms

    Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg. Proximal Policy Optimization Algorithms. arXiv:1707.06347

  2. [2]

    Lila: A Unified Benchmark for Mathematical Reasoning

    Mishra, Swaroop and Finlayson, Matthew and Lu, Pan and Tang, Leonard and Welleck, Sean and Baral, Chitta and Rajpurohit, Tanmay and Tafjord, Oyvind and Sabharwal, Ashish and Clark, Peter and Kalyan, Ashwin. Lila: A Unified Benchmark for Mathematical Reasoning. arXiv:2210.17517

  3. [3]

    The False Promise of Imitating Proprietary LLMs

    Gudibande, Arnav and Wallace, Eric and Snell, Charlie and Geng, Xinyang and Liu, Hao and Abbeel, Pieter and Levine, Sergey and Song, Dawn. The False Promise of Imitating Proprietary LLMs. arXiv:2305.15717

  4. [4]

    Measuring Massive Multitask Language Understanding

    Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob. Measuring Massive Multitask Language Understanding. arXiv [cs.CY]

  5. [5]

    MetaMath : Bootstrap Your Own Mathematical Questions for Large Language Models

    Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang. MetaMath : Bootstrap Your Own Mathematical Questions for Large Language Models. arXiv preprint arXiv:2309. 12284

  6. [6]

    GPT-J-6B : A 6 billion parameter autoregressive language model

    Wang, Ben and Komatsuzaki, Aran. GPT-J-6B : A 6 billion parameter autoregressive language model

  7. [7]

    Alpaca: A strong, replicable instruction-following model

    Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html

  8. [8]

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    Ding, Ning and Chen, Yulin and Xu, Bokai and Qin, Yujia and Zheng, Zhi and Hu, Shengding and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. arXiv:2305.14233

  9. [9]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. arXiv:2303.08774

  10. [10]

    UltraFeedback : Boosting Language Models with High-quality Feedback

    Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong. UltraFeedback : Boosting Language Models with High-quality Feedback. arXiv:2310.01377

  11. [11]

    XGen-7B Technical Report

    Nijkamp, Erik and Xie, Tian and Hayashi, Hiroaki and Pang, Bo and Xia, Congying and Xing, Chen and Vig, Jesse and Yavuz, Semih and Laban, Philippe and Krause, Ben and Purushwalkam, Senthil and Niu, Tong and Kry \'s ci \'n ski, Wojciech and Murakhovs'ka, Lidiya and Choubey, Prafulla Kumar and Fabbri, Alex and Liu, Ye and Meng, Rui and Tu, Lifu and Bhat, Me...

  12. [12]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P and Zhang, Hao and Gonzalez, Joseph E and Stoica, Ion. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685

  13. [13]

    Constitutional AI: Harmlessness from AI Feedback

    Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and Chen, Carol and Olsson, Catherine and Olah, Christopher and Hernandez, Danny and Drain, Dawn and Ganguli, Deep and Li, Dustin and Tran-Johnson, Eli and Perez, Ethan an...

  14. [14]

    Training language models to follow instructions with human feedback

    Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, Ry...

  15. [15]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and Bikel, Dan and Blecher, Lukas and Ferrer, Cristian Canton and Chen, Moya and Cucurull, Guillem and Esiobu, David and Fernandes, Jude and Fu, Jeremy and Fu, Wenyi...

  16. [16]

    Mistral 7B

    Jiang, Albert Q and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and Lavaud, L \'e lio Renard and Lachaux, Marie-Anne and Stock, Pierre and Le Scao, Teven and Lavril, Thibaut and Wang, Thomas and Lacroix...

  17. [17]

    Scaling Instruction-Finetuned Language Models

    Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Yunxuan and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and Webson, Albert and Gu, Shixiang Shane and Dai, Zhuyun and Suzgun, Mirac and Chen, Xinyun and Chowdhery, Aakanksha and Castro-Ros, Alex and Pellat, Marie and Robinson, Kevin and V...

  18. [18]

    Self-Instruct : Aligning Language Models with Self-Generated Instructions

    Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A and Khashabi, Daniel and Hajishirzi, Hannaneh. Self-Instruct : Aligning Language Models with Self-Generated Instructions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  19. [19]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and Joseph, Nicholas and Kadavath, Saurav and Kernion, Jackson and Conerly, Tom and El-Showk, Sheer and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Hume, Tristan and...

  20. [20]

    MAmmoTH : Building math generalist models through hybrid instruction tuning

    Chen, Wenhu. MAmmoTH : Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309. 05653

  21. [21]

    Open LLM Leaderboard

    Edward Beeching, Cl \'e mentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, Thomas Wolf. Open LLM Leaderboard

  22. [22]

    NEFTune : Noisy Embeddings Improve Instruction Finetuning

    Jain, Neel and Chiang, Ping-Yeh and Wen, Yuxin and Kirchenbauer, John and Chu, Hong-Min and Somepalli, Gowthami and Bartoldson, Brian R and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom. NEFTune : Noisy Embeddings Improve Instruction Finetuning. arXiv:2310.05914

  23. [23]

    TRL : Transformer Reinforcement Learning

    von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi. TRL : Transformer Reinforcement Learning. GitHub repository

  24. [24]

    Team, Xwin-Lm. Xwin-LM

  25. [25]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Sanh, Victor and Webson, Albert and Raffel, Colin and Bach, Stephen H and Sutawika, Lintang and Alyafeai, Zaid and Chaffin, Antoine and Stiegler, Arnaud and Le Scao, Teven and Raja, Arun and Dey, Manan and Saiful Bari, M and Xu, Canwen and Thakker, Urmish and Sharma, Shanya Sharma and Szczechla, Eliza and Kim, Taewoon and Chhablani, Gunjan and Nayak, Niha...

  26. [26]

    Wizardlm: Empowering large language models to follow complex instructions

    Xu, Can and Sun, Qingfeng and Zheng, Kai and Geng, Xiubo and Zhao, Pu and Feng, Jiazhan and Tao, Chongyang and Jiang, Daxin. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304. 12244

  27. [27]

    AlpacaEval : An Automatic Evaluator of Instruction-following Models

    Li, Xuechen and Zhang, Tianyi and Dubois, Yann and Taori, Rohan and Gulrajani, Ishaan and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. AlpacaEval : An Automatic Evaluator of Instruction-following Models. GitHub repository

  28. [28]

    A framework for few-shot language model evaluation

    Gao, Leo and Tow, Jonathan and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and McDonell, Kyle and Muennighoff, Niklas and Phang, Jason and Reynolds, Laria and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy. A framework for few-shot language model evaluation

  29. [29]

    Think you have Solved Question Answering? Try ARC , the AI2 Reasoning Challenge

    Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind. Think you have Solved Question Answering? Try ARC , the AI2 Reasoning Challenge. arXiv [cs.AI]

  30. [30]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290

  31. [31]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ ChatGPT Quality

    Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E and Stoica, Ion and Xing, Eric P. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ ChatGPT Quality

  32. [32]

    HellaSwag : Can a Machine Really Finish Your Sentence?

    Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin. HellaSwag : Can a Machine Really Finish Your Sentence?. arXiv [cs.CL]

  33. [33]

    TruthfulQA : Measuring How Models Mimic Human Falsehoods

    Lin, Stephanie and Hilton, Jacob and Evans, Owain. TruthfulQA : Measuring How Models Mimic Human Falsehoods. arXiv [cs.CL]

  34. [34]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, R \'e mi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Dram...

  35. [35]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  36. [36]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  37. [37]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  38. [38]

    2023 , eprint=

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only , author=. 2023 , eprint=

  39. [39]

    2023 , url=

    Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned and chat models , author=. 2023 , url=

  40. [40]

    2023 , url=

    Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs , author=. 2023 , url=

  41. [41]

    2023 , eprint=

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback , author=. 2023 , eprint=

  42. [42]

    2023 , eprint=

    Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance , author=. 2023 , eprint=

  43. [43]

    ChatEval: A Tool for Chatbot Evaluation

    Sedoc, Jo a o and Ippolito, Daphne and Kirubarajan, Arun and Thirani, Jai and Ungar, Lyle and Callison-Burch, Chris. ChatEval: A Tool for Chatbot Evaluation. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). 2019

  44. [44]

    GitHub repository

    FastEval. GitHub repository

  45. [45]

    2023 , publisher =

    Edward Beeching and Clémentine Fourrier and Nathan Habib and Sheon Han and Nathan Lambert and Nazneen Rajani and Omar Sanseviero and Lewis Tunstall and Thomas Wolf , title =. 2023 , publisher =

  46. [46]

    2020 , eprint=

    ZeRO: Memory Optimizations Toward Training Trillion Parameter Models , author=. 2020 , eprint=

  47. [47]

    2021 , eprint=

    LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

  48. [48]

    Dao, Tri , year=. Flash

  49. [49]

    2023 , eprint=

    QLoRA: Efficient Finetuning of Quantized LLMs , author=. 2023 , eprint=

  50. [50]

    De Vries, Harm , title =

  51. [51]

    2022 , eprint=

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author=. 2022 , eprint=

  52. [52]

    Releasing 3b and 7b redpajama-incite family of models including base, instruction-tuned and chat models, 2023

    Together AI. Releasing 3b and 7b redpajama-incite family of models including base, instruction-tuned and chat models, 2023. URL https://together.ai/blog/redpajama-models-v1

  53. [53]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  54. [54]

    Open llm leaderboard

    Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

  55. [55]

    Vicuna: An Open-Source chatbot impressing GPT-4 with 90\

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, Ion Stoica, and Eric P Xing. Vicuna: An Open-Source chatbot impressing GPT-4 with 90\

  56. [56]

    Scaling Instruction-Finetuned language models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping H...

  57. [57]

    Think you have solved question answering? try ARC , the AI2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC , the AI2 reasoning challenge, 2018

  58. [58]

    UltraFeedback : Boosting language models with high-quality feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback : Boosting language models with high-quality feedback. October 2023

  59. [59]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023

  60. [60]

    Go smol or go home, 2023

    Harm De Vries. Go smol or go home, 2023. URL https://www.harmdevries.com/post/model-size-vs-compute-overhead/

  61. [61]

    QLoRA: Efficient finetuning of quantized LLMs, 2023

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs, 2023

  62. [62]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. May 2023

  63. [63]

    AlpacaFarm: A simulation framework for methods that learn from human feedback, 2023

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaFarm: A simulation framework for methods that learn from human feedback, 2023

  64. [64]

    Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance, 2023

    Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance, 2023

  65. [65]

    The false promise of imitating proprietary LLMs

    Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary LLMs. May 2023

  66. [66]

    Measuring massive multitask language understanding, 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021

  67. [67]

    LoRA: Low-rank adaptation of large language models, 2021

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021

  68. [68]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. October 2023

  69. [69]

    AlpacaEval: An automatic evaluator of instruction-following models, 2023

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models, 2023

  70. [70]

    TruthfulQA: Measuring how models mimic human falsehoods, 2022

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods, 2022

  71. [71]

    Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023

    MosaicML. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. URL https://www.mosaicml.com/blog/mpt-7b

  72. [72]

    GPT-4 technical report

    OpenAI. GPT-4 technical report. March 2023

  73. [73]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. ...

  74. [74]

    The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023

  75. [75]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. May 2023

  76. [76]

    ZeRO: Memory optimizations toward training trillion parameter models, 2020

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models, 2020

  77. [77]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matte...

  78. [78]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. July 2017

  79. [79]

    Chateval: A tool for chatbot evaluation

    João Sedoc, Daphne Ippolito, Arun Kirubarajan, Jai Thirani, Lyle Ungar, and Chris Callison-Burch. Chateval: A tool for chatbot evaluation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 60-65. Association for Computational Linguistics, 2019. URL http://aclweb.o...

  80. [80]

    Alpaca: A strong, replicable instruction-following model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023

Showing first 80 references.