pith. machine review for the scientific record.

arxiv: 2310.16944 · v1 · submitted 2023-10-25 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

Zephyr: Direct Distillation of LM Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 10:10 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: language model alignment · direct preference optimization · AI feedback · distillation · chat models · 7B parameters · MT-Bench

The pith

Direct distillation of alignment via AI preference data produces a 7B chat model that beats larger RLHF models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors show how to align a small language model to user intent by using preference rankings from a larger teacher model instead of human feedback. They apply distilled direct preference optimization after initial supervised fine-tuning to create Zephyr-7B. This process takes only a few hours and needs no human annotations. The resulting model sets new standards for 7B parameter chat systems and outperforms the leading open 70B RLHF model on MT-Bench. Sympathetic readers would care because it suggests alignment can be scaled efficiently without costly human data collection.
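The three-stage recipe described above (supervised distillation, AI-feedback ranking by a teacher, then preference optimization on the ranked pairs) can be sketched as a minimal dataflow. This is a hypothetical illustration, not the authors' code: every function name and body here is a placeholder stub, and the length-based ranking stands in for real teacher scores.

```python
# Hypothetical sketch of the Zephyr-style alignment pipeline.
# Stage names follow the paper; every function body is a stub.

def dsft(base_model, instruction_dataset):
    """Stage 1: distilled supervised fine-tuning on teacher-generated dialogues."""
    return {"weights": base_model, "stage": "dSFT"}

def rank_with_teacher(prompts, candidate_outputs):
    """Stage 2: a teacher model scores candidate completions, yielding
    (prompt, chosen, rejected) preference triples -- the AI Feedback data."""
    pairs = []
    for prompt, outputs in zip(prompts, candidate_outputs):
        ranked = sorted(outputs, key=len, reverse=True)  # stand-in for teacher scores
        pairs.append((prompt, ranked[0], ranked[-1]))
    return pairs

def ddpo(sft_model, preference_pairs):
    """Stage 3: distilled DPO -- optimize directly on the ranked pairs,
    with no sampling from the policy during fine-tuning."""
    return {"weights": sft_model["weights"], "stage": "dDPO",
            "num_pairs": len(preference_pairs)}

# Wiring the stages together:
model = dsft("mistral-7b", ["instruction data"])
pairs = rank_with_teacher(["How do I sort a list?"], [["long answer", "ok"]])
zephyr = ddpo(model, pairs)
```

The point of the sketch is the absence of any human-labeling step: the only supervision signals are teacher-generated dialogues and teacher-ranked pairs.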

Core claim

Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models and surpasses Llama2-Chat-70B on MT-Bench.

What carries the argument

distilled direct preference optimization (dDPO), which optimizes the policy directly on preference pairs ranked by a teacher model without requiring online sampling.
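dDPO inherits the standard DPO objective: for a preference pair, it increases the policy's log-probability ratio on the chosen response relative to a frozen reference model, scaled by a temperature β. A minimal scalar sketch, assuming per-sequence log-probabilities have already been computed elsewhere (no claim about the authors' exact implementation):

```python
import math

def dpo_pair_loss(pi_logp_chosen, pi_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin of log-ratios).
    Inputs are sequence log-probs under the policy (pi) and the frozen
    reference (ref); no on-policy sampling is required, matching dDPO."""
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    # numerically stable: -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

At zero margin the loss is log 2 ≈ 0.693, and it decreases monotonically as the policy favors the chosen response over the rejected one relative to the reference.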

If this is right

  • Alignment of language models becomes possible without collecting human preference data.
  • Smaller models can achieve competitive or superior chat performance compared to much larger models trained with RLHF.
  • The training process is efficient, completing in just a few hours on standard hardware.
  • Base models can be quickly adapted to chat capabilities using only AI-generated rankings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method may allow alignment techniques to be applied iteratively with multiple teacher models to reduce bias.
  • Similar distillation approaches could extend to other alignment objectives beyond chat, such as reasoning or safety.
  • If teacher models improve, the quality of distilled alignment could scale without additional human effort.

Load-bearing premise

The preference rankings from the teacher model are sufficiently similar to what humans would provide for the same outputs.
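That premise is directly measurable: given the same output pairs labeled by both the teacher model and human annotators, pairwise agreement reduces to a single fraction. A hypothetical check, with illustrative labels (0/1 is the index of the preferred response in each pair):

```python
def pairwise_agreement(ai_prefs, human_prefs):
    """Fraction of preference pairs on which the AI ranker and human
    annotators pick the same winner."""
    if len(ai_prefs) != len(human_prefs):
        raise ValueError("label lists must be aligned pair-for-pair")
    matches = sum(a == h for a, h in zip(ai_prefs, human_prefs))
    return matches / len(ai_prefs)

# Illustrative: teacher agrees with humans on 3 of 4 pairs
rate = pairwise_agreement([0, 1, 1, 0], [0, 1, 0, 0])
```

A low value here would undermine the load-bearing premise even if benchmark scores remain high, since the distilled preferences would then encode the teacher's idiosyncrasies rather than human intent.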

What would settle it

Independent human evaluations on MT-Bench prompts comparing Zephyr-7B against human-feedback RLHF models such as Llama2-Chat-70B: if human raters score Zephyr-7B lower while the GPT-4 judge scores it higher, the reported gains would reflect evaluator matching rather than genuine alignment.

read the original abstract

We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims to produce Zephyr-7B, a 7B LM with improved intent alignment by applying distilled direct preference optimization (dDPO) to AI Feedback (AIF) preference rankings from a teacher model on UltraFeedback. It reports SOTA chat benchmark performance for 7B models, surpassing Llama2-Chat-70B on MT-Bench, with no human annotation needed, and releases code, models, and data.

Significance. If the results hold, this work demonstrates an efficient distillation method for alignment using only AIF data, achieving strong benchmark performance without RLHF or human feedback. The public release of all artifacts is a key strength for reproducibility.

major comments (1)
  1. [Abstract] The claim of surpassing Llama2-Chat-70B on MT-Bench is load-bearing for the SOTA assertion, but without reported human-AIF correlation or controls for teacher bias in the evaluation (MT-Bench uses GPT-4 as judge), it is unclear if the gains reflect genuine alignment or evaluator matching.
minor comments (2)
  1. Provide more explicit details on the dDPO implementation, including the exact loss formulation and how it differs from standard DPO.
  2. [Abstract] Specify the size and identity of the teacher model used for AIF.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for identifying a potential source of bias in our primary evaluation. We address the concern directly below and propose a targeted revision to improve clarity without altering the reported results.

read point-by-point responses
  1. Referee: [Abstract] The claim of surpassing Llama2-Chat-70B on MT-Bench is load-bearing for the SOTA assertion, but without reported human-AIF correlation or controls for teacher bias in the evaluation (MT-Bench uses GPT-4 as judge), it is unclear if the gains reflect genuine alignment or evaluator matching.

    Authors: We acknowledge that both the UltraFeedback preference data and the MT-Bench judgments rely on GPT-4, which could in principle favor models whose outputs align with GPT-4's stylistic preferences. To address this, we note that the performance lift of Zephyr-7B over its dSFT baseline (which uses identical data but no preference optimization) is observed consistently across MT-Bench, AlpacaEval, and the Open LLM Leaderboard, where the latter two employ different judging protocols. This differential improvement suggests that dDPO contributes alignment gains beyond simple teacher-style matching. We did not collect new human-AIF correlation statistics in this work, as the manuscript focuses on the efficiency of the distillation pipeline rather than re-validating AIF; existing literature (e.g., on UltraFeedback) already reports moderate-to-high correlation for similar setups. In the revised version we will (i) qualify the abstract claim to specify the MT-Bench protocol, (ii) add a short paragraph in the evaluation section discussing the shared GPT-4 source and the cross-benchmark consistency, and (iii) include a limitations note on the absence of fresh human correlation data. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline uses external teacher and independent benchmarks

full rationale

The paper's core claim is an empirical result: training Zephyr-7B via dSFT followed by dDPO on AIF rankings from an external teacher model, then measuring performance on public chat benchmarks (MT-Bench, etc.). No derivation step reduces a prediction to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation whose validity is internal to the paper. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes that smuggle in the target result. Minor self-citations to prior alignment work exist but are not load-bearing for the reported performance numbers.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that AI-generated preference rankings can substitute for human ones and on standard supervised optimization assumptions; no new entities are postulated and the only free parameters are ordinary training hyperparameters.

free parameters (1)
  • learning rate and batch size
    Standard hyperparameters chosen during the few-hour training run; their specific values are not required for the high-level claim but affect final numbers.
axioms (1)
  • domain assumption: AI Feedback preference rankings correlate sufficiently with human intent to produce aligned behavior
    Invoked when the authors replace human annotation with teacher-model rankings to generate the dDPO dataset.

pith-pipeline@v0.9.0 · 5544 in / 1323 out tokens · 39173 ms · 2026-05-16T10:10:11.233482+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ORPO: Monolithic Preference Optimization without Reference Model

    cs.CL 2024-03 conditional novelty 8.0

    ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

  2. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  3. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  4. IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...

  5. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

    cs.CL 2024-06 unverdicted novelty 7.0

    Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...

  6. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

  7. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  8. Leveraging RAG for Training-Free Alignment of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...

  9. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  10. You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

    cs.CR 2026-05 unverdicted novelty 6.0

    NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...

  11. DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

    cs.CL 2026-04 unverdicted novelty 6.0

    DART raises difference-awareness accuracy from 39% to 68.8% on benchmarks while cutting harm-drift cases by 72.6% and improving real-world appropriate responses from 39.8% to 77.5%.

  12. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  13. RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!

    cs.IR 2023-12 conditional novelty 6.0

    RankZephyr is a new open-source LLM that closes the effectiveness gap with GPT-4 for zero-shot listwise reranking while showing robustness to input ordering and document count.

  14. Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents

    cs.CR 2026-04 unverdicted novelty 5.0

    Targeted prompting and system interventions enable local LLMs such as Llama 3.1 70B to exploit 83% of tested Linux privilege escalation vulnerabilities.

  15. Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

    cs.CL 2026-04 unverdicted novelty 5.0

    Continual pre-training on a German medical corpus lets 7B models close much of the performance gap with 24B general models on medical benchmarks, though merging introduces some language mixing and verbosity.

  16. Calibrating Model-Based Evaluation Metrics for Summarization

    cs.CL 2026-04 unverdicted novelty 5.0

    A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

  17. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

    cs.CV 2026-04 unverdicted novelty 5.0

    OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

  18. MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction

    cs.CY 2026-04 unverdicted novelty 4.0

    MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.

  19. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 18 Pith papers · 13 internal anchors

  1. [1]

    Proximal Policy Optimization Algorithms

    Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg. Proximal Policy Optimization Algorithms. arXiv:1707.06347

  2. [2]

    Lila: A Unified Benchmark for Mathematical Reasoning

    Mishra, Swaroop and Finlayson, Matthew and Lu, Pan and Tang, Leonard and Welleck, Sean and Baral, Chitta and Rajpurohit, Tanmay and Tafjord, Oyvind and Sabharwal, Ashish and Clark, Peter and Kalyan, Ashwin. Lila: A Unified Benchmark for Mathematical Reasoning. arXiv:2210.17517

  3. [3]

    The False Promise of Imitating Proprietary LLMs

    Gudibande, Arnav and Wallace, Eric and Snell, Charlie and Geng, Xinyang and Liu, Hao and Abbeel, Pieter and Levine, Sergey and Song, Dawn. The False Promise of Imitating Proprietary LLMs. arXiv:2305.15717

  4. [4]

    Measuring Massive Multitask Language Understanding

    Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob. Measuring Massive Multitask Language Understanding. arXiv [cs.CY]

  5. [5]

    MetaMath : Bootstrap Your Own Mathematical Questions for Large Language Models

    Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang. MetaMath : Bootstrap Your Own Mathematical Questions for Large Language Models. arXiv preprint arXiv:2309. 12284

  6. [6]

    GPT-J-6B : A 6 billion parameter autoregressive language model

    Wang, Ben and Komatsuzaki, Aran. GPT-J-6B : A 6 billion parameter autoregressive language model

  7. [7]

    Alpaca: A strong, replicable instruction-following model

    Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html

  8. [8]

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    Ding, Ning and Chen, Yulin and Xu, Bokai and Qin, Yujia and Zheng, Zhi and Hu, Shengding and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. arXiv:2305.14233

  9. [9]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. arXiv:2303.08774

  10. [10]

    UltraFeedback : Boosting Language Models with High-quality Feedback

    Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong. UltraFeedback : Boosting Language Models with High-quality Feedback. arXiv:2310.01377

  11. [11]

    XGen-7B Technical Report

    Nijkamp, Erik and Xie, Tian and Hayashi, Hiroaki and Pang, Bo and Xia, Congying and Xing, Chen and Vig, Jesse and Yavuz, Semih and Laban, Philippe and Krause, Ben and Purushwalkam, Senthil and Niu, Tong and Kry \'s ci \'n ski, Wojciech and Murakhovs'ka, Lidiya and Choubey, Prafulla Kumar and Fabbri, Alex and Liu, Ye and Meng, Rui and Tu, Lifu and Bhat, Me...

  12. [12]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P and Zhang, Hao and Gonzalez, Joseph E and Stoica, Ion. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685

  13. [13]

    Constitutional AI: Harmlessness from AI Feedback

    Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and Chen, Carol and Olsson, Catherine and Olah, Christopher and Hernandez, Danny and Drain, Dawn and Ganguli, Deep and Li, Dustin and Tran-Johnson, Eli and Perez, Ethan an...

  14. [14]

    Training language models to follow instructions with human feedback

    Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, Ry...

  15. [15]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and Bikel, Dan and Blecher, Lukas and Ferrer, Cristian Canton and Chen, Moya and Cucurull, Guillem and Esiobu, David and Fernandes, Jude and Fu, Jeremy and Fu, Wenyi...

  16. [16]

    Mistral 7B

    Jiang, Albert Q and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and Lavaud, L \'e lio Renard and Lachaux, Marie-Anne and Stock, Pierre and Le Scao, Teven and Lavril, Thibaut and Wang, Thomas and Lacroix...

  17. [17]

    Scaling Instruction-Finetuned Language Models

    Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Yunxuan and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and Webson, Albert and Gu, Shixiang Shane and Dai, Zhuyun and Suzgun, Mirac and Chen, Xinyun and Chowdhery, Aakanksha and Castro-Ros, Alex and Pellat, Marie and Robinson, Kevin and V...

  18. [18]

    Self-Instruct : Aligning Language Models with Self-Generated Instructions

    Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A and Khashabi, Daniel and Hajishirzi, Hannaneh. Self-Instruct : Aligning Language Models with Self-Generated Instructions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  19. [19]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and Joseph, Nicholas and Kadavath, Saurav and Kernion, Jackson and Conerly, Tom and El-Showk, Sheer and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Hume, Tristan and...

  20. [20]

    MAmmoTH : Building math generalist models through hybrid instruction tuning

    Chen, Wenhu. MAmmoTH : Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309. 05653

  21. [21]

    Open LLM Leaderboard

    Edward Beeching, Cl \'e mentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, Thomas Wolf. Open LLM Leaderboard

  22. [22]

    NEFTune : Noisy Embeddings Improve Instruction Finetuning

    Jain, Neel and Chiang, Ping-Yeh and Wen, Yuxin and Kirchenbauer, John and Chu, Hong-Min and Somepalli, Gowthami and Bartoldson, Brian R and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom. NEFTune : Noisy Embeddings Improve Instruction Finetuning. arXiv:2310.05914

  23. [23]

    TRL : Transformer Reinforcement Learning

    von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi. TRL : Transformer Reinforcement Learning. GitHub repository

  24. [24]

    Team, Xwin-Lm. Xwin-LM

  25. [25]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Sanh, Victor and Webson, Albert and Raffel, Colin and Bach, Stephen H and Sutawika, Lintang and Alyafeai, Zaid and Chaffin, Antoine and Stiegler, Arnaud and Le Scao, Teven and Raja, Arun and Dey, Manan and Saiful Bari, M and Xu, Canwen and Thakker, Urmish and Sharma, Shanya Sharma and Szczechla, Eliza and Kim, Taewoon and Chhablani, Gunjan and Nayak, Niha...

  26. [26]

    Wizardlm: Empowering large language models to follow complex instructions

    Xu, Can and Sun, Qingfeng and Zheng, Kai and Geng, Xiubo and Zhao, Pu and Feng, Jiazhan and Tao, Chongyang and Jiang, Daxin. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304. 12244

  27. [27]

    AlpacaEval : An Automatic Evaluator of Instruction-following Models

    Li, Xuechen and Zhang, Tianyi and Dubois, Yann and Taori, Rohan and Gulrajani, Ishaan and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. AlpacaEval : An Automatic Evaluator of Instruction-following Models. GitHub repository

  28. [28]

    A framework for few-shot language model evaluation

    Gao, Leo and Tow, Jonathan and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and McDonell, Kyle and Muennighoff, Niklas and Phang, Jason and Reynolds, Laria and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy. A framework for few-shot language model evaluation

  29. [29]

    Think you have Solved Question Answering? Try ARC , the AI2 Reasoning Challenge

    Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind. Think you have Solved Question Answering? Try ARC , the AI2 Reasoning Challenge. arXiv [cs.AI]

  30. [30]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290

  31. [31]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ ChatGPT Quality

    Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E and Stoica, Ion and Xing, Eric P. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ ChatGPT Quality

  32. [32]

    HellaSwag : Can a Machine Really Finish Your Sentence?

    Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin. HellaSwag : Can a Machine Really Finish Your Sentence?. arXiv [cs.CL]

  33. [33]

    TruthfulQA : Measuring How Models Mimic Human Falsehoods

    Lin, Stephanie and Hilton, Jacob and Evans, Owain. TruthfulQA : Measuring How Models Mimic Human Falsehoods. arXiv [cs.CL]

  34. [34]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, R \'e mi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Dram...

  35. [35]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  36. [36]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  37. [37]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  38. [38]

    2023 , eprint=

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only , author=. 2023 , eprint=

  39. [39]

    2023 , url=

    Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned and chat models , author=. 2023 , url=

  40. [40]

    2023 , url=

    Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs , author=. 2023 , url=

  41. [41]

    2023 , eprint=

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback , author=. 2023 , eprint=

  42. [42]

    2023 , eprint=

    Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance , author=. 2023 , eprint=

  43. [43]

    ChatEval: A Tool for Chatbot Evaluation

    Sedoc, Jo a o and Ippolito, Daphne and Kirubarajan, Arun and Thirani, Jai and Ungar, Lyle and Callison-Burch, Chris. ChatEval: A Tool for Chatbot Evaluation. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). 2019

  44. [44]

    GitHub repository

    FastEval. GitHub repository

  45. [45]

    2023 , publisher =

    Edward Beeching and Clémentine Fourrier and Nathan Habib and Sheon Han and Nathan Lambert and Nazneen Rajani and Omar Sanseviero and Lewis Tunstall and Thomas Wolf , title =. 2023 , publisher =

  46. [46]

    2020 , eprint=

    ZeRO: Memory Optimizations Toward Training Trillion Parameter Models , author=. 2020 , eprint=

  47. [47]

    2021 , eprint=

    LoRA: Low-Rank Adaptation of Large Language Models , author=. 2021 , eprint=

  48. [48]

    Dao, Tri , year=. Flash

  49. [49]

    2023 , eprint=

    QLoRA: Efficient Finetuning of Quantized LLMs , author=. 2023 , eprint=

  50. [50]

    De Vries, Harm , title =

  51. [51]

    2022 , eprint=

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback , author=. 2022 , eprint=

  52. [52]

    Releasing 3b and 7b redpajama-incite family of models including base, instruction-tuned and chat models, 2023

    Together AI. Releasing 3b and 7b redpajama-incite family of models including base, instruction-tuned and chat models, 2023. URL https://together.ai/blog/redpajama-models-v1

  53. [53]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  54. [54]

    Open llm leaderboard

    Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

  55. [55]

    Vicuna: An Open-Source chatbot impressing GPT-4 with 90\

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, Ion Stoica, and Eric P Xing. Vicuna: An Open-Source chatbot impressing GPT-4 with 90\

  56. [56]

    Scaling Instruction-Finetuned language models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping H...

  57. [57]

    Think you have solved question answering? try ARC , the AI2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC , the AI2 reasoning challenge, 2018

  58. [58]

    UltraFeedback : Boosting language models with high-quality feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback : Boosting language models with high-quality feedback. October 2023

  59. [59]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023

  60. [60]

    Go smol or go home, 2023

    Harm De Vries. Go smol or go home, 2023. URL https://www.harmdevries.com/post/model-size-vs-compute-overhead/

  61. [61]

    QLoRA: Efficient finetuning of quantized LLMs, 2023

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs, 2023

  62. [62]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. May 2023

  63. [63]

    AlpacaFarm: A simulation framework for methods that learn from human feedback, 2023

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaFarm: A simulation framework for methods that learn from human feedback, 2023

  64. [64]

    Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance, 2023

    Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance, 2023

  65. [65]

    The false promise of imitating proprietary LLMs

    Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary LLMs. May 2023

  66. [66]

    Measuring massive multitask language understanding, 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021

  67. [67]

    LoRA: Low-rank adaptation of large language models, 2021

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021

  68. [68]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. October 2023

  69. [69]

    AlpacaEval: An automatic evaluator of instruction-following models, 2023

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models, 2023

  70. [70]

    TruthfulQA: Measuring how models mimic human falsehoods, 2022

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods, 2022

  71. [71]

    Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023

    MosaicML. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. URL https://www.mosaicml.com/blog/mpt-7b

  72. [72]

    GPT-4 technical report

    OpenAI. GPT-4 technical report. March 2023

  73. [73]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. ...

  74. [74]

    The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023

  75. [75]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. May 2023

  76. [76]

    ZeRO: Memory optimizations toward training trillion parameter models, 2020

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models, 2020

  77. [77]

    Multitask prompted training enables zero-shot task generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matte...

  78. [78]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. July 2017

  79. [79]

    Chateval: A tool for chatbot evaluation

    João Sedoc, Daphne Ippolito, Arun Kirubarajan, Jai Thirani, Lyle Ungar, and Chris Callison-Burch. Chateval: A tool for chatbot evaluation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 60-65. Association for Computational Linguistics, 2019. URL http://aclweb.o...

  80. [80]

    Alpaca: A strong, replicable instruction-following model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023

Showing first 80 references.