Recognition: 2 theorem links · Lean Theorem
Zephyr: Direct Distillation of LM Alignment
Pith reviewed 2026-05-16 10:10 UTC · model grok-4.3
The pith
Direct distillation of alignment via AI preference data produces a 7B chat model that beats larger RLHF models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models and surpasses Llama2-Chat-70B on MT-Bench.
What carries the argument
distilled direct preference optimization (dDPO), which optimizes the policy directly on preference pairs ranked by a teacher model without requiring online sampling.
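For reference, the objective this points to is the standard direct preference optimization loss of Rafailov et al., written here over AI-feedback pairs as the abstract describes; the β notation and the use of the dSFT checkpoint as the reference policy are standard DPO conventions rather than details stated in the abstract:

\[
\mathcal{L}_{\mathrm{dDPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_{\mathrm{AIF}}}
\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
\]

where (x, y_w, y_l) is a prompt with the teacher-preferred and dispreferred responses, π_ref is the dSFT model, and β scales the implicit KL penalty. Because both responses come from the fixed AIF dataset, no sampling from the policy is needed during fine-tuning.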
If this is right
- Alignment of language models becomes possible without collecting human preference data.
- Smaller models can achieve competitive or superior chat performance compared to much larger models trained with RLHF.
- The training process is efficient, completing in just a few hours on standard hardware.
- Base models can be quickly adapted to chat capabilities using only AI-generated rankings.
Where Pith is reading between the lines
- This method may allow alignment techniques to be applied iteratively with multiple teacher models to reduce bias.
- Similar distillation approaches could extend to other alignment objectives beyond chat, such as reasoning or safety.
- If teacher models improve, the quality of distilled alignment could scale without additional human effort.
Load-bearing premise
The preference rankings from the teacher model are sufficiently similar to what humans would provide for the same outputs.
What would settle it
Human evaluations on MT-Bench prompts comparing Zephyr-7B against human-feedback RLHF models such as Llama2-Chat-70B: if human raters score Zephyr-7B lower despite its GPT-4-judged win, the gains reflect evaluator matching rather than genuine alignment.
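A minimal sketch of how such a check could be scored, assuming paired human and GPT-4 preference labels are collected over the same response pairs; the file name and record schema below are hypothetical:

import json

def agreement_rate(records):
    """Fraction of pairs where the human judge and the AI judge pick the same winner.

    Each record is assumed to look like (hypothetical schema):
    {"prompt": ..., "response_a": ..., "response_b": ...,
     "human_pref": "a" or "b", "ai_pref": "a" or "b"}
    """
    agree = sum(1 for r in records if r["human_pref"] == r["ai_pref"])
    return agree / len(records)

def win_rate(records, key, model="a"):
    """Share of pairs in which `model` is preferred under the given judge column."""
    return sum(1 for r in records if r[key] == model) / len(records)

if __name__ == "__main__":
    # Hypothetical file of paired human/GPT-4 judgments over the same pairs.
    with open("paired_judgments.jsonl") as f:
        records = [json.loads(line) for line in f]
    print("human-AI agreement:", agreement_rate(records))
    print("win rate under human judge:", win_rate(records, "human_pref"))
    print("win rate under GPT-4 judge:", win_rate(records, "ai_pref"))

A large gap between the two win rates, together with low agreement, would support the evaluator-matching reading; close agreement would support the load-bearing premise.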
read the original abstract
We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to produce Zephyr-7B, a 7B LM with improved intent alignment by applying distilled direct preference optimization (dDPO) to AI Feedback (AIF) preference rankings from a teacher model on UltraFeedback. It reports SOTA chat benchmark performance for 7B models, surpassing Llama2-Chat-70B on MT-Bench, with no human annotation needed, and releases code, models, and data.
Significance. If the results hold, this work demonstrates an efficient distillation method for alignment using only AIF data, achieving strong benchmark performance without RLHF or human feedback. The public release of all artifacts is a key strength for reproducibility.
major comments (1)
- [Abstract] The claim of surpassing Llama2-Chat-70B on MT-Bench is load-bearing for the SOTA assertion, but without reported human-AIF correlation or controls for teacher bias in the evaluation (MT-Bench uses GPT-4 as judge), it is unclear if the gains reflect genuine alignment or evaluator matching.
minor comments (2)
- Provide more explicit details on the dDPO implementation, including the exact loss formulation and how it differs from standard DPO (a reference sketch of the standard DPO loss follows this list).
- [Abstract] Specify the size and identity of the teacher model used for AIF.
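For orientation, a minimal PyTorch-style sketch of the loss in question. This is the standard DPO loss; on the paper's description, the dDPO-specific ingredient is only that the chosen/rejected pair comes from teacher rankings and the reference model is the dSFT checkpoint. The tensor values below are illustrative:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities
    (sum of token log-probs) under the policy or the frozen dSFT
    reference model; `beta` scales the implicit KL penalty.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # -log sigmoid(logits), averaged over the batch
    return -F.logsigmoid(logits).mean()

# Usage sketch: log-probs would come from scoring the teacher-ranked
# (chosen, rejected) completions with the policy and the dSFT reference.
loss = dpo_loss(torch.tensor([-120.0]), torch.tensor([-135.0]),
                torch.tensor([-118.0]), torch.tensor([-130.0]))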
Simulated Author's Rebuttal
We thank the referee for the detailed review and for identifying a potential source of bias in our primary evaluation. We address the concern directly below and propose a targeted revision to improve clarity without altering the reported results.
read point-by-point responses
-
Referee: [Abstract] The claim of surpassing Llama2-Chat-70B on MT-Bench is load-bearing for the SOTA assertion, but without reported human-AIF correlation or controls for teacher bias in the evaluation (MT-Bench uses GPT-4 as judge), it is unclear if the gains reflect genuine alignment or evaluator matching.
Authors: We acknowledge that both the UltraFeedback preference data and the MT-Bench judgments rely on GPT-4, which could in principle favor models whose outputs align with GPT-4's stylistic preferences. To address this, we note that the performance lift of Zephyr-7B over its dSFT baseline (which uses identical data but no preference optimization) is observed consistently across MT-Bench, AlpacaEval, and the Open LLM Leaderboard, where the latter two employ different judging protocols. This differential improvement suggests that dDPO contributes alignment gains beyond simple teacher-style matching. We did not collect new human-AIF correlation statistics in this work, as the manuscript focuses on the efficiency of the distillation pipeline rather than re-validating AIF; existing literature (e.g., on UltraFeedback) already reports moderate-to-high correlation for similar setups. In the revised version we will (i) qualify the abstract claim to specify the MT-Bench protocol, (ii) add a short paragraph in the evaluation section discussing the shared GPT-4 source and the cross-benchmark consistency, and (iii) include a limitations note on the absence of fresh human correlation data.
Revision: partial
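To make the rebuttal's differential-improvement argument concrete, a small sketch of the check it describes; the scores below are placeholders, not the paper's reported numbers:

# Hypothetical dSFT vs. dDPO scores; the point is the sign of the lift on
# benchmarks that do and do not use GPT-4 as the judge.
scores = {
    "MT-Bench (GPT-4 judged)":          {"dSFT": 6.0,  "dDPO": 7.0},
    "AlpacaEval (GPT-4 judged)":        {"dSFT": 80.0, "dDPO": 88.0},
    "Open LLM Leaderboard (automatic)": {"dSFT": 60.0, "dDPO": 63.0},
}

lifts = {name: s["dDPO"] - s["dSFT"] for name, s in scores.items()}
consistent = all(delta > 0 for delta in lifts.values())
print(lifts)
print("dDPO lift consistent across judging protocols:", consistent)

If the lift vanished on the benchmarks that do not use GPT-4 as judge, the evaluator-matching explanation would gain weight.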
Circularity Check
No significant circularity; empirical pipeline uses external teacher and independent benchmarks
full rationale
The paper's core claim is an empirical result: training Zephyr-7B via dSFT followed by dDPO on AIF rankings from an external teacher model, then measuring performance on public chat benchmarks (MT-Bench, etc.). No derivation step reduces a prediction to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation whose validity is internal to the paper. The method is evaluated against external benchmarks and does not invoke uniqueness theorems or ansatzes that smuggle in the target result. Minor self-citations to prior alignment work exist but are not load-bearing for the reported performance numbers.
Axiom & Free-Parameter Ledger
free parameters (1)
- learning rate and batch size
axioms (1)
- domain assumption: AI Feedback preference rankings correlate sufficiently with human intent to produce aligned behavior
Forward citations
Cited by 19 Pith papers
-
ORPO: Monolithic Preference Optimization without Reference Model
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
-
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
-
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
-
IRIS: Interpolative R\'enyi Iterative Self-play for Large Language Model Fine-Tuning
IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...
-
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
-
Leveraging RAG for Training-Free Alignment of LLMs
RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while r...
-
DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training
DART raises difference-awareness accuracy from 39% to 68.8% on benchmarks while cutting harm-drift cases by 72.6% and improving real-world appropriate responses from 39.8% to 77.5%.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
-
RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
RankZephyr is a new open-source LLM that closes the effectiveness gap with GPT-4 for zero-shot listwise reranking while showing robustness to input ordering and document count.
-
Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents
Targeted prompting and system interventions enable local LLMs such as Llama 3.1 70B to exploit 83% of tested Linux privilege escalation vulnerabilities.
-
Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?
Continual pre-training on a German medical corpus lets 7B models close much of the performance gap with 24B general models on medical benchmarks, though merging introduces some language mixing and verbosity.
-
Calibrating Model-Based Evaluation Metrics for Summarization
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
-
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...
-
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Reference graph
Works this paper leans on
-
[1]
Proximal Policy Optimization Algorithms
Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg. Proximal Policy Optimization Algorithms. arXiv:1707.06347
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Lila: A Unified Benchmark for Mathematical Reasoning
Mishra, Swaroop and Finlayson, Matthew and Lu, Pan and Tang, Leonard and Welleck, Sean and Baral, Chitta and Rajpurohit, Tanmay and Tafjord, Oyvind and Sabharwal, Ashish and Clark, Peter and Kalyan, Ashwin. Lila: A Unified Benchmark for Mathematical Reasoning. arXiv:2210.17517
-
[3]
The False Promise of Imitating Proprietary LLMs
Gudibande, Arnav and Wallace, Eric and Snell, Charlie and Geng, Xinyang and Liu, Hao and Abbeel, Pieter and Levine, Sergey and Song, Dawn. The False Promise of Imitating Proprietary LLMs. arXiv:2305.15717
-
[4]
Measuring Massive Multitask Language Understanding
Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob. Measuring Massive Multitask Language Understanding. arXiv [cs.CY]
-
[5]
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. arXiv:2309.12284
-
[6]
GPT-J-6B : A 6 billion parameter autoregressive language model
Wang, Ben and Komatsuzaki, Aran. GPT-J-6B : A 6 billion parameter autoregressive language model
-
[7]
Alpaca: A strong, replicable instruction-following model
Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html
work page 2023
-
[8]
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
Ding, Ning and Chen, Yulin and Xu, Bokai and Qin, Yujia and Zheng, Zhi and Hu, Shengding and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. arXiv:2305.14233
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
OpenAI. GPT-4 Technical Report. arXiv:2303.08774
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
UltraFeedback : Boosting Language Models with High-quality Feedback
Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong. UltraFeedback : Boosting Language Models with High-quality Feedback. arXiv:2310.01377
-
[11]
Nijkamp, Erik and Xie, Tian and Hayashi, Hiroaki and Pang, Bo and Xia, Congying and Xing, Chen and Vig, Jesse and Yavuz, Semih and Laban, Philippe and Krause, Ben and Purushwalkam, Senthil and Niu, Tong and Kryściński, Wojciech and Murakhovs'ka, Lidiya and Choubey, Prafulla Kumar and Fabbri, Alex and Liu, Ye and Meng, Rui and Tu, Lifu and Bhat, Me...
-
[12]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P and Zhang, Hao and Gonzalez, Joseph E and Stoica, Ion. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Constitutional AI: Harmlessness from AI Feedback
Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and Chen, Carol and Olsson, Catherine and Olah, Christopher and Hernandez, Danny and Drain, Dawn and Ganguli, Deep and Li, Dustin and Tran-Johnson, Eli and Perez, Ethan an...
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Training language models to follow instructions with human feedback
Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, Ry...
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and Bikel, Dan and Blecher, Lukas and Ferrer, Cristian Canton and Chen, Moya and Cucurull, Guillem and Esiobu, David and Fernandes, Jude and Fu, Jeremy and Fu, Wenyi...
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Mistral 7B
Jiang, Albert Q and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and Lavaud, Lélio Renard and Lachaux, Marie-Anne and Stock, Pierre and Le Scao, Teven and Lavril, Thibaut and Wang, Thomas and Lacroix...
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Scaling Instruction-Finetuned Language Models
Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Yunxuan and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and Webson, Albert and Gu, Shixiang Shane and Dai, Zhuyun and Suzgun, Mirac and Chen, Xinyun and Chowdhery, Aakanksha and Castro-Ros, Alex and Pellat, Marie and Robinson, Kevin and V...
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Self-Instruct : Aligning Language Models with Self-Generated Instructions
Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A and Khashabi, Daniel and Hajishirzi, Hannaneh. Self-Instruct : Aligning Language Models with Self-Generated Instructions. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
-
[19]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and Joseph, Nicholas and Kadavath, Saurav and Kernion, Jackson and Conerly, Tom and El-Showk, Sheer and Elhage, Nelson and Hatfield-Dodds, Zac and Hernandez, Danny and Hume, Tristan and...
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
MAmmoTH: Building math generalist models through hybrid instruction tuning
Chen, Wenhu. MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv:2309.05653
-
[21]
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, Thomas Wolf. Open LLM Leaderboard
-
[22]
NEFTune : Noisy Embeddings Improve Instruction Finetuning
Jain, Neel and Chiang, Ping-Yeh and Wen, Yuxin and Kirchenbauer, John and Chu, Hong-Min and Somepalli, Gowthami and Bartoldson, Brian R and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom. NEFTune : Noisy Embeddings Improve Instruction Finetuning. arXiv:2310.05914
-
[23]
TRL : Transformer Reinforcement Learning
von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi. TRL : Transformer Reinforcement Learning. GitHub repository
-
[24]
Team, Xwin-Lm. Xwin-LM
-
[25]
Multitask Prompted Training Enables Zero-Shot Task Generalization
Sanh, Victor and Webson, Albert and Raffel, Colin and Bach, Stephen H and Sutawika, Lintang and Alyafeai, Zaid and Chaffin, Antoine and Stiegler, Arnaud and Le Scao, Teven and Raja, Arun and Dey, Manan and Saiful Bari, M and Xu, Canwen and Thakker, Urmish and Sharma, Shanya Sharma and Szczechla, Eliza and Kim, Taewoon and Chhablani, Gunjan and Nayak, Niha...
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Wizardlm: Empowering large language models to follow complex instructions
Xu, Can and Sun, Qingfeng and Zheng, Kai and Geng, Xiubo and Zhao, Pu and Feng, Jiazhan and Tao, Chongyang and Jiang, Daxin. Wizardlm: Empowering large language models to follow complex instructions. arXiv:2304.12244
-
[27]
AlpacaEval : An Automatic Evaluator of Instruction-following Models
Li, Xuechen and Zhang, Tianyi and Dubois, Yann and Taori, Rohan and Gulrajani, Ishaan and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. AlpacaEval : An Automatic Evaluator of Instruction-following Models. GitHub repository
-
[28]
A framework for few-shot language model evaluation
Gao, Leo and Tow, Jonathan and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and McDonell, Kyle and Muennighoff, Niklas and Phang, Jason and Reynolds, Laria and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy. A framework for few-shot language model evaluation
-
[29]
Think you have Solved Question Answering? Try ARC , the AI2 Reasoning Challenge
Clark, Peter and Cowhey, Isaac and Etzioni, Oren and Khot, Tushar and Sabharwal, Ashish and Schoenick, Carissa and Tafjord, Oyvind. Think you have Solved Question Answering? Try ARC , the AI2 Reasoning Challenge. arXiv [cs.AI]
-
[30]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Ermon, Stefano and Manning, Christopher D and Finn, Chelsea. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E and Stoica, Ion and Xing, Eric P. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
-
[32]
HellaSwag : Can a Machine Really Finish Your Sentence?
Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin. HellaSwag : Can a Machine Really Finish Your Sentence?. arXiv [cs.CL]
-
[33]
TruthfulQA : Measuring How Models Mimic Human Falsehoods
Lin, Stephanie and Hilton, Jacob and Evans, Owain. TruthfulQA : Measuring How Models Mimic Human Falsehoods. arXiv [cs.CL]
-
[34]
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, R \'e mi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Dram...
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[35]
Scaling Learning Algorithms Towards AI
Bengio, Yoshua and LeCun, Yann. Scaling Learning Algorithms Towards AI
-
[36]
A Fast Learning Algorithm for Deep Belief Nets
Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye. A Fast Learning Algorithm for Deep Belief Nets
- [37]
-
[38]
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. 2023
work page 2023
-
[39]
Releasing 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned and chat models. 2023
work page 2023
-
[40]
Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs. 2023
work page 2023
-
[41]
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. 2023
work page 2023
-
[42]
Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance. 2023
work page 2023
-
[43]
ChatEval: A Tool for Chatbot Evaluation
Sedoc, João and Ippolito, Daphne and Kirubarajan, Arun and Thirani, Jai and Ungar, Lyle and Callison-Burch, Chris. ChatEval: A Tool for Chatbot Evaluation. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). 2019
work page 2019
- [44]
-
[45]
Edward Beeching and Clémentine Fourrier and Nathan Habib and Sheon Han and Nathan Lambert and Nazneen Rajani and Omar Sanseviero and Lewis Tunstall and Thomas Wolf. Open LLM Leaderboard. 2023
work page 2023
-
[46]
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. 2020
work page 2020
-
[47]
LoRA: Low-Rank Adaptation of Large Language Models. 2021
work page 2021
-
[48]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Dao, Tri. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. 2023
-
[49]
QLoRA: Efficient Finetuning of Quantized LLMs. 2023
work page 2023
-
[50]
Go smol or go home
De Vries, Harm. Go smol or go home. 2023
-
[51]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. 2022
work page 2022
-
[52]
Together AI. Releasing 3b and 7b redpajama-incite family of models including base, instruction-tuned and chat models, 2023. URL https://together.ai/blog/redpajama-models-v1
work page 2023
-
[53]
Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...
work page 2022
-
[54]
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023
work page 2023
-
[55]
Vicuna: An Open-Source chatbot impressing GPT-4 with 90%* ChatGPT quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, Ion Stoica, and Eric P Xing. Vicuna: An Open-Source chatbot impressing GPT-4 with 90%* ChatGPT quality
-
[56]
Scaling Instruction-Finetuned language models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping H...
work page 2022
-
[57]
Think you have solved question answering? try ARC , the AI2 reasoning challenge, 2018
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC , the AI2 reasoning challenge, 2018
work page 2018
-
[58]
UltraFeedback : Boosting language models with high-quality feedback
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback : Boosting language models with high-quality feedback. October 2023
work page 2023
-
[59]
FlashAttention-2: Faster attention with better parallelism and work partitioning
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023
work page 2023
-
[60]
Harm De Vries. Go smol or go home, 2023. URL https://www.harmdevries.com/post/model-size-vs-compute-overhead/
work page 2023
-
[61]
Qlora: Efficient finetuning of quantized llms, 2023
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023
work page 2023
-
[62]
Enhancing chat language models by scaling high-quality instructional conversations
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. May 2023
work page 2023
- [63]
-
[64]
Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance, 2023
work page 2023
-
[65]
The false promise of imitating proprietary LLMs
Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary LLMs . May 2023
work page 2023
-
[66]
Measuring massive multitask language understanding, 2021
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021
work page 2021
-
[67]
Lora: Low-rank adaptation of large language models, 2021
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021
work page 2021
-
[68]
Mistral 7B
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. October 2023
work page 2023
-
[69]
AlpacaEval : An automatic evaluator of instruction-following models, 2023
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. AlpacaEval : An automatic evaluator of instruction-following models, 2023
work page 2023
-
[70]
TruthfulQA : Measuring how models mimic human falsehoods, 2022
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA : Measuring how models mimic human falsehoods, 2022
work page 2022
-
[71]
Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023
Mosaic ML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. URL https://www.mosaicml.com/blog/mpt-7b
work page 2023
- [72]
-
[73]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. ...
work page 2022
-
[74]
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023
work page 2023
-
[75]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. May 2023
work page 2023
-
[76]
Zero: Memory optimizations toward training trillion parameter models, 2020
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models, 2020
work page 2020
-
[77]
Multitask prompted training enables Zero-Shot task generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matte...
work page 2021
-
[78]
Proximal policy optimization algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. July 2017
work page 2017
-
[79]
Chateval: A tool for chatbot evaluation
João Sedoc, Daphne Ippolito, Arun Kirubarajan, Jai Thirani, Lyle Ungar, and Chris Callison-Burch. Chateval: A tool for chatbot evaluation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 60-65. Association for Computational Linguistics, 2019. URL http://aclweb.o...
work page 2019
-
[80]
Alpaca: A strong, replicable instruction-following model
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023
work page 2023