arxiv: 2210.11416 · v5 · submitted 2022-10-20 · 💻 cs.LG · cs.CL

Scaling Instruction-Finetuned Language Models

Hyung Won Chung , Le Hou , Shayne Longpre , Barret Zoph , Yi Tay , William Fedus , Yunxuan Li , Xuezhi Wang , Mostafa Dehghani , Siddhartha Brahma , Albert Webson , Shixiang Shane Gu , Zhuyun Dai , Mirac Suzgun , Xinyun Chen , Aakanksha Chowdhery , Alex Castro-Ros , Marie Pellat , Kevin Robinson , Dasha Valter , Sharan Narang , Gaurav Mishra , Adams Yu , Vincent Zhao , Yanping Huang , Andrew Dai , Hongkun Yu , Slav Petrov , Ed H. Chi , Jeff Dean , Jacob Devlin , Adam Roberts , Denny Zhou , Quoc V. Le , Jason Wei

Pith reviewed 2026-05-12 01:17 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords: instruction finetuning, language models, chain-of-thought, few-shot learning, scaling, generalization, MMLU, PaLM

The pith

Instruction finetuning on 1.8K tasks plus chain-of-thought data lifts PaLM 540B performance by 9.4 percent on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether finetuning language models on datasets rewritten as natural language instructions improves their ability to handle new tasks. It scales this process by using more tasks, bigger models, and examples that require step-by-step reasoning. The largest model tested reaches new highs on several standard tests and shows gains whether used with no examples, a few examples, or reasoning chains. These patterns hold across different base models and suggest a straightforward way to make pretrained systems respond better to varied user requests.

Core claim

Finetuning pretrained language models on a collection of 1.8K tasks phrased as instructions, combined with scaling model size and adding chain-of-thought data, produces models such as Flan-PaLM 540B that outperform the base PaLM 540B by 9.4 percent on average and reach 75.2 percent on five-shot MMLU, with similar gains appearing for T5 and U-PaLM families across zero-shot, few-shot, and chain-of-thought evaluations.

What carries the argument

Instruction finetuning on a scaled set of datasets converted to instruction format, augmented by model size increases and chain-of-thought examples.
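As a rough illustration of what "datasets converted to instruction format" can mean, the sketch below rewrites one labelled record as an (input, target) text pair using a randomly chosen template. The templates, the label mapping, and the function name `to_instruction_example` are hypothetical and chosen for illustration; they are not the templates or code used in the paper.

```python
import random

# Hypothetical instruction templates for a natural language inference task.
# Illustrative only: these are NOT the templates used in the paper.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no, or maybe.",
    "Read the text and decide whether the hypothesis follows.\n"
    "{premise}\nHypothesis: {hypothesis}\nOptions: yes, no, maybe",
]

# Assumed mapping from integer labels to answer strings.
LABELS = {0: "yes", 1: "maybe", 2: "no"}


def to_instruction_example(record, templates, rng):
    """Turn one raw labelled record into an (input, target) text pair."""
    template = rng.choice(templates)
    return {
        "input": template.format(
            premise=record["premise"], hypothesis=record["hypothesis"]
        ),
        "target": LABELS[record["label"]],
    }


rng = random.Random(0)  # fixed seed so the template choice is repeatable
raw = {
    "premise": "A dog runs across the field.",
    "hypothesis": "An animal is moving.",
    "label": 0,
}
example = to_instruction_example(raw, NLI_TEMPLATES, rng)
print(example["input"])
print(example["target"])
```

Using several templates per dataset is one plausible way to expose the model to varied phrasings of the same task; how many templates to use and how to word them is a design choice left open here.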

If this is right

The gains appear for PaLM, T5, and U-PaLM model families.
Improvements occur under zero-shot, few-shot, and chain-of-thought prompting.
Results advance on MMLU, BBH, TyDiQA, MGSM, and open-ended generation benchmarks.
Released Flan-T5 checkpoints match or exceed much larger models on few-shot tasks.
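The three prompting setups named above differ only in how the evaluation prompt is assembled. A minimal sketch, assuming a simple "Q:/A:" layout and a hypothetical helper `build_prompt` (neither is taken from the paper):

```python
def build_prompt(question, exemplars=(), use_cot=False):
    """Assemble an evaluation prompt.

    exemplars: sequence of dicts with "question" and "answer" keys, plus a
        "rationale" key when use_cot is True. Empty means zero-shot.
    use_cot: if True, each exemplar shows its rationale before the answer.
    """
    parts = []
    for ex in exemplars:
        if use_cot:
            shown = f"{ex['rationale']} So the answer is {ex['answer']}."
        else:
            shown = ex["answer"]
        parts.append(f"Q: {ex['question']}\nA: {shown}")
    # The test question is always last, with the answer left open.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Toy exemplars, invented for illustration.
SHOTS = [
    {"question": "What is 1 + 3?", "rationale": "1 plus 3 equals 4.", "answer": "4"},
    {"question": "What is 2 + 2?", "rationale": "2 plus 2 equals 4.", "answer": "4"},
]

zero_shot = build_prompt("What is 2 + 3?")
few_shot = build_prompt("What is 2 + 3?", exemplars=SHOTS)
cot = build_prompt("What is 2 + 3?", exemplars=SHOTS, use_cot=True)
print(cot)
```

A five-shot evaluation in this sketch would simply pass five exemplars; the exact exemplar format for any given benchmark is an assumption here.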

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This style of finetuning could lower the amount of prompt crafting needed for new applications.
Smaller models trained the same way might narrow the performance gap with much larger base models.
Further increases in task diversity could produce even stronger generalization to unseen instructions.
Public checkpoints make it easier to test whether the same recipe works for other base models or languages.

Load-bearing premise

The particular 1.8K tasks and the chosen benchmarks capture enough of the space of possible instructions and real-world uses that the observed gains will hold for other tasks and settings.

What would settle it

Repeating the finetuning procedure on a fresh collection of 1.8K tasks drawn from unrelated domains and seeing no average gain on the original benchmarks or the new tasks would show the benefits do not generalize.

Original abstract

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Scaling instruction finetuning to 1.8K tasks and 540B models produces clear, consistent gains on standard benchmarks, with Flan-T5 checkpoints released for verification.

The main takeaway is that instruction finetuning scales effectively: adding more tasks, chain-of-thought data, and larger models improves performance across zero-shot, few-shot, and CoT prompting on benchmarks like MMLU, BBH, and MGSM. Flan-PaLM 540B beats the base PaLM 540B by 9.4% on average and reaches 75.2% on five-shot MMLU, which the paper reports as state of the art. The paper tests this across the PaLM, T5, and U-PaLM families, which strengthens the claim that the gains are not tied to one architecture. Releasing the Flan-T5 checkpoints is useful because it lets others check the smaller-model results directly.

The empirical pattern holds up on the reported numbers and setups. The task mixture and evaluation benchmarks are specific choices, however, so the results show what works on these held-out sets rather than proving broad generalization to any real-world instruction. Training details for the largest runs are not fully public, which limits exact replication at PaLM scale, but the smaller models provide a concrete starting point.

This work is aimed at researchers and practitioners who train or adapt language models and want practical evidence on how to boost usability without changing the base pretraining. The results are solid enough on their own terms to merit peer review, even if some readers will want more on the exact task selection process or longer-term generalization tests.

Referee Report

1 major / 3 minor

Summary. The paper claims that scaling instruction finetuning to 1.8K tasks, larger model sizes, and chain-of-thought data yields consistent performance gains across model families (PaLM, T5, U-PaLM), prompting setups (zero/few-shot, CoT), and benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). Key results include Flan-PaLM 540B outperforming base PaLM 540B by +9.4% on average and achieving 75.2% on five-shot MMLU, with public release of Flan-T5 checkpoints that show strong few-shot performance relative to larger models.

Significance. If the empirical results hold, this work demonstrates that instruction finetuning is a general, scalable method for improving pretrained language model performance and usability without task-specific adaptation. The consistent gains across multiple architectures and the release of Flan-T5 checkpoints are particular strengths, enabling direct verification and community follow-up on smaller scales. This advances practical techniques for model alignment and generalization.

major comments (1)

[Experiments] Experiments section (around the 1.8K-task scaling results): the central claim that scaling the number of tasks 'dramatically improves' performance would be strengthened by an explicit ablation or curve showing performance as a function of task count (e.g., up to 1.8K), as the current reporting focuses on the final scale without isolating the contribution of task volume versus other factors like CoT data.

minor comments (3)

[Abstract] Abstract: the phrase 'dramatically improves' is subjective; replace it with a more measured phrase such as 'substantially improves', or with a quantified statement, to match the concrete deltas reported elsewhere.
[Results] Results tables: ensure all reported averages (e.g., the +9.4% figure) are accompanied by the exact list of benchmarks included in the average so readers can assess consistency.
[Introduction] Notation: define 'Flan-PaLM' and 'Flan-T5' explicitly on first use, including the distinction from base models and the role of the instruction collection.
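The second minor comment can be made concrete with a small sketch: report the list of benchmarks alongside the averaged gain so the reader can see exactly what the headline number covers. The scores below are placeholders invented for illustration and are NOT results from the paper; the unweighted mean over per-benchmark deltas is likewise an assumption about how such an average might be formed.

```python
from statistics import mean

# Placeholder scores for illustration only; NOT numbers from the paper.
baseline = {"MMLU": 50.0, "BBH": 40.0, "TyDiQA": 45.0, "MGSM": 20.0}
finetuned = {"MMLU": 60.0, "BBH": 45.0, "TyDiQA": 55.0, "MGSM": 30.0}


def average_gain(before, after):
    """Return (benchmarks, per-benchmark deltas, unweighted mean delta).

    Only benchmarks present in both dicts are included, and the list is
    returned so it can be printed next to the average.
    """
    benchmarks = sorted(before.keys() & after.keys())
    if not benchmarks:
        raise ValueError("no benchmarks in common")
    deltas = {name: after[name] - before[name] for name in benchmarks}
    return benchmarks, deltas, mean(deltas.values())


benchmarks, deltas, avg = average_gain(baseline, finetuned)
print("benchmarks in average:", ", ".join(benchmarks))
for name in benchmarks:
    print(f"  {name}: {deltas[name]:+.1f}")
print(f"average gain: {avg:+.2f}")
```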

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, recognition of the work's significance, and recommendation for minor revision. We address the single major comment below.

Referee: [Experiments] Experiments section (around the 1.8K-task scaling results): the central claim that scaling the number of tasks 'dramatically improves' performance would be strengthened by an explicit ablation or curve showing performance as a function of task count (e.g., up to 1.8K), as the current reporting focuses on the final scale without isolating the contribution of task volume versus other factors like CoT data.

Authors: We agree that an explicit scaling curve with respect to task count would strengthen the central claim and better isolate the contribution of task volume from factors such as CoT data. In the revised manuscript we add a new figure (in the Experiments section) that reports average performance across the evaluation suite as a function of the number of finetuning tasks. The curve is generated by training on random subsets of the 1.8K tasks at multiple scales (e.g., 100, 500, 1K, 1.8K) while holding model size and CoT inclusion fixed; separate curves are shown for the CoT and non-CoT settings. The added analysis confirms consistent gains with increasing task count and is now referenced in the main text when discussing the 1.8K-task results.

Revision: yes
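The subset sampling described in this response could be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the task IDs are synthetic, 1,800 stands in for "1.8K", and the subsets are made nested (each smaller set is contained in the next larger one), which is one possible design choice for keeping points on the curve comparable.

```python
import random


def nested_task_subsets(task_ids, sizes, seed=0):
    """Return {size: subset}, where each smaller subset is a prefix of the
    same shuffled task list, so subsets are nested by construction."""
    rng = random.Random(seed)
    shuffled = list(task_ids)
    rng.shuffle(shuffled)
    subsets = {}
    for size in sorted(sizes):
        if size > len(shuffled):
            raise ValueError(f"requested {size} tasks, only {len(shuffled)} available")
        subsets[size] = shuffled[:size]
    return subsets


# Synthetic task IDs; 1800 is a stand-in for the "1.8K" tasks.
all_tasks = [f"task_{i:04d}" for i in range(1800)]
subsets = nested_task_subsets(all_tasks, sizes=[100, 500, 1000, 1800])

for size, tasks in subsets.items():
    print(size, len(tasks))
```

Each subset would then be used for a separate finetuning run with model size and CoT inclusion held fixed; repeating the sampling with several seeds would give a sense of variance across subsets.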

Circularity Check

0 steps flagged

No significant circularity

The paper reports direct empirical measurements of instruction finetuning on 1.8K tasks across model scales, with performance deltas computed on held-out benchmarks (MMLU, BBH, etc.). No derivations, equations, or first-principles predictions are claimed; results are obtained by training and evaluating on standard splits. Self-citations to prior FLAN work exist but are not load-bearing for the scaling claims, which rest on new experimental data rather than reducing to fitted parameters or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical scaling study. No mathematical axioms, free parameters fitted to the central performance claims, or invented entities are introduced; standard ML training hyperparameters are used but do not constitute load-bearing free parameters for the reported gains.

pith-pipeline@v0.9.0 · 5669 in / 1213 out tokens · 48038 ms · 2026-05-12T01:17:46.327451+00:00 · methodology

Forward citations

Cited by 45 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Instruction Tuning with GPT-4
cs.CL 2023-04 unverdicted novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
Chronos: Learning the Language of Time Series
cs.LG 2024-03 conditional novelty 7.0

Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
cs.CL 2023-07 unverdicted novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
Voyager: An Open-Ended Embodied Agent with Large Language Models
cs.AI 2023-05 unverdicted novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
cs.CV 2023-05 conditional novelty 7.0

Instruction tuning of BLIP-2 with an instruction-aware Query Transformer delivers state-of-the-art zero-shot performance on held-out vision-language datasets and strong finetuned results on downstream tasks.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
WizardLM: Empowering large pre-trained language models to follow complex instructions
cs.CL 2023-04 conditional novelty 7.0

WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
cs.CV 2023-01 unverdicted novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 6.0

TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
cs.LG 2026-05 unverdicted novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
cs.CL 2026-04 unverdicted novelty 6.0

POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
cs.AI 2026-04 unverdicted novelty 6.0

Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
cs.AI 2026-04 unverdicted novelty 6.0

ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.
Towards an AI co-scientist
cs.AI 2025-02 unverdicted novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
MiniLLM: On-Policy Distillation of Large Language Models
cs.CL 2023-06 conditional novelty 6.0

MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
Gorilla: Large Language Model Connected with Massive APIs
cs.CL 2023-05 conditional novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
Otter: A Multi-Modal Model with In-Context Instruction Tuning
cs.CV 2023-05 unverdicted novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
cs.CV 2023-04 conditional novelty 6.0

MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
cs.CL 2023-03 unverdicted novelty 6.0

HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
Multimodal Chain-of-Thought Reasoning in Language Models
cs.CL 2023-02 accept novelty 6.0

Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
cs.CL 2022-11 unverdicted novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling
cs.CV 2026-05 unverdicted novelty 5.0

A closed-loop system couples LLM-based 3D scene generation with RL optimization and VR user interactions to produce adaptive, immersive environments, claiming SOTA results on the ALFRED benchmark.
Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 5.0

Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling
cs.CL 2026-04 unverdicted novelty 5.0

Marco-MoE delivers open multilingual MoE models with 5% activation sparsity that outperform similarly sized dense models on English and multilingual benchmarks through efficient upcycling.
Testing the Assumptions of Active Learning for Translation Tasks with Few Samples
cs.CL 2026-04 unverdicted novelty 5.0

Informativeness and diversity of samples selected by active learning show no correlation with test performance on translation tasks using few samples; ordering and pre-training effects dominate instead.
Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning
cs.CL 2026-04 unverdicted novelty 5.0

Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.
Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models
cs.AI 2026-04 unverdicted novelty 5.0

JCQL uses an SLM-trained KBC model as an action in an LLM agent for KBQA to reduce hallucinations, then fine-tunes the KBC model with KBQA reasoning paths, outperforming baselines on two benchmarks.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
PaLM 2 Technical Report
cs.CL 2023-05 unverdicted novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
StarCoder: may the source be with you!
cs.CL 2023-05 accept novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Yi: Open Foundation Models by 01.AI
cs.CL 2024-03 unverdicted novelty 4.0

Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
Improved Baselines with Visual Instruction Tuning
cs.CV 2023-10 conditional novelty 4.0

Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.