Scaling Instruction-Finetuned Language Models
Pith reviewed 2026-05-12 01:17 UTC · model grok-4.3
The pith
Instruction finetuning on 1.8K tasks plus chain-of-thought data lifts PaLM 540B performance by 9.4 percent on average.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Finetuning pretrained language models on a collection of 1.8K tasks phrased as instructions, combined with scaling model size and adding chain-of-thought data, produces models such as Flan-PaLM 540B, which outperforms the base PaLM 540B by 9.4 percent on average and reaches 75.2 percent on five-shot MMLU. Similar gains appear for the T5 and U-PaLM families across zero-shot, few-shot, and chain-of-thought evaluations.
What carries the argument
Instruction finetuning on a scaled set of datasets converted to instruction format, augmented by model size increases and chain-of-thought examples.
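To make "converted to instruction format" concrete, here is a minimal sketch of how one task example might be rendered as an instruction prompt, with or without few-shot exemplars and chain-of-thought rationales. The `Example` class, the `Q:`/`A:` layout, and the "So the answer is" phrasing are illustrative assumptions, not the templates used in the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Example:
    """One task instance (hypothetical schema, not the paper's)."""
    question: str
    answer: str
    rationale: Optional[str] = None  # chain-of-thought text, if available


def format_example(ex: Example, use_cot: bool, include_target: bool = True) -> str:
    """Render a single example; the template is an assumption for illustration."""
    text = f"Q: {ex.question}\nA:"
    if not include_target:
        return text  # the model is expected to continue from here
    if use_cot and ex.rationale is not None:
        return f"{text} {ex.rationale} So the answer is {ex.answer}."
    return f"{text} {ex.answer}"


def build_prompt(
    instruction: str,
    query: Example,
    exemplars: Sequence[Example] = (),
    use_cot: bool = False,
) -> str:
    """Zero-shot when `exemplars` is empty, few-shot otherwise."""
    blocks = [instruction]
    blocks += [format_example(e, use_cot) for e in exemplars]
    blocks.append(format_example(query, use_cot, include_target=False))
    return "\n\n".join(blocks)
```

Calling `build_prompt` with no exemplars gives a zero-shot prompt. Passing exemplars and `use_cot=True` gives a few-shot chain-of-thought prompt, so a single function covers the prompting setups the review lists.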
If this is right
- The gains appear for PaLM, T5, and U-PaLM model families.
- Improvements occur under zero-shot, few-shot, and chain-of-thought prompting.
- Results advance on MMLU, BBH, TyDiQA, MGSM, and open-ended generation benchmarks.
- Released Flan-T5 checkpoints achieve strong few-shot performance even compared to much larger models such as PaLM 62B.
Where Pith is reading between the lines
- This style of finetuning could lower the amount of prompt crafting needed for new applications.
- Smaller models trained the same way might narrow the performance gap with much larger base models.
- Further increases in task diversity could produce even stronger generalization to unseen instructions.
- Public checkpoints make it easier to test whether the same recipe works for other base models or languages.
Load-bearing premise
The particular 1.8K tasks and the chosen benchmarks capture enough of the space of possible instructions and real-world uses that the observed gains will hold for other tasks and settings.
What would settle it
Repeating the finetuning procedure on a fresh collection of 1.8K tasks drawn from unrelated domains and seeing no average gain on the original benchmarks or the new tasks would show the benefits do not generalize.
Original abstract
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that scaling instruction finetuning to 1.8K tasks, larger model sizes, and chain-of-thought data yields consistent performance gains across model families (PaLM, T5, U-PaLM), prompting setups (zero/few-shot, CoT), and benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). Key results include Flan-PaLM 540B outperforming base PaLM 540B by +9.4% on average and achieving 75.2% on five-shot MMLU, with public release of Flan-T5 checkpoints that show strong few-shot performance relative to larger models.
Significance. If the empirical results hold, this work demonstrates that instruction finetuning is a general, scalable method for improving pretrained language model performance and usability without task-specific adaptation. The consistent gains across multiple architectures and the release of Flan-T5 checkpoints are particular strengths, enabling direct verification and community follow-up on smaller scales. This advances practical techniques for model alignment and generalization.
major comments (1)
- [Experiments] Experiments section (around the 1.8K-task scaling results): the central claim that scaling the number of tasks 'dramatically improves' performance would be strengthened by an explicit ablation or curve showing performance as a function of task count (e.g., up to 1.8K), as the current reporting focuses on the final scale without isolating the contribution of task volume versus other factors like CoT data.
minor comments (3)
- [Abstract] Abstract: the phrase 'dramatically improves' is subjective; replace it with a more measured wording such as 'substantially improves', or state the gain directly using the concrete deltas reported elsewhere (e.g., the +9.4% average).
- [Results] Results tables: ensure all reported averages (e.g., the +9.4% figure) are accompanied by the exact list of benchmarks included in the average so readers can assess consistency.
- [Introduction] Notation: define 'Flan-PaLM' and 'Flan-T5' explicitly on first use, including the distinction from base models and the role of the instruction collection.
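The minor comment on reported averages can be illustrated with a short sketch: compute the average gain from per-benchmark scores and return the per-benchmark deltas alongside it, so that the list of benchmarks behind the average is always explicit. The function name, the unweighted mean, and the scores in the usage note are assumptions for illustration; they are not the paper's evaluation code or results.

```python
from statistics import mean
from typing import Dict, Tuple


def average_gain(
    base: Dict[str, float], tuned: Dict[str, float]
) -> Tuple[float, Dict[str, float]]:
    """Return the unweighted mean gain and the per-benchmark deltas.

    Both inputs map benchmark name -> score in percentage points.
    Requiring identical keys prevents averaging over mismatched suites.
    """
    if base.keys() != tuned.keys():
        missing = sorted(base.keys() ^ tuned.keys())
        raise ValueError(f"benchmark sets differ: {missing}")
    if not base:
        raise ValueError("no benchmarks given")
    deltas = {name: tuned[name] - base[name] for name in sorted(base)}
    return mean(deltas.values()), deltas
```

For example, with placeholder scores `{"bench_a": 50.0, "bench_b": 40.0}` for the base model and `{"bench_a": 60.0, "bench_b": 44.0}` for the finetuned one, the function returns an average gain of 7.0 points together with deltas of 10.0 and 4.0. Reporting the deltas next to the average shows whether the gain is spread evenly or driven by a single benchmark.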
Simulated Author's Rebuttal
We thank the referee for the positive assessment, recognition of the work's significance, and recommendation for minor revision. We address the single major comment below.
Point-by-point responses
-
Referee: [Experiments] Experiments section (around the 1.8K-task scaling results): the central claim that scaling the number of tasks 'dramatically improves' performance would be strengthened by an explicit ablation or curve showing performance as a function of task count (e.g., up to 1.8K), as the current reporting focuses on the final scale without isolating the contribution of task volume versus other factors like CoT data.
Authors: We agree that an explicit scaling curve with respect to task count would strengthen the central claim and better isolate the contribution of task volume from factors such as CoT data. In the revised manuscript we add a new figure (in the Experiments section) that reports average performance across the evaluation suite as a function of the number of finetuning tasks. The curve is generated by training on random subsets of the 1.8K tasks at multiple scales (e.g., 100, 500, 1K, 1.8K) while holding model size and CoT inclusion fixed; separate curves are shown for the CoT and non-CoT settings. The added analysis confirms consistent gains with increasing task count and is now referenced in the main text when discussing the 1.8K-task results.
Revision: yes
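The subset sampling described in the response could be set up along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, and drawing nested subsets (each smaller set contained in the next larger one) is a design choice made here to reduce variance between points on the curve. The rebuttal only says "random subsets" and does not specify nesting.

```python
import random
from typing import Dict, List, Sequence


def nested_task_subsets(
    task_names: Sequence[str], sizes: Sequence[int], seed: int = 0
) -> Dict[int, List[str]]:
    """Sample one random task ordering and take prefixes of it.

    Prefixes of a single shuffle are nested, so the 100-task subset is
    contained in the 500-task subset, and so on (an assumption of this
    sketch, not something stated in the rebuttal).
    """
    ordered_sizes = sorted(set(sizes))
    if not ordered_sizes:
        return {}
    if ordered_sizes[0] < 0 or ordered_sizes[-1] > len(task_names):
        raise ValueError("subset sizes must lie between 0 and the task count")
    rng = random.Random(seed)  # fixed seed keeps the subsets reproducible
    shuffled = list(task_names)
    rng.shuffle(shuffled)
    return {k: shuffled[:k] for k in ordered_sizes}
```

With 1,800 placeholder task names and sizes `[100, 500, 1000, 1800]`, each returned subset would be used for one finetuning run with model size and CoT inclusion held fixed, which gives one point per size on the scaling curve. Repeating the procedure with several seeds would give a rough estimate of how much the curve depends on which tasks were drawn.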
Circularity Check
No significant circularity
The paper reports direct empirical measurements of instruction finetuning on 1.8K tasks across model scales, with performance deltas computed on held-out benchmarks (MMLU, BBH, etc.). No derivations, equations, or first-principles predictions are claimed; results are obtained by training and evaluating on standard splits. Self-citations to prior FLAN work exist but are not load-bearing for the scaling claims, which rest on new experimental data rather than reducing to fitted parameters or self-referential definitions.
Forward citations
Cited by 60 Pith papers
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
Mind2Web: Towards a Generalist Agent for the Web
Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.
-
Instruction Tuning with GPT-4
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
Reflective Flow Sampling Enhancement
RF-Sampling enhances flow matching models by implicitly performing gradient ascent on text-image alignment scores via linear textual combinations and flow inversion.
-
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
-
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
-
MetaLint: Easy-to-Hard Generalization for Code Linting
MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.
-
Chronos: Learning the Language of Time Series
Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
-
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
VisualWebArena benchmark demonstrates that state-of-the-art multimodal agents still exhibit significant limitations on visually grounded web tasks.
-
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
-
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
-
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
LIMA: Less Is More for Alignment
Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.
-
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Instruction tuning of BLIP-2 with an instruction-aware Query Transformer delivers state-of-the-art zero-shot performance on held-out vision-language datasets and strong finetuned results on downstream tasks.
-
VideoChat: Chat-Centric Video Understanding
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
-
WizardLM: Empowering large pre-trained language models to follow complex instructions
WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...
-
Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection
Weasel is a trajectory selection method that optimizes importance-diversity for offline web-agent training, improving out-of-domain generalization and delivering 9.7-12.5x speedups on AgentTrek, NNetNav, WebArena, Wor...
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...
-
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...
-
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
-
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.
-
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and fina...
-
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluat...
-
GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2
GraphMend uses two Jaseci-based code transformations to eliminate dynamic-control-flow and side-effect graph breaks in PyTorch 2, reducing breaks to zero in six of eight Hugging Face models and yielding up to 75% late...
-
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
-
A Survey on Vision-Language-Action Models for Embodied AI
This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
-
Laissez-Faire Harms: Algorithmic Biases in Generative Language Models
Generative LMs in laissez-faire open-ended prompting settings disproportionately generate subordinated portrayals of minoritized race, gender, and sexual orientation identities at rates hundreds to thousands of times ...
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
-
Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection
Systematic LLM evaluation for news framing detection reveals prompt sensitivity and emotional-language bias, introduces an out-of-domain headline dataset, and shows cross-model consensus aids annotation auditing.
-
GPT-4V(ision) is a Generalist Web Agent, if Grounded
GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
-
An Embodied Generalist Agent in 3D World
LEO is an embodied generalist agent that performs 3D captioning, question answering, reasoning, navigation, and manipulation after 3D vision-language alignment followed by vision-language-action instruction tuning on ...
-
Zephyr: Direct Distillation of LM Alignment
Zephyr-7B achieves state-of-the-art chat benchmark results among 7B models by distilling alignment via dDPO on AI feedback preferences, surpassing the 70B Llama-2-Chat model on MT-Bench with no human data required.
-
SALMONN: Towards Generic Hearing Abilities for Large Language Models
SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.
-
Aligning Large Multimodal Models with Factually Augmented RLHF
Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
-
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
-
Simple synthetic data reduces sycophancy in large language models
Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
-
MiniLLM: On-Policy Distillation of Large Language Models
MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
-
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
The False Promise of Imitating Proprietary LLMs
Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.