Scaling Instruction-Finetuned Language Models

arXiv: 2210.11416 · v5 · submitted 2022-10-20 · cs.LG · cs.CL

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei

Pith reviewed 2026-05-12 01:17 UTC · model grok-4.3

keywords: instruction finetuning, language models, chain-of-thought, few-shot learning, scaling, generalization, MMLU, PaLM

The pith

Instruction finetuning on 1.8K tasks plus chain-of-thought data lifts PaLM 540B performance by 9.4 percent on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether finetuning language models on datasets rewritten as natural language instructions improves their ability to handle new tasks. It scales this process by using more tasks, bigger models, and examples that require step-by-step reasoning. The largest model tested reaches new highs on several standard tests and shows gains whether used with no examples, a few examples, or reasoning chains. These patterns hold across different base models and suggest a straightforward way to make pretrained systems respond better to varied user requests.

Core claim

Finetuning pretrained language models on a collection of 1.8K tasks phrased as instructions, combined with scaling model size and adding chain-of-thought data, produces models such as Flan-PaLM 540B that outperform the base PaLM 540B by 9.4 percent on average and reach 75.2 percent on five-shot MMLU, with similar gains appearing for T5 and U-PaLM families across zero-shot, few-shot, and chain-of-thought evaluations.

What carries the argument

Instruction finetuning on a scaled set of datasets converted to instruction format, augmented by model size increases and chain-of-thought examples.
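
To make "converted to instruction format" concrete, here is a minimal sketch of how one dataset example might be rendered as an input/target text pair, with or without few-shot exemplars and with or without a chain-of-thought rationale in the target. The `Example` fields, the `to_instruction_pair` helper, and the "Q:/A:" template are all assumptions made for illustration; they are not the templates used in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Example:
    """One supervised example from a source dataset (hypothetical fields)."""
    question: str
    answer: str
    rationale: Optional[str] = None  # step-by-step reasoning, if available


def to_instruction_pair(
    ex: Example,
    instruction: str,
    exemplars: Optional[List[Example]] = None,
    use_cot: bool = False,
) -> Tuple[str, str]:
    """Render an example as an (input, target) text pair.

    exemplars: solved examples placed before the query (few-shot format).
    use_cot:   if True, targets contain the rationale before the answer.
    """
    def render_target(e: Example) -> str:
        # Fall back to the bare answer when no rationale exists.
        if use_cot and e.rationale:
            return f"{e.rationale} So the answer is {e.answer}."
        return e.answer

    parts = [instruction]
    for shot in exemplars or []:
        parts.append(f"Q: {shot.question}\nA: {render_target(shot)}")
    parts.append(f"Q: {ex.question}\nA:")
    return "\n\n".join(parts), render_target(ex)
```

Calling `to_instruction_pair(ex, "Answer the question.", use_cot=True)` yields a zero-shot input whose target includes the rationale; passing `exemplars=[...]` produces the few-shot variant of the same example. Mixing these variants in one training set is one plausible way to cover the prompting setups the summary lists.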

If this is right

The gains appear for PaLM, T5, and U-PaLM model families.
Improvements occur under zero-shot, few-shot, and chain-of-thought prompting.
Results advance on MMLU, BBH, TyDiQA, MGSM, and open-ended generation benchmarks.
Released Flan-T5 checkpoints match or exceed much larger models on few-shot tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This style of finetuning could lower the amount of prompt crafting needed for new applications.
Smaller models trained the same way might narrow the performance gap with much larger base models.
Further increases in task diversity could produce even stronger generalization to unseen instructions.
Public checkpoints make it easier to test whether the same recipe works for other base models or languages.

Load-bearing premise

The particular 1.8K tasks and the chosen benchmarks capture enough of the space of possible instructions and real-world uses that the observed gains will hold for other tasks and settings.

What would settle it

Repeating the finetuning procedure on a fresh collection of 1.8K tasks drawn from unrelated domains and seeing no average gain on the original benchmarks or the new tasks would show the benefits do not generalize.

Original abstract

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Scaling instruction finetuning to 1.8K tasks and 540B models produces clear, consistent gains on standard benchmarks, with Flan-T5 checkpoints released for verification.

The main takeaway is that instruction finetuning scales effectively: adding more tasks, chain-of-thought data, and larger models improves performance across zero-shot, few-shot, and CoT prompting on benchmarks such as MMLU, BBH, and MGSM. Flan-PaLM 540B beats the base PaLM 540B by 9.4% on average and reaches 75.2% on five-shot MMLU, which is new relative to earlier Flan work. The paper tests this across the PaLM, T5, and U-PaLM families, which strengthens the claim that the gains are not tied to one architecture. Releasing the Flan-T5 checkpoints is useful because it lets others check the smaller-model results directly.

The empirical pattern is consistent across the reported numbers and setups. The task mixture and evaluation benchmarks are specific choices, however, so the results show what works on these held-out sets rather than proving broad generalization to arbitrary real-world instructions. Training details for the largest runs are not fully public, which limits exact replication at PaLM scale, but the smaller models provide a concrete starting point.

This work is aimed at researchers and practitioners who train or adapt language models and want practical evidence on how to boost usability without changing the base pretraining. The results are solid enough on their own terms to merit peer review, even if some readers will want more on the exact task selection process or on longer-term generalization tests.

Referee Report

1 major / 3 minor

Summary. The paper claims that scaling instruction finetuning to 1.8K tasks, larger model sizes, and chain-of-thought data yields consistent performance gains across model families (PaLM, T5, U-PaLM), prompting setups (zero/few-shot, CoT), and benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). Key results include Flan-PaLM 540B outperforming base PaLM 540B by +9.4% on average and achieving 75.2% on five-shot MMLU, with public release of Flan-T5 checkpoints that show strong few-shot performance relative to larger models.

Significance. If the empirical results hold, this work demonstrates that instruction finetuning is a general, scalable method for improving pretrained language model performance and usability without task-specific adaptation. The consistent gains across multiple architectures and the release of Flan-T5 checkpoints are particular strengths, enabling direct verification and community follow-up on smaller scales. This advances practical techniques for model alignment and generalization.

major comments (1)

[Experiments] Experiments section (around the 1.8K-task scaling results): the central claim that scaling the number of tasks 'dramatically improves' performance would be strengthened by an explicit ablation or curve showing performance as a function of task count (e.g., up to 1.8K), as the current reporting focuses on the final scale without isolating the contribution of task volume versus other factors like CoT data.

minor comments (3)

[Abstract] Abstract: the phrase 'dramatically improves' is subjective; replace it with a quantified statement (for example, one that cites the +9.4% average gain) to match the concrete deltas reported elsewhere.
[Results] Results tables: ensure all reported averages (e.g., the +9.4% figure) are accompanied by the exact list of benchmarks included in the average so readers can assess consistency.
[Introduction] Notation: define 'Flan-PaLM' and 'Flan-T5' explicitly on first use, including the distinction from base models and the role of the instruction collection.
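
The second minor comment, on reporting which benchmarks feed an average, can be illustrated with a small helper that returns the benchmark list alongside the mean gain. This is only a sketch: `average_gain` is a hypothetical name, scores are assumed to be accuracies in percent, and an unweighted mean is assumed, which may not match how the paper aggregates its results.

```python
from typing import Dict, List, Tuple


def average_gain(
    base: Dict[str, float], finetuned: Dict[str, float]
) -> Tuple[float, List[str]]:
    """Return the unweighted mean score difference and the benchmarks covered.

    Scores are assumed to be accuracies in percent, so the gain is in
    percentage points. Raises if the two models were not scored on the
    same benchmarks, since the average would then be ill-defined.
    """
    if set(base) != set(finetuned):
        raise ValueError("models were evaluated on different benchmark sets")
    if not base:
        raise ValueError("no benchmarks given")
    names = sorted(base)
    deltas = [finetuned[n] - base[n] for n in names]
    return sum(deltas) / len(deltas), names
```

With placeholder inputs such as `{"bench_a": 50.0, "bench_b": 40.0}` for the base model and `{"bench_a": 60.0, "bench_b": 44.0}` for the finetuned one (made-up values, not results from the paper), the helper returns a gain of 7.0 points together with the two benchmark names, so the reported average always carries its own provenance.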

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, recognition of the work's significance, and recommendation for minor revision. We address the single major comment below.

Referee: [Experiments] Experiments section (around the 1.8K-task scaling results): the central claim that scaling the number of tasks 'dramatically improves' performance would be strengthened by an explicit ablation or curve showing performance as a function of task count (e.g., up to 1.8K), as the current reporting focuses on the final scale without isolating the contribution of task volume versus other factors like CoT data.

Authors: We agree that an explicit scaling curve with respect to task count would strengthen the central claim and better isolate the contribution of task volume from factors such as CoT data. In the revised manuscript we add a new figure (in the Experiments section) that reports average performance across the evaluation suite as a function of the number of finetuning tasks. The curve is generated by training on random subsets of the 1.8K tasks at multiple scales (e.g., 100, 500, 1K, 1.8K) while holding model size and CoT inclusion fixed; separate curves are shown for the CoT and non-CoT settings. The added analysis confirms consistent gains with increasing task count and is now referenced in the main text when discussing the 1.8K-task results.

Revision: yes
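
The subset-sampling step in the simulated response could be sketched as follows. The function name, the seed handling, and the choice to make subsets nested (each smaller subset contained in the larger ones) are assumptions of this sketch; the response only says that random subsets are drawn at several scales.

```python
import random
from typing import Dict, List, Sequence


def nested_task_subsets(
    tasks: Sequence[str], sizes: Sequence[int], seed: int = 0
) -> Dict[int, List[str]]:
    """Draw one random subset of tasks per requested size.

    The task list is shuffled once and each subset is a prefix of that
    shuffle, so smaller subsets are contained in larger ones. Nesting is
    a design choice of this sketch, not something the text specifies.
    """
    if max(sizes) > len(tasks):
        raise ValueError("requested more tasks than are available")
    order = list(tasks)
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    return {k: order[:k] for k in sorted(sizes)}
```

For instance, with a placeholder list of 1,800 task identifiers, `nested_task_subsets(tasks, [100, 500, 1000, 1800])` returns four task lists of the requested sizes. Nesting keeps the comparison between scales from being confounded by entirely different task draws, at the cost of correlating the points on the curve.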

Circularity Check

0 steps flagged

No significant circularity

The paper reports direct empirical measurements of instruction finetuning on 1.8K tasks across model scales, with performance deltas computed on held-out benchmarks (MMLU, BBH, etc.). No derivations, equations, or first-principles predictions are claimed; results are obtained by training and evaluating on standard splits. Self-citations to prior FLAN work exist but are not load-bearing for the scaling claims, which rest on new experimental data rather than reducing to fitted parameters or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical scaling study. No mathematical axioms, free parameters fitted to the central performance claims, or invented entities are introduced; standard ML training hyperparameters are used but do not constitute load-bearing free parameters for the reported gains.

pith-pipeline@v0.9.0 · 5669 in / 1213 out tokens · 48038 ms · 2026-05-12T01:17:46.327451+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
Mind2Web: Towards a Generalist Agent for the Web
cs.CL 2023-06 accept novelty 8.0

Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.
Instruction Tuning with GPT-4
cs.CL 2023-04 unverdicted novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 7.0

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
Reflective Flow Sampling Enhancement
cs.CV 2026-03 unverdicted novelty 7.0

RF-Sampling enhances flow matching models by implicitly performing gradient ascent on text-image alignment scores via linear textual combinations and flow inversion.
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI
cs.HC 2026-01 unverdicted novelty 7.0

Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
cs.SD 2026-01 unverdicted novelty 7.0

A single DiT-based diffusion model unifies video-to-audio, text-to-audio, and joint video-text-to-audio generation, supported by a new 470k-pair dataset and three-stage progressive training that resolves task competition.
MetaLint: Easy-to-Hard Generalization for Code Linting
cs.SE 2025-07 unverdicted novelty 7.0

MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.
Chronos: Learning the Language of Time Series
cs.LG 2024-03 conditional novelty 7.0

Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
cs.LG 2024-01 accept novelty 7.0

VisualWebArena benchmark demonstrates that state-of-the-art multimodal agents still exhibit significant limitations on visually grounded web tasks.
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
cs.CV 2023-12 conditional novelty 7.0

Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
cs.RO 2023-10 conditional novelty 7.0

SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
cs.CL 2023-10 conditional novelty 7.0

Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
cs.CL 2023-07 unverdicted novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
Voyager: An Open-Ended Embodied Agent with Large Language Models
cs.AI 2023-05 unverdicted novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
LIMA: Less Is More for Alignment
cs.CL 2023-05 conditional novelty 7.0

Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
cs.CV 2023-05 conditional novelty 7.0

Instruction tuning of BLIP-2 with an instruction-aware Query Transformer delivers state-of-the-art zero-shot performance on held-out vision-language datasets and strong finetuned results on downstream tasks.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
WizardLM: Empowering large pre-trained language models to follow complex instructions
cs.CL 2023-04 conditional novelty 7.0

WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
cs.CV 2023-01 unverdicted novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...
Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection
cs.LG 2026-05 unverdicted novelty 6.0

Weasel is a trajectory selection method that optimizes importance-diversity for offline web-agent training, improving out-of-domain generalization and delivering 9.7-12.5x speedups on AgentTrek, NNetNav, WebArena, Wor...
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
cs.CL 2026-05 unverdicted novelty 6.0

PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 6.0

TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
cs.LG 2026-05 unverdicted novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
cs.CL 2026-04 unverdicted novelty 6.0

POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
cs.AI 2026-04 unverdicted novelty 6.0

Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
cs.AI 2026-04 unverdicted novelty 6.0

ReSS uses decision-tree scaffolds to fine-tune LLMs for faithful tabular reasoning, reporting up to 10% gains over baselines on medical and financial data.
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
cs.AI 2026-04 unverdicted novelty 6.0

ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and fina...
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
cs.CV 2025-09 unverdicted novelty 6.0

EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluat...
GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2
cs.PL 2025-09 conditional novelty 6.0

GraphMend uses two Jaseci-based code transformations to eliminate dynamic-control-flow and side-effect graph breaks in PyTorch 2, reducing breaks to zero in six of eight Hugging Face models and yielding up to 75% late...
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
cs.CR 2025-07 unverdicted novelty 6.0

Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
Towards an AI co-scientist
cs.AI 2025-02 unverdicted novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
cs.AI 2024-08 conditional novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
DataComp-LM: In search of the next generation of training sets for language models
cs.LG 2024-06 unverdicted novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
A Survey on Vision-Language-Action Models for Embodied AI
cs.RO 2024-05 unverdicted novelty 6.0

This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
Laissez-Faire Harms: Algorithmic Biases in Generative Language Models
cs.CL 2024-04 unverdicted novelty 6.0

Generative LMs in laissez-faire open-ended prompting settings disproportionately generate subordinated portrayals of minoritized race, gender, and sexual orientation identities at rates hundreds to thousands of times ...
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
cs.CV 2024-03 unverdicted novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
cs.CL 2024-02 conditional novelty 6.0

DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection
cs.CL 2024-02 unverdicted novelty 6.0

Systematic LLM evaluation for news framing detection reveals prompt sensitivity and emotional-language bias, introduces an out-of-domain headline dataset, and shows cross-model consensus aids annotation auditing.
GPT-4V(ision) is a Generalist Web Agent, if Grounded
cs.IR 2024-01 conditional novelty 6.0

GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
An Embodied Generalist Agent in 3D World
cs.CV 2023-11 unverdicted novelty 6.0

LEO is an embodied generalist agent that performs 3D captioning, question answering, reasoning, navigation, and manipulation after 3D vision-language alignment followed by vision-language-action instruction tuning on ...
Zephyr: Direct Distillation of LM Alignment
cs.LG 2023-10 accept novelty 6.0

Zephyr-7B achieves state-of-the-art chat benchmark results among 7B models by distilling alignment via dDPO on AI feedback preferences, surpassing the 70B Llama-2-Chat model on MT-Bench with no human data required.
SALMONN: Towards Generic Hearing Abilities for Large Language Models
cs.SD 2023-10 unverdicted novelty 6.0

SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.
Aligning Large Multimodal Models with Factually Augmented RLHF
cs.CV 2023-09 conditional novelty 6.0

Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
cs.CL 2023-09 conditional novelty 6.0

MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
Simple synthetic data reduces sycophancy in large language models
cs.CL 2023-08 unverdicted novelty 6.0

Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
MiniLLM: On-Policy Distillation of Large Language Models
cs.CL 2023-06 conditional novelty 6.0

MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
cs.CL 2023-06 conditional novelty 6.0

AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.
Scaling Data-Constrained Language Models
cs.CL 2023-05 conditional novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
The False Promise of Imitating Proprietary LLMs
cs.CL 2023-05 conditional novelty 6.0

Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.