hub Canonical reference

Multitask Prompted Training Enables Zero-Shot Task Generalization

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai · 2021 · cs.LG · arXiv 2110.08207

Canonical reference. 80% of citing Pith papers cite this work as background.

50 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 50 citing papers arXiv PDF

abstract

Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero and all prompts are available at https://github.com/bigscience-workshop/promptsource.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 13 baseline 1 method 1

citation-polarity summary

background 12 support 2 baseline 1

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.

MetaLint: Easy-to-Hard Generalization for Code Linting

cs.SE · 2025-07-15 · unverdicted · novelty 7.0

MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.

Towards Agentic Runtime Healing

cs.SE · 2024-08-02 · unverdicted · novelty 7.0

Healer uses LLMs to dynamically generate and execute runtime error-handling code, with GPT-4 recovering from 72.8% of errors across four datasets.

Deep Time Series Models: A Comprehensive Survey and Benchmark

cs.LG · 2024-07-18 · unverdicted · novelty 7.0

This survey and benchmark of deep time series models using the released TSLib library finds that models with specific structures perform well only on distinct analysis tasks.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

cs.CL · 2023-09-15 · unverdicted · novelty 7.0

EvoPrompt uses LLMs to run evolutionary operators on populations of prompts, outperforming human-engineered prompts by up to 25% on BIG-Bench Hard tasks across 31 datasets.

C-Pack: Packed Resources For General Chinese Embeddings

cs.CL · 2023-09-14 · accept · novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

QLoRA: Efficient Finetuning of Quantized LLMs

cs.LG · 2023-05-23 · conditional · novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

Understanding and Accelerating the Training of Masked Diffusion Language Models

cs.LG · 2026-05-13 · conditional · novelty 6.0

Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.

Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

cs.LG · 2026-05-09 · conditional · novelty 6.0 · 2 refs

ZO-MOPI accelerates zeroth-order LLM fine-tuning by applying partial spectral orthogonalization from power iteration inside a momentum-projected subspace to reduce variance and exploit dominant directions.

Understanding the Mechanism of Altruism in Large Language Models

econ.GN · 2026-04-21 · unverdicted · novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.

Towards an AI co-scientist

cs.AI · 2025-02-26 · unverdicted · novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

MiniMax-01: Scaling Foundation Models with Lightning Attention

cs.CL · 2025-01-14 · unverdicted · novelty 6.0

MiniMax-01 models match GPT-4o and Claude-3.5-Sonnet performance while providing 20-32 times longer context windows through lightning attention and MoE scaling.

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

cs.CL · 2024-06-03 · conditional · novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

Zephyr: Direct Distillation of LM Alignment

cs.LG · 2023-10-25 · accept · novelty 6.0

Zephyr-7B achieves state-of-the-art chat benchmark results among 7B models by distilling alignment via dDPO on AI feedback preferences, surpassing the 70B Llama-2-Chat model on MT-Bench with no human data required.

Llemma: An Open Language Model For Mathematics

cs.CL · 2023-10-16 · unverdicted · novelty 6.0

Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Towards an AI co-scientist cs.AI · 2025-02-26 · unverdicted · none · ref 139 · internal anchor
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning cs.AI · 2023-01-31 · conditional · none · ref 51 · internal anchor
The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.

Multitask Prompted Training Enables Zero-Shot Task Generalization

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer