pith. machine review for the scientific record.

arxiv: 1606.06031 · v1 · submitted 2016-06-20 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno (1), Germán Kruszewski (1), Angeliki Lazaridou (1), Quan Ngoc Pham (1), Raffaella Bernardi (1), Sandro Pezzelle (1), Marco Baroni (1), Gemma Boleda (1), Raquel Fernández (2) ((1) CIMeC - Center for Mind/Brain Sciences, University of Trento, (2) Institute for Logic, Language & Computation, University of Amsterdam)

Authors on Pith: no claims yet
classification 💻 cs.CL cs.AI cs.LG
keywords lambada, models, word, context, able, broad, computational, dataset
Original abstract

We introduce LAMBADA, a dataset to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse. We show that LAMBADA exemplifies a wide range of linguistic phenomena, and that none of several state-of-the-art language models reaches accuracy above 1% on this novel benchmark. We thus propose LAMBADA as a challenging test set, meant to encourage the development of new models capable of genuine understanding of broad context in natural language text.
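
The task the abstract describes is simply last-word prediction over a narrative passage. As a minimal sketch (not the authors' evaluation code), the snippet below scores a causal language model on one LAMBADA-style example using the Hugging Face `transformers` library; the `gpt2` checkpoint, the whitespace-based target split, and the sample passage are all illustrative assumptions, not part of the original benchmark protocol.

```python
# Hypothetical sketch of LAMBADA-style last-word prediction with a causal LM.
# Assumes the Hugging Face `transformers` library; model and passage are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def last_word_correct(passage: str) -> bool:
    """Greedy-predict the final word of `passage` from the preceding context."""
    context, target = passage.rsplit(" ", 1)          # hold out the target word
    inputs = tokenizer(context, return_tensors="pt")
    with torch.no_grad():
        # Generate a few tokens; one word is usually several subword tokens.
        out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    # Keep only the newly generated tokens, then take the first whole word.
    continuation = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:])
    predicted = continuation.strip().split()[0].strip(".,!?\"'")
    return predicted == target.strip(".,!?\"'")

# A model that sees only the final sentence has little chance here; the answer
# is recoverable only from the earlier discourse, which is LAMBADA's point.
passage = ("She put the kettle on the stove and went to find her book. "
           "When the whistle sounded she hurried back to the stove")
print(last_word_correct(passage))
```

Dataset-level accuracy is then just the mean of this check over all passages, with the whole passage versus only its last sentence given as context.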

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  2. Scaling Laws for Mixture Pretraining Under Data Constraints

    cs.LG 2026-05 conditional novelty 7.0

    Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.

  3. Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory

    cs.LG 2026-05 unverdicted novelty 7.0

    KVM is a novel block-recurrent compressed memory for attention that unifies expandable transformer context with linear RNN efficiency, enabling competitive long-context performance with released code and models.

  4. MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

    cs.CL 2026-04 unverdicted novelty 7.0

    MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.

  5. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    cs.LG 2022-10 unverdicted novelty 7.0

    GPTQ quantizes 175B-parameter GPT models to 3-4 bits per weight in one shot using approximate second-order information, achieving negligible accuracy degradation and 3-4x inference speedups.

  6. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    cs.CL 2019-09 unverdicted novelty 7.0

    Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

  7. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  8. Mixture of Heterogeneous Grouped Experts for Language Modeling

    cs.CL 2026-04 unverdicted novelty 6.0

    MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.

  9. Strix: Re-thinking NPU Reliability from a System Perspective

    cs.AR 2026-04 unverdicted novelty 6.0

    Strix delivers sub-microsecond fault localisation, detection, and correction on NPUs with 1.04x slowdown and minimal hardware cost by system-level re-partitioning and targeted safeguards.

  10. Attention to Mamba: A Recipe for Cross-Architecture Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.

  11. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  12. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  13. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  14. Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory

    cs.LG 2026-05 unverdicted novelty 5.0

    KVM is a novel attention block-recurrence that unifies transformer expandability with linear RNN efficiency, enabling O(N) fixed-state operation or sublinear state growth with subquadratic prefill using only standard ...

  15. Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

  16. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

    cs.LG 2026-04 unverdicted novelty 5.0

    ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

  17. Aging Aware Adaptive Voltage Scaling for Reliable and Efficient AI Accelerators

    cs.AR 2026-04 unverdicted novelty 5.0

    An aging-aware adaptive voltage scaling framework for AI accelerators reduces predicted threshold voltage shifts by ~19% and aging degradation by up to 46% while saving 14% lifetime power by leveraging neural network ...

  18. Gemma: Open Models Based on Gemini Research and Technology

    cs.CL 2024-03 accept novelty 4.0

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  19. Gemma 2: Improving Open Language Models at a Practical Size

    cs.CL 2024-07 conditional novelty 3.0

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.