What Does BERT Look At? An Analysis of BERT's Attention

Kevin Clark , Urvashi Khandelwal , Omer Levy , Christopher D. Manning

classification cs.CL

keywords bert, attention, heads, analysis, attending, further, language, model

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention.
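
The two kinds of measurement the abstract mentions (how much attention goes to delimiter tokens, and how accurately a single head points at a linguistically expected word) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy attention map, and the `gold` targets are all hypothetical, and the map is hand-written rather than extracted from BERT.

```python
from typing import List, Sequence

# attn[i][j] is the attention weight from token i to token j (one head).
Attention = List[List[float]]


def delimiter_attention_share(attn: Attention, tokens: Sequence[str],
                              delimiters=("[CLS]", "[SEP]")) -> float:
    """Average share of attention mass that tokens place on delimiters."""
    delim_idx = [j for j, tok in enumerate(tokens) if tok in delimiters]
    shares = [sum(row[j] for j in delim_idx) for row in attn]
    return sum(shares) / len(shares)


def most_attended_accuracy(attn: Attention, gold: Sequence[int]) -> float:
    """Treat one head as a classifier: each token predicts the other token
    it attends to most. gold[i] is the expected index, or -1 to skip."""
    correct = 0
    total = 0
    for i, row in enumerate(attn):
        if gold[i] < 0:
            continue
        candidates = [j for j in range(len(row)) if j != i]
        pred = max(candidates, key=lambda j: row[j])
        correct += int(pred == gold[i])
        total += 1
    return correct / total if total else 0.0


# Toy, hand-written attention map (assumption: not real BERT output).
tokens = ["[CLS]", "she", "read", "books", "[SEP]"]
attn = [
    [0.6, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.6, 0.1],
    [0.1, 0.1, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.6],
]
gold = [-1, 2, 3, 1, -1]  # hypothetical expected targets; -1 means "skip"

delim_share = delimiter_attention_share(attn, tokens)
accuracy = most_attended_accuracy(attn, gold)
print(f"delimiter share: {delim_share:.2f}, head accuracy: {accuracy:.2f}")
```

In practice the attention map would come from a pre-trained model and the targets from an annotated corpus; the sketch only shows the shape of the computation for a single head.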

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
cs.LG 2026-05 unverdicted novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

A framework for analyzing concept representations in neural models
cs.CL 2026-05 unverdicted novelty 7.0

A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled fr...

MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation
cs.CL 2026-05 unverdicted novelty 7.0

MTA improves LLM knowledge distillation by aligning representations along layer-wise trajectories with adaptive granularity from words to phrases using dynamic structural and hidden representation alignment losses.

HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads
cs.IR 2026-04 unverdicted novelty 7.0

HeadRank improves decoding-free passage reranking by preference-aligning attention heads to increase discriminability in middle-context documents, outperforming baselines on 14 benchmarks with only 211 training queries.

In-context Learning and Induction Heads
cs.LG 2022-09 unverdicted novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...

Longformer: The Long-Document Transformer
cs.CL 2020-04 accept novelty 7.0

Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.

Why Retrieval-Augmented Generation Fails: A Graph Perspective
cs.CL 2026-05 unverdicted novelty 6.0

Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
cs.LG 2026-05 unverdicted novelty 6.0

Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training...

Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
cs.SE 2026-04 unverdicted novelty 6.0

Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

From Words to Amino Acids: Does the Curse of Depth Persist?
cs.LG 2026-02 unverdicted novelty 6.0

Protein language models exhibit consistent depth inefficiency where most task-relevant computation occurs in a subset of layers, mirroring patterns in large language models.

TIDE: Every Layer Knows the Token Beneath the Context
cs.CL 2026-05 unverdicted novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL 2025-02 unverdicted novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

Generating Synthetic Malware Samples Using Generative AI
cs.LG 2026-04 conditional novelty 4.0

Opcode-sequence generative models produce synthetic malware data that raises minor-class classification accuracy by up to 60% and overall detection to 96%.