A Multiscale Visualization of Attention in the Transformer Model

Jesse Vig · 2019 · cs.HC · arXiv 1906.05714

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

open full Pith review browse 7 citing papers arXiv PDF

abstract

The Transformer is a sequence model that forgoes traditional recurrent architectures in favor of a fully attention-based approach. Besides improving performance, an advantage of using attention is that it can also help to interpret a model by showing how the model assigns weight to different input elements. However, the multi-layer, multi-head attention mechanism in the Transformer model can be difficult to decipher. To make the model more accessible, we introduce an open-source tool that visualizes attention at multiple scales, each of which provides a unique perspective on the attention mechanism. We demonstrate the tool on BERT and OpenAI GPT-2 and present three example use cases: detecting model bias, locating relevant attention heads, and linking neurons to model behavior.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

cs.IR · 2024-09-16 · unverdicted · novelty 7.0

Introduces Trust-RAG Compass framework and TRC Bench benchmark to assess RAG trustworthiness across factuality, robustness, fairness, transparency, accountability, and privacy, with evaluations showing performance gaps between LLMs.

In-context Learning and Induction Heads

cs.LG · 2022-09-24 · unverdicted · novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.

Rethinking Attention with Performers

cs.LG · 2020-09-30 · unverdicted · novelty 7.0

Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

cs.CL · 2026-02-18 · unverdicted · novelty 6.0

CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.

Realistic Channel Models Pre-training

eess.SP · 2019-07-22 · unverdicted · novelty 5.0

A self-supervised pre-trained neural network with multi-domain channel embedding and self-attention is proposed to create realistic wireless channel models combining deterministic accuracy and stochastic uniformity.

TIDE: Every Layer Knows the Token Beneath the Context

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

DA-Cramming: Enhancing Cost-Effective Language Model Pretraining with Dependency Agreement Integration

cs.CL · 2023-11-08 · unverdicted · novelty 4.0

DA-Cramming inserts chunk-level dependency agreement embeddings into a dual-stage pretraining pipeline and reports better downstream performance than prior Cramming baselines.

citing papers explorer

Showing 7 of 7 citing papers.

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey cs.IR · 2024-09-16 · unverdicted · none · ref 121 · internal anchor
Introduces Trust-RAG Compass framework and TRC Bench benchmark to assess RAG trustworthiness across factuality, robustness, fairness, transparency, accountability, and privacy, with evaluations showing performance gaps between LLMs.
In-context Learning and Induction Heads cs.LG · 2022-09-24 · unverdicted · none · ref 8
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
Rethinking Attention with Performers cs.LG · 2020-09-30 · unverdicted · none · ref 155
Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and protein tasks.
Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models cs.CL · 2026-02-18 · unverdicted · none · ref 36 · internal anchor
CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.
Realistic Channel Models Pre-training eess.SP · 2019-07-22 · unverdicted · none · ref 11 · internal anchor
A self-supervised pre-trained neural network with multi-domain channel embedding and self-attention is proposed to create realistic wireless channel models combining deterministic accuracy and stochastic uniformity.
TIDE: Every Layer Knows the Token Beneath the Context cs.CL · 2026-05-07 · unverdicted · none · ref 70
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
DA-Cramming: Enhancing Cost-Effective Language Model Pretraining with Dependency Agreement Integration cs.CL · 2023-11-08 · unverdicted · none · ref 28 · internal anchor
DA-Cramming inserts chunk-level dependency agreement embeddings into a dual-stage pretraining pipeline and reports better downstream performance than prior Cramming baselines.

A Multiscale Visualization of Attention in the Transformer Model

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer