A Multiscale Visualization of Attention in the Transformer Model

Jesse Vig

arxiv: 1906.05714 · v1 · pith:5DATB7CFnew · submitted 2019-06-12 · 💻 cs.HC · cs.CL · cs.LG

classification 💻 cs.HC cs.CL cs.LG

keywords model attention transformer mechanism tool accessible advantage approach

The Transformer is a sequence model that forgoes traditional recurrent architectures in favor of a fully attention-based approach. Besides improving performance, an advantage of using attention is that it can also help to interpret a model by showing how the model assigns weight to different input elements. However, the multi-layer, multi-head attention mechanism in the Transformer model can be difficult to decipher. To make the model more accessible, we introduce an open-source tool that visualizes attention at multiple scales, each of which provides a unique perspective on the attention mechanism. We demonstrate the tool on BERT and OpenAI GPT-2 and present three example use cases: detecting model bias, locating relevant attention heads, and linking neurons to model behavior.
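As a rough illustration of what such a tool displays, the sketch below computes scaled dot-product attention weights for two heads over three tokens. It is not the paper's tool or code: the query and key vectors are made-up numbers rather than model output, the helper names (`attention_weights`, `split_heads`) are hypothetical, and slicing each vector into per-head chunks stands in for the learned per-head projections of a real Transformer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights for one head.

    Row i holds the weights that token i assigns to every token j.
    """
    dim = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights.append(softmax(scores))
    return weights


def split_heads(vectors, num_heads):
    """Slice each token vector into num_heads equal chunks, one per head."""
    head_dim = len(vectors[0]) // num_heads
    return [
        [vec[h * head_dim:(h + 1) * head_dim] for vec in vectors]
        for h in range(num_heads)
    ]


# Toy query/key vectors for three tokens (assumed values, not model output).
tokens = ["the", "cat", "sat"]
queries = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.9], [0.7, 0.7, 0.1, 0.4]]
keys = [[0.9, 0.1, 0.4, 0.0], [0.2, 0.8, 0.6, 1.0], [0.5, 0.5, 0.2, 0.3]]

# One weight matrix per head; a visualization would draw each as lines or a grid.
per_head = [
    attention_weights(q, k)
    for q, k in zip(split_heads(queries, 2), split_heads(keys, 2))
]

for head_index, weights in enumerate(per_head):
    print(f"head {head_index}")
    for token, row in zip(tokens, weights):
        print(f"  {token:>4}: " + " ".join(f"{w:.2f}" for w in row))
```

Each row sums to 1, so it can be read as how one token distributes its attention over the sequence. A model with many layers and heads produces one such matrix per layer and head, which is why a multiscale view becomes useful.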

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey
cs.IR 2024-09 unverdicted novelty 7.0

Introduces Trust-RAG Compass framework and TRC Bench benchmark to assess RAG trustworthiness across factuality, robustness, fairness, transparency, accountability, and privacy, with evaluations showing performance gap...
In-context Learning and Induction Heads
cs.LG 2022-09 unverdicted novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...
Rethinking Attention with Performers
cs.LG 2020-09 unverdicted novelty 7.0

Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...
Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
cs.CL 2026-02 unverdicted novelty 6.0

CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.
TIDE: Every Layer Knows the Token Beneath the Context
cs.CL 2026-05 unverdicted novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
Realistic Channel Models Pre-training
eess.SP 2019-07 unverdicted novelty 5.0

A self-supervised pre-trained neural network with multi-domain channel embedding and self-attention is proposed to create realistic wireless channel models combining deterministic accuracy and stochastic uniformity.
DA-Cramming: Enhancing Cost-Effective Language Model Pretraining with Dependency Agreement Integration
cs.CL 2023-11 unverdicted novelty 4.0

DA-Cramming inserts chunk-level dependency agreement embeddings into a dual-stage pretraining pipeline and reports better downstream performance than prior Cramming baselines.