arxiv: 1906.05714 · v1 · pith:5DATB7CFnew · submitted 2019-06-12 · 💻 cs.HC · cs.CL · cs.LG

A Multiscale Visualization of Attention in the Transformer Model

classification 💻 cs.HC cs.CL cs.LG
keywords model attention transformer mechanism tool accessible advantage approach

The Transformer is a sequence model that forgoes traditional recurrent architectures in favor of a fully attention-based approach. Besides improving performance, an advantage of using attention is that it can also help to interpret a model by showing how the model assigns weight to different input elements. However, the multi-layer, multi-head attention mechanism in the Transformer model can be difficult to decipher. To make the model more accessible, we introduce an open-source tool that visualizes attention at multiple scales, each of which provides a unique perspective on the attention mechanism. We demonstrate the tool on BERT and OpenAI GPT-2 and present three example use cases: detecting model bias, locating relevant attention heads, and linking neurons to model behavior.
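
To make concrete what "assigning weight to different input elements" means, here is a minimal sketch of scaled dot-product attention weights for a single head. It is an illustration only, not the tool described in the abstract: the tokens, the two-dimensional query and key vectors, and the function names `softmax` and `attention_weights` are all hypothetical. A visualization tool would render one such matrix per layer and head.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Return one row of weights per query: softmax(q . k / sqrt(d))."""
    scale = math.sqrt(len(keys[0]))
    rows = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        rows.append(softmax(scores))
    return rows


# Hypothetical toy inputs (not taken from BERT or GPT-2).
tokens = ["the", "cat", "sat"]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = attention_weights(queries, keys)

# Print the weight matrix: row = attending token, column = attended token.
for token, row in zip(tokens, weights):
    print(f"{token:>4}: " + "  ".join(f"{w:.2f}" for w in row))
```

Each printed row shows how one token distributes its attention over the sequence. A real model has such a matrix for every layer and every head, which is why the abstract describes the mechanism as hard to decipher without a multiscale view.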

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

    cs.IR 2024-09 unverdicted novelty 7.0

    Introduces Trust-RAG Compass framework and TRC Bench benchmark to assess RAG trustworthiness across factuality, robustness, fairness, transparency, accountability, and privacy, with evaluations showing performance gap...

  2. In-context Learning and Induction Heads

    cs.LG 2022-09 unverdicted novelty 7.0

    Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...

  3. Rethinking Attention with Performers

    cs.LG 2020-09 unverdicted novelty 7.0

    Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...

  4. Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

    cs.CL 2026-02 unverdicted novelty 6.0

    CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.

  5. TIDE: Every Layer Knows the Token Beneath the Context

    cs.CL 2026-05 unverdicted novelty 5.0

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

  6. Realistic Channel Models Pre-training

    eess.SP 2019-07 unverdicted novelty 5.0

    A self-supervised pre-trained neural network with multi-domain channel embedding and self-attention is proposed to create realistic wireless channel models combining deterministic accuracy and stochastic uniformity.

  7. DA-Cramming: Enhancing Cost-Effective Language Model Pretraining with Dependency Agreement Integration

    cs.CL 2023-11 unverdicted novelty 4.0

    DA-Cramming inserts chunk-level dependency agreement embeddings into a dual-stage pretraining pipeline and reports better downstream performance than prior Cramming baselines.