Attention is not Explanation

2019 · cs.CL · arXiv 1902.10186

Mixed citation behavior. The most common role is background (67% of classified citations).

23 Pith papers cite it.

abstract

Attention mechanisms have seen wide adoption in neural NLP models. In addition to improving predictive performance, these are often touted as affording transparency: models equipped with attention provide a distribution over attended-to input units, and this is often presented (at least implicitly) as communicating the relative importance of inputs. However, it is unclear what relationship exists between attention weights and model outputs. In this work, we perform extensive experiments across a variety of NLP tasks that aim to assess the degree to which attention weights provide meaningful "explanations" for predictions. We find that they largely do not. For example, learned attention weights are frequently uncorrelated with gradient-based measures of feature importance, and one can identify very different attention distributions that nonetheless yield equivalent predictions. Our findings show that standard attention modules do not provide meaningful explanations and should not be treated as though they do. Code for all experiments is available at https://github.com/successar/AttentionExplanation.
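The comparison the abstract describes, attention weights against a gradient-based measure of feature importance, can be illustrated with a toy sketch. This is not the authors' code (that lives in the repository linked above). The one-layer scalar "model", the names `query_weight` and `output_weight`, the finite-difference gradient, and the choice of a Kendall-style rank correlation are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(hidden, query_weight):
    """Toy attention: one scalar per token, scored by a single weight (hypothetical)."""
    return softmax([query_weight * h for h in hidden])


def predict(hidden, query_weight, output_weight):
    """Attention-weighted sum of the token values, passed through a sigmoid."""
    weights = attention_weights(hidden, query_weight)
    context = sum(a * h for a, h in zip(weights, hidden))
    return 1.0 / (1.0 + math.exp(-output_weight * context))


def gradient_importance(hidden, query_weight, output_weight, eps=1e-5):
    """|d output / d hidden_i| for each token, by central finite differences."""
    grads = []
    for i in range(len(hidden)):
        up = list(hidden)
        down = list(hidden)
        up[i] += eps
        down[i] -= eps
        delta = predict(up, query_weight, output_weight) - predict(
            down, query_weight, output_weight
        )
        grads.append(abs(delta) / (2 * eps))
    return grads


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up token values; not data from the paper.
hidden = [0.2, -1.3, 0.9, 1.7, -0.4]
attn = attention_weights(hidden, query_weight=1.5)
grads = gradient_importance(hidden, query_weight=1.5, output_weight=2.0)
tau = kendall_tau(attn, grads)
print("attention:", [round(a, 3) for a in attn])
print("gradient importance:", [round(g, 3) for g in grads])
print("rank correlation:", round(tau, 3))
```

A correlation near 1 would mean the two rankings of tokens agree; values near 0 or below would mean the attention ranking says little about which inputs the output is sensitive to. The toy model only shows how such a comparison is set up and makes no claim about what real models exhibit.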

citation-role summary

background 9

citation-polarity summary

background 6 · support 3

representative citing papers

Explaining Attention with Program Synthesis

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.

Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters

cs.LG · 2026-05-07 · accept · novelty 7.0

Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0 · 2 refs

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

CLIF applies influence functions to pinpoint influential samples and concepts in CBMs on CEBaB and Yelp datasets, enabling performance restoration via adjustments without retraining.

Architecture-Aware Explanation Auditing for Industrial Visual Inspection

cs.LG · 2026-05-14 · unverdicted · novelty 6.0 · 3 refs

The paper proposes an architecture-aware explanation audit protocol demonstrating that perturbation-based faithfulness is bounded by structural compatibility between explainer and model readout rather than architecture family.

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures

cs.CL · 2026-04-28 · unverdicted · novelty 6.0

Shapley value analysis identifies powerful adjectives that steer MMLU performance in model-family-specific patterns, with non-additive interactions emerging in larger models.

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

cs.CL · 2026-02-18 · unverdicted · novelty 6.0

CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.

EviSnap: Faithful Evidence-Cited Explanations for Cold-Start Cross-Domain Recommendation

cs.IR · 2026-01-09 · unverdicted · novelty 6.0

EviSnap creates cross-domain recommendations whose scores decompose exactly into evidence-cited concept contributions via offline LLM facet extraction, clustering, and linear transfer.

Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation

cs.SE · 2025-03-21 · unverdicted · novelty 6.0

CodeQ aggregates token rationales into code categories to enable global interpretability of LLMs, claiming over 50% entropy reduction and revealing model preference for syntactic cues plus human misalignment in a 37-person study.

Large Vision-Language Models Get Lost in Attention

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

Stylistic rewrites of harmful prompts raise attack success rates from 3.84% to 36.8-65% across 31 frontier models, indicating weak generalization in safety refusals.

Super Agents and Confounders: Influence of surrounding agents on vehicle trajectory prediction

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

Surrounding agents frequently degrade trajectory prediction accuracy in interactive driving scenes, and integrating a Conditional Information Bottleneck improves results by ignoring non-beneficial contextual signals.

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

cs.AI · 2026-05-31 · unverdicted · novelty 5.0

PID applied to MLLMs identifies task-specific modality interaction profiles that generalize across models, extend to tri-modal cases, and yield initial performance gains via reweighting.

INSIGHTS: Demonstration-Based Summaries of Time Series Predictors

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

INSIGHTS creates manageable global summaries of time series model behavior by balancing sample importance and diversity with domain-specific utility functions, validated via experiments and user studies.

Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?

cs.CL · 2019-07-01 · unverdicted · novelty 5.0

Analysis of transformer attention heads in abstractive summarization shows specialization in some heads and proposes a method to measure model reliance on learned attention distributions.

Interpretable Question Answering on Knowledge Bases and Text

cs.CL · 2019-06-26 · unverdicted · novelty 5.0

Compares LIME, input perturbation and attention for explaining QA on KB+text; proposes automatic evaluation paradigm and finds input perturbation superior in both automatic and human studies.

SAIL: Structure-Aware Interpretable Learning for Anatomy-Aligned Post-hoc Explanations in OCT

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

SAIL integrates anatomical priors at the representation level with semantic features via fusion to produce more anatomically aligned attribution maps in OCT without altering existing explainability techniques.

Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

cs.CL · 2026-04-14 · unverdicted · novelty 5.0

HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions than prior methods.

Uncertainty-Aware Transformers: Conformal Prediction for Language Models

cs.LG · 2026-04-10 · unverdicted · novelty 5.0

CONFIDE applies conformal prediction to transformer embeddings for valid prediction sets, improving accuracy up to 4.09% and efficiency over baselines on models like BERT-tiny.

Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions

cs.CY · 2026-02-27 · 2 refs

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

cs.LG · 2025-08-06
