Language-model-guided program synthesis can approximate transformer attention heads with over 75% IoU fidelity on held-out data and allow replacing 25% of heads with only 16% average perplexity increase.
hub Mixed citations
Attention is not Explanation
Mixed citation behavior. Most common role is background (67%).
abstract
Attention mechanisms have seen wide adoption in neural NLP models. In addition to improving predictive performance, these are often touted as affording transparency: models equipped with attention provide a distribution over attended-to input units, and this is often presented (at least implicitly) as communicating the relative importance of inputs. However, it is unclear what relationship exists between attention weights and model outputs. In this work, we perform extensive experiments across a variety of NLP tasks that aim to assess the degree to which attention weights provide meaningful `explanations' for predictions. We find that they largely do not. For example, learned attention weights are frequently uncorrelated with gradient-based measures of feature importance, and one can identify very different attention distributions that nonetheless yield equivalent predictions. Our findings show that standard attention modules do not provide meaningful explanations and should not be treated as though they do. Code for all experiments is available at https://github.com/successar/AttentionExplanation.
hub tools
citation-role summary
citation-polarity summary
roles
background 8representative citing papers
Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.
Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.
CLIF applies influence functions to pinpoint influential samples and concepts in CBMs on CEBaB and Yelp datasets, enabling performance restoration via adjustments without retraining.
The paper proposes an architecture-aware explanation audit protocol demonstrating that perturbation-based faithfulness is bounded by structural compatibility between explainer and model readout rather than architecture family.
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.
EviSnap creates cross-domain recommendations whose scores decompose exactly into evidence-cited concept contributions via offline LLM facet extraction, clustering, and linear transfer.
CodeQ aggregates token rationales into code categories to enable global interpretability of LLMs, claiming over 50% entropy reduction and revealing model preference for syntactic cues plus human misalignment in a 37-person study.
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
Stylistic rewrites of harmful prompts raise attack success rates from 3.84% to 36.8-65% across 31 frontier models, indicating weak generalization in safety refusals.
Surrounding agents frequently degrade trajectory prediction accuracy in interactive driving scenes, and integrating a Conditional Information Bottleneck improves results by ignoring non-beneficial contextual signals.
PID applied to MLLMs identifies task-specific modality interaction profiles that generalize across models, extend to tri-modal cases, and yield initial performance gains via reweighting.
INSIGHTS creates manageable global summaries of time series model behavior by balancing sample importance and diversity with domain-specific utility functions, validated via experiments and user studies.
Analysis of transformer attention heads in abstractive summarization shows specialization in some heads and proposes a method to measure model reliance on learned attention distributions.
Compares LIME, input perturbation and attention for explaining QA on KB+text; proposes automatic evaluation paradigm and finds input perturbation superior in both automatic and human studies.
SAIL integrates anatomical priors at the representation level with semantic features via fusion to produce more anatomically aligned attribution maps in OCT without altering existing explainability techniques.
HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions than prior methods.
CONFIDE applies conformal prediction to transformer embeddings for valid prediction sets, improving accuracy up to 4.09% and efficiency over baselines on models like BERT-tiny.
citing papers explorer
-
EviSnap: Faithful Evidence-Cited Explanations for Cold-Start Cross-Domain Recommendation
EviSnap creates cross-domain recommendations whose scores decompose exactly into evidence-cited concept contributions via offline LLM facet extraction, clustering, and linear transfer.