SmoothGrad: removing noise by adding noise
Smilkov, D., Thorat, N., Kim, B., Viégas, F. & Wattenberg, M.
17 Pith papers cite this work.
abstract
Explaining the output of a deep network remains a challenge. In the case of an image classifier, one type of explanation is to identify pixels that strongly influence the final decision. A starting point for this strategy is the gradient of the class score function with respect to the input image. This gradient can be interpreted as a sensitivity map, and there are several techniques that elaborate on this basic idea. This paper makes two contributions: it introduces SmoothGrad, a simple method that can help visually sharpen gradient-based sensitivity maps, and it discusses lessons in the visualization of these maps. We publish the code for our experiments and a website with our results.
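The method the abstract introduces has just two hyperparameters: the sample count n and the noise level, which the paper expresses relative to the input's dynamic range as σ/(x_max − x_min). Below is a minimal PyTorch sketch of that recipe, averaging the vanilla gradient map over noisy copies of the input; the function name, model interface, and default values are our illustration, not the paper's released code.

```python
import torch

def smoothgrad(model, image, target_class, n_samples=50, sigma=0.15):
    """Average the vanilla gradient sensitivity map over n noisy copies of the input."""
    model.eval()
    # The paper parameterizes noise as sigma / (x_max - x_min), so scale the
    # standard deviation to the input's dynamic range.
    noise_std = sigma * (image.max() - image.min())
    grad_sum = torch.zeros_like(image)
    for _ in range(n_samples):
        noisy = (image + noise_std * torch.randn_like(image)).requires_grad_(True)
        class_score = model(noisy.unsqueeze(0))[0, target_class]  # pre-softmax score
        class_score.backward()
        grad_sum += noisy.grad
    # Collapse channels (assuming a CHW image tensor) to a single 2-D map.
    return (grad_sum / n_samples).abs().amax(dim=0)
```

Setting n_samples=1 and sigma=0.0 recovers the plain gradient map the abstract starts from; the averaging over noisy copies is what visually sharpens it.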
citing papers
- AGOP as Explanation: From Feature Learning to Per-Sample Attribution in Image Classifiers
  AGOP-based attribution methods outperform Integrated Gradients and other baselines on pixel-level ground truth benchmarks for explaining image classifier decisions, with AGOP-Global offering zero inference cost.
- Attributions All the Way Down? The Metagame of Interpretability
  Defines meta-attributions as directional second-order Shapley values on attribution methods, proves hierarchical decomposition of attributions, and demonstrates applications in language models, vision-language encoders, and diffusion transformers.
- From Local to Global to Mechanistic: An iERF-Centered Unified Framework for Interpreting Vision Models
  An iERF-centric framework unifies local, global, and mechanistic interpretability in vision models via SRD for saliency, CAFE for concept anchoring, and ICAT for interlayer attribution.
- Low Rank Adaptation for Adversarial Perturbation
  Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
- Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning
  A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
- The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
  The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.
- Instructions Shape Production of Language, not Processing
  Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
- Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality
  Scaling vision models by depth and parameter count does not consistently improve localisation-based explanation quality across architectures, datasets, and post-hoc methods; smaller models often perform comparably or better.
- H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
  H-Sets detects higher-order feature interactions in image classifiers via Hessian-guided pair merging and attributes them with IDG-Vis to generate more interpretable saliency maps than existing marginal or coarse methods.
- When Can We Trust Deep Neural Networks? Towards Reliable Industrial Deployment with an Interpretability Guide
  A new reliability score computed from the IoU difference between class-specific and class-agnostic heatmaps, boosted by adversarial enhancement, detects false negatives in binary industrial defect detectors with up to 100% recall.
- Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
  Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
- MONAI: An open-source framework for deep learning in healthcare
  MONAI is a community-supported PyTorch framework that extends deep learning to medical data with domain-specific architectures, transforms, and deployment tools.
- CAMAL: Improving Attention Alignment and Faithfulness with Segmentation Masks
  CAMAL adds an auxiliary regularizer during training that aligns model attention with segmentation masks, improving both the spatial accuracy and the causal faithfulness of attention in deep learning and deep reinforcement learning vision models.
- Path-Sampled Integrated Gradients
  Path-sampled integrated gradients generalizes integrated gradients by averaging gradients at points sampled along the baseline-to-input path, proving equivalence to a weighted version that improves the convergence rate to O(m^{-1}) and reduces variance by a factor of 1/3 under uniform sampling (see the sketch after this list).
- Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making
  This survey synthesizes XAI methods with surrogate modeling workflows for simulations and outlines a research agenda to embed explainability into simulation-driven design and decision-making.
- Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models
  A ViT-LSTM spatiotemporal model detects surgical instrument handovers and classifies direction in videos, achieving F1 of 0.84 for detection and 0.72 mean F1 for direction on kidney transplant data.
- PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models
  PECKER uses a saliency mask to prioritize parameter updates in distillation-based unlearning, achieving shorter training times for class and concept forgetting on CIFAR-10 and STL-10 while matching prior methods' efficacy.
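The Path-Sampled Integrated Gradients entry above describes its estimator only at a high level, so the sketch below shows the plain Monte-Carlo form of integrated gradients it generalizes: draw interpolation coefficients uniformly on [0, 1] instead of stepping a fixed Riemann grid, and average the resulting gradients. The weighted variant with the stated O(m^{-1}) rate is not reproduced here; the function name, model interface, and defaults are our assumptions.

```python
import torch

def sampled_integrated_gradients(model, x, baseline, target_class, m=64):
    """IG_i ≈ (x_i - x'_i) · mean over alpha ~ U(0,1) of dF/dx_i at x' + alpha(x - x')."""
    model.eval()
    grad_sum = torch.zeros_like(x)
    for alpha in torch.rand(m):  # sampled points along the baseline-to-input path
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        class_score = model(point.unsqueeze(0))[0, target_class]
        class_score.backward()
        grad_sum += point.grad
    # Completeness (attributions summing to F(x) - F(x')) holds in expectation.
    return (x - baseline) * grad_sum / m
```

With a fixed grid of alphas in place of torch.rand(m), this reduces to the standard Riemann-sum approximation of integrated gradients.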