pith. machine review for the scientific record.

arxiv: 2604.11467 · v1 · submitted 2026-04-13 · 💻 cs.AI · cs.HC · cs.LG

Recognition: unknown

From Attribution to Action: A Human-Centered Application of Activation Steering

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3

classification 💻 cs.AI · cs.HC · cs.LG
keywords activation steering · explainable AI · model debugging · human-AI interaction · CLIP · vision models · sparse autoencoders · attribution methods

The pith

Activation steering paired with attribution lets practitioners test hypotheses about model behavior through direct interventions instead of passive inspection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an interactive workflow that combines sparse autoencoder attribution with activation steering for instance-level analysis in vision models such as CLIP. Expert interviews with eight practitioners performing debugging tasks show that steering shifts users from inspecting explanations to actively intervening and observing resulting model changes. Most participants built trust from those observed responses rather than from the initial attributions alone, while favoring systematic suppression of components and noting risks like unintended ripple effects across predictions.

Core claim

The paper claims that activation steering renders interpretability actionable by enabling intervention-based hypothesis testing, as demonstrated when eight experts used the workflow on CLIP to debug concept usage, grounded their trust primarily in model output changes, adopted suppression-dominated strategies, and identified practical limits including ripple effects and poor generalization of instance-level fixes.

What carries the argument

The interactive web-based workflow that combines SAE-based attribution with activation steering to support instance-level concept analysis and targeted interventions in vision models.
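
To make this concrete, here is a minimal, self-contained sketch of the attribution-then-intervene loop in PyTorch. Everything in it is a toy stand-in under stated assumptions: `encoder`, `head`, and `ToySAE` are invented modules, not the authors' tool, and gradient-times-activation is one plausible scoring rule, not necessarily the paper's attribution method.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

D_IN, D_MODEL, D_SAE, N_CLASSES = 32, 64, 256, 10

# Toy stand-ins for the real pipeline: a CLIP-like backbone and a
# classification head. Nothing here is the authors' implementation.
encoder = nn.Linear(D_IN, D_MODEL)
head = nn.Linear(D_MODEL, N_CLASSES)

class ToySAE(nn.Module):
    """Sparse autoencoder bottleneck: z = relu(W_enc h + b), h_hat = W_dec z."""

    def __init__(self, d_model, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model, bias=False)

    def encode(self, h):
        return torch.relu(self.enc(h))

    def decode(self, z):
        return self.dec(z)

sae = ToySAE(D_MODEL, D_SAE)

def predict(x, suppress=None, scale=0.0):
    """Forward pass through the SAE bottleneck, optionally steering
    (rescaling) the activations of the selected SAE components."""
    z = sae.encode(encoder(x))
    if suppress is not None:
        z = z.clone()
        z[:, suppress] *= scale  # scale=0 suppresses, scale>1 would amplify
    return head(sae.decode(z))

x = torch.randn(1, D_IN)                   # one instance under inspection
pred = predict(x).argmax(dim=-1).item()

# Step 1, attribution: score each SAE component for the predicted class.
# Gradient-times-activation is one simple scoring rule, chosen for brevity.
z = sae.encode(encoder(x)).detach().requires_grad_(True)
head(sae.decode(z))[0, pred].backward()
attribution = (z.grad * z).squeeze(0)
top = attribution.topk(5).indices          # candidate components to test

# Step 2, intervention: suppress those components and observe whether the
# prediction actually changes, turning the attribution into a tested
# causal hypothesis instead of a passive explanation.
with torch.no_grad():
    steered = predict(x, suppress=top, scale=0.0).argmax(dim=-1).item()
print(f"prediction before steering: {pred}, after suppression: {steered}")
```

The point the sketch tries to capture is the workflow's shift in stance: the attribution only nominates components, and the suppression step is what tests whether they actually carry the prediction.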

If this is right

  • Users can verify attributions by steering components and directly observing prediction changes.
  • Debugging workflows become dominated by targeted suppression of identified components.
  • Trust in the method rests more on empirical model responses than on the initial attribution quality.
  • Instance-level steering corrections may not transfer reliably to other inputs or models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The workflow could support iterative model refinement loops where users steer, observe, and repeat until behavior stabilizes.
  • Risks like ripple effects suggest the need for batch-level steering checks before deploying instance fixes (see the sketch after this list).
  • Extending the approach beyond vision to language or multimodal models would require new attribution-steering interfaces.
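
A hedged sketch of what such a batch-level check could look like, continuing the toy setup above (the held-out batch, the `top` components, and the tolerance threshold are all illustrative assumptions, not values from the paper):

```python
# Continues the toy setup above. Before adopting the instance-level fix
# (suppressing the components in `top`), measure how many other inputs
# change prediction under the same intervention: a crude ripple proxy.
check_batch = torch.randn(512, D_IN)       # hypothetical held-out inputs

with torch.no_grad():
    base = predict(check_batch).argmax(dim=-1)
    fixed = predict(check_batch, suppress=top, scale=0.0).argmax(dim=-1)

ripple_rate = (base != fixed).float().mean().item()
print(f"predictions changed on {ripple_rate:.1%} of the check batch")

RIPPLE_TOLERANCE = 0.05                    # illustrative threshold, not from the paper
if ripple_rate > RIPPLE_TOLERANCE:
    print("fix looks instance-specific; steering it in may not be safe")
```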

Load-bearing premise

Insights drawn from eight experts performing specific debugging tasks on CLIP reflect how practitioners would generally reason about and apply activation steering in other settings.

What would settle it

A larger study with more diverse practitioners and models where most users continue to rely on explanation plausibility for trust rather than shifting to intervention-based testing.

Figures

Figures reproduced from arXiv: 2604.11467 by Katharina Weitz, Maximilian Dreyer, Sebastian Lapuschkin, Tobias Labarta, Wojciech Samek.

Figure 1. The four-step workflow from attribution to action: practitioners review component attributions, form causal hypotheses about …
Figure 2. SemanticLens workflow implementation. Users select inspection samples (1), review components ranked by attribution (2), …
Figure 3. A process diagram of the interview structure: pre-questionnaire, two debugging tasks each with an attribution-only phase (Phase 1) …
Figure 4. Two debugging tasks: Task 1 (typographic attack on CLIP ViT-B-32, where overlaid text causes misclassification) and Task 2 …
Original abstract

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces an interactive web-based workflow that integrates Sparse Autoencoder (SAE) attribution with activation steering for instance-level analysis of concept usage in vision models such as CLIP. The workflow is evaluated via semi-structured expert interviews (N=8) on debugging tasks, yielding the claims that steering shifts all participants (8/8) from inspection to intervention-based hypothesis testing, that most (6/8) ground trust in observed model outputs rather than explanation plausibility, that component suppression dominates strategies (7/8), and that users identify risks including ripple effects and limited generalization of instance-level corrections.

Significance. If the qualitative patterns are robust, the work provides timely human-centered evidence that activation steering can convert passive XAI attributions into actionable interventions. The explicit reporting of user strategies and risks (e.g., ripple effects) adds practical value often missing from purely technical steering papers. The implementation of a concrete tool and the focus on real debugging tasks strengthen the bridge from attribution to action, offering design implications for future interpretability systems.

major comments (2)
  1. [Section 4] User Study / Methodology: The description of the semi-structured interview protocol, participant selection criteria, exact debugging tasks, interview guide, qualitative coding process, and any bias-mitigation steps (e.g., inter-rater reliability) is insufficiently detailed. Because the central claims consist of specific fractions (8/8, 6/8, 7/8) derived from these interviews, the absence of this information prevents verification that the reported patterns are not artifacts of task framing or analysis choices.
  2. [Findings and Discussion] The extrapolation from N=8 experts performing CLIP-specific debugging to general practitioner reasoning about activation steering is load-bearing for the paper's broader utility claim, yet the manuscript provides no additional validation (e.g., larger sample, different models, or quantitative measures) and only limited caveats regarding selection bias and task specificity. This weakens the link between the observed behaviors and the asserted shift toward actionable interpretability.
minor comments (2)
  1. [Abstract and Section 3] SAE is introduced without an initial expansion or reference to its standard definition (Sparse Autoencoder), which may hinder readability for readers outside the immediate subfield.
  2. The paper would benefit from a table summarizing participant demographics, task completion times, or strategy frequencies to make the N=8 results more transparent at a glance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The feedback identifies two key areas for improvement: greater methodological transparency in the user study and stronger caveats around generalizability. We agree on both points and will revise the manuscript accordingly. Below we respond point-by-point to the major comments, indicating the changes we will make.

Point-by-point responses
  1. Referee: [Section 4] User Study / Methodology: The description of the semi-structured interview protocol, participant selection criteria, exact debugging tasks, interview guide, qualitative coding process, and any bias-mitigation steps (e.g., inter-rater reliability) is insufficiently detailed. Because the central claims consist of specific fractions (8/8, 6/8, 7/8) derived from these interviews, the absence of this information prevents verification that the reported patterns are not artifacts of task framing or analysis choices.

    Authors: We agree that the current description of the methodology is insufficient for independent verification. In the revised manuscript we will substantially expand Section 4 to include: (1) the complete semi-structured interview protocol and the full interview guide with example questions; (2) explicit participant selection criteria, including recruitment channels, years of experience in ML interpretability, and prior exposure to vision-language models; (3) precise descriptions of the two debugging tasks, including the images, target concepts, and success criteria given to participants; (4) the qualitative analysis procedure, specifying the thematic coding approach, how the 8/8, 6/8, and 7/8 counts were derived, and any inter-rater reliability assessment (we will report Cohen’s kappa, illustrated in the sketch after these responses, or describe the consensus process used); and (5) bias-mitigation steps such as pilot interviews, question ordering to avoid leading participants, and steps taken to reduce confirmation bias during coding. These additions will allow readers to evaluate whether the observed patterns could be artifacts of task framing or analysis choices. revision: yes

  2. Referee: [Findings and Discussion] The extrapolation from N=8 experts performing CLIP-specific debugging to general practitioner reasoning about activation steering is load-bearing for the paper's broader utility claim, yet the manuscript provides no additional validation (e.g., larger sample, different models, or quantitative measures) and only limited caveats regarding selection bias and task specificity. This weakens the link between the observed behaviors and the asserted shift toward actionable interpretability.

    Authors: We acknowledge that the manuscript’s language occasionally implies broader applicability than the N=8, CLIP-specific evidence strictly supports. While qualitative studies of this size are common in human-centered XAI for generating initial insights, we agree that stronger caveats and clearer scoping are needed. In the revision we will: (1) expand the Limitations subsection to explicitly address selection bias (participants were drawn from a convenience sample of interpretability researchers), task specificity (debugging tasks were limited to CLIP on natural-image classification), and the absence of quantitative or cross-model validation; (2) revise the Findings and Discussion sections to use more qualified language (e.g., “in this study, all participants…” rather than generalizing to “practitioners”); and (3) add a forward-looking paragraph outlining concrete next steps for larger-scale or quantitative follow-up studies. We maintain that the observed shift from inspection to intervention-based reasoning is a substantive finding for the studied setting and provides actionable design implications, but we will no longer present it as a general claim without further evidence. revision: partial
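
Since the rebuttal offers Cohen's kappa as one way to report inter-rater reliability, here is a minimal sketch of the computation with invented coder labels (the label set and values are purely illustrative, not data from the study):

```python
from collections import Counter

# Invented labels purely for illustration: two coders tagging the same six
# interview excerpts with the debugging strategy they describe.
coder_a = ["suppress", "amplify", "suppress", "inspect", "suppress", "inspect"]
coder_b = ["suppress", "suppress", "suppress", "inspect", "suppress", "amplify"]

n = len(coder_a)
p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Chance agreement from each coder's marginal label frequencies.
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
labels = set(coder_a) | set(coder_b)
p_expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"observed {p_observed:.2f}, expected {p_expected:.2f}, kappa {kappa:.2f}")
```

Kappa discounts the raw agreement rate by the agreement two coders would reach by chance given their marginal label frequencies, which is why it is a stronger reliability report than percent agreement alone.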

Circularity Check

0 steps flagged

No circularity: empirical interview study with no derivations or self-referential predictions

Full rationale

The paper reports qualitative findings from semi-structured interviews (N=8) on a specific SAE+steering workflow for CLIP debugging tasks. It contains no mathematical equations, fitted parameters, predictions, or derivation chains. All central claims (e.g., 8/8 participants shifting to intervention-based testing, 7/8 using component suppression) are directly grounded in new interview data rather than reducing to prior author work, self-definitions, or fitted inputs. Self-citations, if present, are not load-bearing for the reported observations. As a human-centered empirical investigation, the study stands on its own data rather than on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical human-computer interaction study with no mathematical derivations. No free parameters, axioms, or invented entities are introduced; the workflow builds on existing SAE and activation steering techniques evaluated through new qualitative data.

pith-pipeline@v0.9.0 · 5491 in / 1288 out tokens · 77246 ms · 2026-05-10T14:55:24.712764+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1] Reduan Achtibat, Maximilian Dreyer, Ilona Eisenbraun, Sebastian Bosse, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. From attribution maps to human-understandable explanations through concept relevance propagation. Nature Machine Intelligence, 5(9):1006–1019, 2023.
  2. [2] Shun-ichi Amari. Backpropagation and stochastic gradient descent method. Neurocomputing, 5(4):185–196, 1993.
  3. [3] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.
  4. [4] Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–16, 2021.
  5. [5] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6541–6549, 2017.
  6. [6] Vaishak Belle and Ioannis Papantonis. Principles and practice of explainable machine learning. Frontiers in Big Data, 4:688969, 2021.
  7. [7] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.
  8. [8] Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, and Himabindu Lakkaraju. Towards unifying interpretability and control: Evaluation via intervention. arXiv preprint arXiv:2411.04430, 2024.
  9. [9] Carrie J Cai, Emily Reif, Narayan Hegde, Jason Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Greg S Corrado, Martin C Stumpe, et al. Human-centered tools for coping with imperfect algorithms during medical decision-making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–14, 2019.
  10. [10] Lawrence Chan, Adria Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing: A method for rigorously testing interpretability hypotheses. In AI Alignment Forum, page 19.
  11. [11] Victoria Clarke and Virginia Braun. Thematic analysis. The Journal of Positive Psychology, 12(3):297–298, 2017.
  12. [12] Arun Das and Paul Rad. Opportunities and challenges in explainable artificial intelligence (XAI): A survey. arXiv preprint arXiv:2006.11371, 2020.
  13. [13] Diaoulé Diallo, Katharina Dworatzyk, Sophie Jentzsch, Peer Schütt, Sabine Theis, and Tobias Hecking. The effectiveness of style vectors for steering large language models: A human evaluation. IEEE Access, 13:191443–191457, 2025.
  14. [14] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
  15. [15] Maximilian Dreyer, Jim Berend, Tobias Labarta, Johanna Vielhaben, Thomas Wiegand, Sebastian Lapuschkin, and Wojciech Samek. Mechanistic understanding and validation of large AI models with SemanticLens. Nature Machine Intelligence, 7(9):1572–1585, 2025.
  16. [16] Maximilian Dreyer, Lorenz Hufe, Jim Berend, Thomas Wiegand, Sebastian Lapuschkin, and Wojciech Samek. From what to how: Attributing CLIP's latent components reveals unexpected semantic reliance. arXiv preprint arXiv:2505.20229, 2025.
  17. [17] David W Eccles and Güler Arsal. The think aloud method: What is it and how do I use it? Qualitative Research in Sport, Exercise and Health, 9(4):514–531, 2017.
  18. [18] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.
  19. [19] M. Beatrice Fazi. Beyond human: Deep learning, explainability and representation. Theory, Culture & Society, 38(7):55–77, 2021.
  20. [20] Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021.
  21. [21] Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, et al. Causal abstraction: A theoretical foundation for mechanistic interpretability. Journal of Machine Learning Research, 26(83):1–64, 2025.
  22. [22] Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael Lepori, and Lucas Dixon. Who's asking? User personas and the mechanics of latent misalignment. Advances in Neural Information Processing Systems, 37:125967–126003, 2024.
  23. [23] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.
  24. [24] Riccardo Guidotti. Counterfactual explanations and how to find them: Literature review and benchmarking. Data Mining and Knowledge Discovery, 38(5):2770–2824, 2024.
  25. [25] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys, 51(5):93:1–93:42, 2018.
  26. [26] Andreas Holzinger, Georg Langs, Helmut Denk, Kurt Zatloukal, and Heimo Müller. Causability and explainability of artificial intelligence in medicine. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(4):e1312, 2019.
  27. [27] Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, and Neel Nanda. RelP: Faithful and efficient circuit discovery in language models via relevance patching. arXiv preprint arXiv:2508.21258, 2025.
  28. [28] Sonia Joseph, Praneet Suresh, Ethan Goldfarb, Lorenz Hufe, Yossi Gandelsman, Robert Graham, Danilo Bzdok, Wojciech Samek, and Blake Aaron Richards. Steering CLIP's vision transformer with sparse autoencoders. arXiv preprint arXiv:2504.08729, 2025.
  29. [29] Bernard Keenan and Kacper Sokol. Mind the gap! Bridging explainable artificial intelligence and human understanding with Luhmann's functional theory of communication. arXiv preprint arXiv:2302.03460, 2023.
  30. [30] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pages 2668–2677. PMLR, 2018.
  31. [31] Jenia Kim, Henry Maathuis, and Danielle Sent. Human-centered evaluation of explainable AI applications: A systematic review. Frontiers in Artificial Intelligence, 7:1456486, 2024.
  32. [32] Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Samuel J Gershman, and Finale Doshi-Velez. Human evaluation of models built for interpretability. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, pages 59–67, 2019.
  33. [33] Haoxuan Li, Zhen Wen, Qiqi Jiang, Chenxiao Li, Yuwei Wu, Yuchen Yang, Yiyao Wang, Xiuqi Huang, Minfeng Zhu, and Wei Chen. ConceptViz: A visual analytics approach for exploring concepts in large language models. IEEE Transactions on Visualization and Computer Graphics, 2025.
  34. [34] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30, 2017.
  35. [35] Gennie Mansi, Julia Kim, and Mark Riedl. Evaluating actionability in explainable AI. arXiv preprint arXiv:2601.20086, 2026.
  36. [36] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
  37. [37] Richard Meyes, Melanie Lu, Constantin Waubert De Puiseau, and Tobias Meisen. Ablation studies in artificial neural networks. arXiv preprint arXiv:1901.08644, 2019.
  38. [38] Raphaël Millière and Cameron Buckner. Interventionist methods for interpreting deep neural networks. In Neurocognitive Foundations of Mind. Routledge, 2025.
  39. [39] Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, et al. The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability. arXiv e-prints, arXiv–2408, 2024.
  40. [40] Sidra Naveed, Gunnar Stevens, and Dean Robin-Kern. An overview of the empirical evaluation of explainable AI (XAI): A comprehensive guideline for user-centered evaluation in XAI. Applied Sciences, 14(23):11288, 2024.
  41. [41] Konstantinos Nikiforidis, Alkiviadis Kyrtsoglou, Thanasis Vafeiadis, Thanasis Kotsiopoulos, Alexandros Nizamis, Dimosthenis Ioannidis, Konstantinos Votis, Dimitrios Tzovaras, and Panagiotis Sarigiannidis. Enhancing transparency and trust in AI-powered manufacturing: A survey of explainable AI (XAI) applications in smart manufacturing in the era of …
  42. [42] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024.001, 2020.
  43. [43] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.
  44. [44] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  45. [45] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
  46. [46] Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, 2024.
  47. [47] Roy Rinberg, Usha Bhalla, Igor Shilov, and Rohit Gandikota. RippleBench: Capturing ripple effects by leveraging existing knowledge repositories. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025.
  48. [48] Terrence J Sejnowski and Charles R Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1(1):145–168, 1987.
  49. [49] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pages 3145–3153. PMLR, 2017.
  50. [50] Paul Smolensky. Neural and conceptual interpretation of PDP models. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 2:390–431, 1986.
  51. [51] Xiwei Xuan, Ziquan Deng, Hsuan-Tien Lin, Zhaodan Kong, and Kwan-Liu Ma. SUNY: A visual interpretation framework for convolutional neural networks from a necessary and sufficient perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8371–8376, 2024.
  52. [52] Xiwei Xuan, Ziquan Deng, Hsuan-Tien Lin, and Kwan-Liu Ma. SLIM: Spuriousness mitigation with minimal human annotations. In European Conference on Computer Vision, pages 215–231. Springer, 2024.
  53. [53] Xiwei Xuan, Jorge Piazentin Ono, Liang Gou, Kwan-Liu Ma, and Liu Ren. AttributionScanner: A visual analytics system for model validation with metadata-free slice finding. IEEE Transactions on Visualization and Computer Graphics.
  54. [54] Xinyuan Yan, Xiwei Xuan, Jorge Piazentin Ono, Jiajing Guo, Vikram Mohanty, Shekar Arvind Kumar, Liang Gou, Bei Wang, and Liu Ren. VisLix: An XAI framework for validating vision models with slice discovery and analysis. In Computer Graphics Forum, page e70125. Wiley Online Library, 2025.
  55. [55] Yue Yang, Mona Gandhi, Yufei Wang, Yifan Wu, Michael S Yao, Chris Callison-Burch, James Gee, and Mark Yatskar. A textbook remedy for domain shifts: Knowledge priors for medical image analysis. In Advancements in Medical Foundation Models: Explainability, Robustness, Security, and Beyond, 2024.
  56. [56] Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042, 2023.
  57. [57] Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, et al. Locate, steer, and improve: A practical survey of actionable mechanistic interpretability in large language models. arXiv preprint arXiv:2601.14004, 2026.
  58. [58] Yunfeng Zhang, Q Vera Liao, and Rachel KE Bellamy. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 295–305, 2020.
  59. [59] Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology, 15(2):1–38, 2024.