pith. machine review for the scientific record.

arxiv: 2605.12809 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Recognition: unknown

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords influence functions · token-level attribution · sparse autoencoders · latent mediation · large language models · model auditing · Jacobian-vector products

The pith

A latent mediation method using sparse autoencoders delivers reliable token-level influence attribution for LLM predictions on any task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a way to trace which specific tokens in the training data drive an LLM's output by first mapping the model's activations into a space of approximately independent latent features. Influence is computed in that latent space, where it does not decompose additively across tokens, and is then mapped back to the original input tokens through activation patterns and Jacobian-vector products. This removes the token-independence assumption that restricted earlier influence-function work to autoregressive settings. Experiments on medical benchmarks show the method returns sparse, human-readable sets of tokens that jointly shape a prediction. The result supports auditing and accountability when LLMs are used in high-stakes settings.

Core claim

Attaching sparse autoencoders to any layer of a pretrained LLM produces a basis of approximately independent latent features; influence is then calculated over these features and propagated back to the input space via token activation patterns and Jacobian-vector products, yielding token-level attributions that work for general prediction tasks rather than being restricted to autoregressive settings.

What carries the argument

Sparse autoencoders attached to LLM layers that learn an approximately independent basis of latent features, combined with Jacobian-vector products to propagate non-decomposable latent attributions back through token activation patterns.
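As a concrete reading of that machinery, here is a minimal sketch of how latent-level influence scores could be propagated back to tokens with Jacobian-vector products: differentiate the SAE encoder along each token's own activation pattern and project the resulting change in the latents onto the latent influence vector. The interfaces (upstream, sae_encode, a pooled latent vector per example) and the choice of tangent are assumptions made for illustration, not the paper's implementation.

# Hedged sketch: propagate latent-level influence scores to tokens via JVPs.
# `upstream` (token embeddings -> per-token activations at the SAE layer) and
# `sae_encode` (activations -> pooled sparse latent vector) are assumed interfaces.
import torch
from torch.func import jvp

def token_attributions(upstream, sae_encode, input_embeds, latent_influence):
    acts = upstream(input_embeds)                     # (T, d) token activations at the chosen layer
    scores = torch.zeros(acts.shape[0], device=acts.device)
    for t in range(acts.shape[0]):
        tangent = torch.zeros_like(acts)
        tangent[t] = acts[t]                          # perturb only token t, along its own activation pattern
        _, dz = jvp(sae_encode, (acts,), (tangent,))  # directional change of the (m,) latent vector
        scores[t] = torch.dot(dz, latent_influence)   # project onto the latent influence scores
    return scores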

Load-bearing premise

Sparse autoencoders learn a basis of approximately independent latent features whose influence can be propagated back to tokens via Jacobian-vector products without introducing new biases.

What would settle it

Remove the tokens the method flags as influential from the training set, retrain the model, and verify whether the original prediction changes more than when the same number of randomly chosen tokens are removed.
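A hedged sketch of that settling experiment follows, with train_model, mask_tokens, and predict_prob as hypothetical stand-ins for the retraining and evaluation pipeline; the claim is supported if the prediction shift from removing flagged tokens clearly exceeds the shift from removing the same number of random tokens.

# Hypothetical interfaces only: `train_model(dataset)` retrains and returns a model,
# `mask_tokens(dataset, positions)` drops the given (example, token) positions,
# `predict_prob(model, example)` returns the probability of the original prediction.
import random

def prediction_shift(train_set, test_example, flagged, train_model, mask_tokens, predict_prob, seed=0):
    base = predict_prob(train_model(train_set), test_example)

    # Counterfactual 1: drop the tokens the method flagged as influential.
    ablated = mask_tokens(train_set, flagged)
    shift_flagged = abs(base - predict_prob(train_model(ablated), test_example))

    # Counterfactual 2: drop an equal number of randomly chosen tokens.
    rng = random.Random(seed)
    all_positions = [(i, t) for i, ex in enumerate(train_set) for t in range(len(ex["tokens"]))]
    random_flags = rng.sample(all_positions, k=len(flagged))
    shift_random = abs(base - predict_prob(train_model(mask_tokens(train_set, random_flags)), test_example))

    return shift_flagged, shift_random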

Figures

Figures reproduced from arXiv: 2605.12809 by Kyra Gan, Promit Ghosal, Shixing Yu.

Figure 1. Pipeline overview. Overview of RepInfLLM. A domain-specific LLM is first finetuned, then SAEs are swept over intermediate layers (25%–75%) to select a representative latent space. During inference, the selected SAE is inserted inline to map both training and test instances into shared sparse latents, enabling influence attribution directly in representation space. The prediction follows the standard forward…

Figure 2. Framework overview. Traditional influence functions operate in the input space, assuming token independence and decomposable losses. Our method introduces a sparse autoencoder at an intermediate layer, splitting the model into upstream and downstream parts. Influence is then computed at the representation level using JVPs, enabling stable per-feature attributions and linking test predictions to interpretable…

Figure 3. Necessity and sufficiency tests on OpenbookQA. For necessity, we rank and remove top-…

Figure 4. CommonsenseQA necessity (remove top-k features) and sufficiency (keep top-k features) tests comparing influence-selected features to baselines. With stable rank 1.17 and 2.0% near-orthogonal pairs, most of the features are highly entangled with each other, whereas SAE latents with high latent dimensions are substantially more disentangled, with stable rank 25.02 and 98.67% near-orthogonal…

Figure 5. Token-level influence visualizations for a representative OpenBookQA test question (top), …
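The necessity and sufficiency tests in Figures 3–4 admit a simple operational reading; the sketch below shows one way such a test could be run by masking SAE latents at inference time. The decode_and_predict interface and the choice to rank features by absolute influence are assumptions, not the authors' protocol.

# Illustrative sketch of a necessity/sufficiency test over SAE latents:
# ablate or keep the top-k influence-ranked features and measure the change in the
# predicted probability of the original answer. `decode_and_predict` is assumed to map
# a latent vector back through the downstream model to that probability.
import torch

def necessity_sufficiency(latents, influence, k, decode_and_predict):
    top_k = torch.topk(influence.abs(), k).indices
    base = decode_and_predict(latents)

    removed = latents.clone()
    removed[top_k] = 0.0                              # necessity: zero out the top-k features
    necessity_drop = base - decode_and_predict(removed)

    kept = torch.zeros_like(latents)
    kept[top_k] = latents[top_k]                      # sufficiency: keep only the top-k features
    sufficiency_gap = base - decode_and_predict(kept)

    return necessity_drop, sufficiency_gap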
Original abstract

A critical step for reliable large language model (LLM) use in healthcare is to attribute predictions to their training data, akin to a medical case study. This requires token-level precision: pinpointing not just which training examples influence a decision, but which tokens within them are responsible. While influence functions offer a principled framework for this, prior work is restricted to autoregressive settings and relies on an implicit assumption of token independence, rendering their identified influences unreliable. We introduce a flexible framework that infers token-level influence through a latent mediation approach for general prediction tasks. Our method attaches sparse autoencoders to any layer of a pretrained LLM to learn a basis of approximately independent latent features. Unlike prior methods where influence decomposes additively across tokens, influence computed over latent features is inherently non-decomposable. To address this, we introduce a novel method using Jacobian-vector products. Token-level influence is obtained by propagating latent attributions back to the input space via token activation patterns. We scale our approach using efficient inverse-Hessian approximations. Experiments on medical benchmarks show our approach identifies sparse, interpretable sets of tokens that jointly influence predictions. Our framework enhances trust and enables model auditing, generalizing to high-stakes domains requiring transparent and accountable decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a framework for token-level influence attribution in LLMs for general (non-autoregressive) prediction tasks. It attaches sparse autoencoders to arbitrary layers to extract approximately independent latent features, computes influence over these latents, and propagates the attributions back to input tokens via Jacobian-vector products applied to token activation patterns. This is claimed to correct the token-independence assumption of prior influence-function work. The method is scaled with inverse-Hessian approximations and is evaluated on medical benchmarks, where it reportedly identifies sparse, interpretable token sets that jointly influence predictions.

Significance. If the Jacobian propagation step can be shown to recover faithful token-level attributions without re-introducing bias from non-linearities or SAE reconstruction error, the approach would provide a practical way to audit LLM decisions at token granularity in high-stakes domains. The use of SAEs to mediate non-decomposable influence is a conceptually clean idea that could generalize beyond the medical setting.

major comments (3)
  1. [Method (Jacobian propagation paragraph)] The central technical step—propagating latent attributions to tokens via Jacobian-vector products—rests on the assumption that first-order derivatives suffice to capture the mapping from SAE latents through non-linear LLM layers. The skeptic note correctly flags that partial polysemanticity or strong non-linearities at the chosen layer could distort attributions; the manuscript must supply either a theoretical bound on the approximation error or an empirical check (e.g., comparison against exact influence on a small model or synthetic data with known ground-truth tokens).
  2. [Experiments section] The abstract states that experiments on medical benchmarks demonstrate identification of sparse, interpretable token sets, yet supplies no quantitative metrics, ablation results, or error analysis. Without these, it is impossible to judge whether the method actually improves upon prior influence functions or merely reproduces their limitations under a different parameterization.
  3. [Scaling and implementation details] The inverse-Hessian approximation is listed among the free parameters; its concrete implementation (e.g., LiSSA, conjugate-gradient, or damping schedule) and sensitivity analysis must be reported, because any instability in the Hessian inverse directly affects the reliability of the latent-level influence scores before the JVP step.
minor comments (2)
  1. [Method] Notation for the Jacobian-vector product and the precise definition of “token activation patterns” should be introduced with an equation rather than prose only.
  2. [SAE attachment paragraph] The abstract claims the SAE latents are “approximately independent”; a quantitative measure of residual correlation (e.g., average pairwise cosine similarity of decoder weights) would strengthen this claim.
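For the requested quantitative measure, one plausible computation on the SAE decoder matrix, consistent with the stable-rank and near-orthogonality figures quoted in Figure 4's caption, is sketched below; the 0.05 orthogonality tolerance and the use of decoder rows as feature directions are assumptions, not details drawn from the paper.

# Sketch of residual-correlation metrics for SAE features: mean pairwise cosine
# similarity, fraction of near-orthogonal feature pairs, and stable rank of the decoder.
import torch

def disentanglement_metrics(decoder, ortho_tol=0.05):
    # decoder: (m, d) matrix whose rows are assumed to be SAE feature directions.
    W = torch.nn.functional.normalize(decoder, dim=1)
    cos = W @ W.T                                     # (m, m) pairwise cosine similarities
    off_diag = cos[~torch.eye(cos.shape[0], dtype=torch.bool, device=cos.device)]
    mean_cos = off_diag.abs().mean()
    near_orthogonal = (off_diag.abs() < ortho_tol).float().mean()
    # Stable rank: squared Frobenius norm over squared spectral norm.
    stable_rank = (decoder.norm() ** 2) / (torch.linalg.matrix_norm(decoder, ord=2) ** 2)
    return mean_cos.item(), near_orthogonal.item(), stable_rank.item()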

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight key areas where additional theoretical and empirical support will strengthen the presentation. We address each major comment below and will incorporate the requested clarifications and analyses in the revised version.

Point-by-point responses
  1. Referee: [Method (Jacobian propagation paragraph)] The central technical step—propagating latent attributions to tokens via Jacobian-vector products—rests on the assumption that first-order derivatives suffice to capture the mapping from SAE latents through non-linear LLM layers. The skeptic note correctly flags that partial polysemanticity or strong non-linearities at the chosen layer could distort attributions; the manuscript must supply either a theoretical bound on the approximation error or an empirical check (e.g., comparison against exact influence on a small model or synthetic data with known ground-truth tokens).

    Authors: We agree that the first-order JVP approximation requires explicit justification. In the revised manuscript we will add a dedicated subsection deriving a first-order error bound under the assumption that SAE latents are sufficiently sparse and that activation perturbations remain small. We will also include an empirical validation: on a 125M-parameter model with synthetic data containing known ground-truth token influences, we compare the JVP-based attributions against exact (Hessian-free) influence values and report the resulting correlation and top-k recovery rates. revision: yes

  2. Referee: [Experiments section] The abstract states that experiments on medical benchmarks demonstrate identification of sparse, interpretable token sets, yet supplies no quantitative metrics, ablation results, or error analysis. Without these, it is impossible to judge whether the method actually improves upon prior influence functions or merely reproduces their limitations under a different parameterization.

    Authors: The original submission emphasized qualitative case studies on medical benchmarks to illustrate interpretability. We acknowledge that quantitative support is necessary. The revised version will add: (i) precision@K and recall@K against available ground-truth token sets, (ii) ablation tables varying SAE sparsity and layer choice, and (iii) direct comparison against standard influence-function baselines on the same benchmarks, including error bars over multiple random seeds. revision: yes

  3. Referee: [Scaling and implementation details] The inverse-Hessian approximation is listed among the free parameters; its concrete implementation (e.g., LiSSA, conjugate-gradient, or damping schedule) and sensitivity analysis must be reported, because any instability in the Hessian inverse directly affects the reliability of the latent-level influence scores before the JVP step.

    Authors: We will expand the implementation appendix to specify that we employ the LiSSA estimator with 10 iterations, damping factor 0.01, and a fixed random seed for reproducibility. A new sensitivity plot will show how latent influence scores vary across damping values in {0.001, 0.01, 0.1} and iteration counts in {5, 10, 20}, confirming that the reported token attributions remain stable within the chosen operating range. revision: yes
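For reference, the LiSSA estimator the response commits to is typically a recursion of the following form for inverse-Hessian-vector products. This is a generic sketch matching the quoted hyperparameters (10 iterations, damping 0.01), with hvp an assumed callable returning a stochastic Hessian-vector product on a sampled batch and scale a bound on the Hessian norm; it is not the authors' code.

# Generic LiSSA-style inverse-HVP sketch; `hvp(v)` is assumed to return H @ v.
import torch

def lissa_inverse_hvp(hvp, v, iters=10, damping=0.01, scale=25.0):
    # Iterates h_j = v + (1 - damping) * h_{j-1} - hvp(h_{j-1}) / scale and returns
    # h_J / scale, which approximates (scale * damping * I + H)^{-1} v when ||H|| < scale.
    h = v.clone()
    for _ in range(iters):
        h = v + (1.0 - damping) * h - hvp(h) / scale
    return h / scale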

Circularity Check

0 steps flagged

Extends influence functions with SAE latents and JVPs; no derivation step reduces the claimed output to its fitted inputs by construction

full rationale

The framework attaches SAEs to learn approximately independent latents then propagates attributions via Jacobian-vector products. This builds directly on established influence-function machinery without any equation or step equating the final token-level influence to a fitted parameter or self-cited uniqueness result. The abstract and description contain no self-definitional, fitted-input-called-prediction, or load-bearing self-citation patterns that collapse the claimed output to the inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; full derivation and validation details unavailable. The central claim rests on the domain assumption that SAEs produce sufficiently independent features.

free parameters (1)
  • inverse-Hessian approximation
    Used to scale computation; specific form and accuracy not detailed in abstract.
axioms (1)
  • domain assumption: Sparse autoencoders learn a basis of approximately independent latent features from any LLM layer
    Invoked to justify non-decomposable influence computation over latents rather than tokens.

pith-pipeline@v0.9.0 · 5521 in / 1215 out tokens · 38563 ms · 2026-05-14T20:14:06.178610+00:00 · methodology

discussion (0)

