
Localizing Model Behavior with Path Patching

Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, Aryaman Arora · 2023 · cs.LG · arXiv 2304.05969

Mixed citation behavior. Most common role is background (50%).

24 Pith papers citing it



abstract

Localizing behaviors of neural networks to a subset of the network's components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses expressing that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.


citation-role summary

background: 3 · method: 3

citation-polarity summary

background: 3 · use method: 3

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Introduces Causal Functional Signatures grounded in causal evidence and ILP-learned architectural signatures to enable explicit, comparable, and portable mechanistic claims across model scales.

Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.

In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification

cs.LG · 2026-05-08 · conditional · novelty 7.0

In-context learning binds model outputs to the demonstrated label tokens as an exhaustive vocabulary, overriding semantic plausibility and causing fixation even with homogeneous or nonsense labels.

Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

cs.LG · 2026-04-13 · unverdicted · novelty 7.0 · 2 refs

The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.

Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Transformer represents but does not causally transmit staged algorithmic intermediates for base-digit extraction, diverging from probe predictions.

Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

Instructions Shape Production of Language, not Processing

cs.CL · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.

Patch-Effect Graph Kernels for LLM Interpretability

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape descriptors and raw baselines on GPT-2 Small.

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

cs.AI · 2026-04-20 · conditional · novelty 6.0

Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.

Automated Attention Pattern Discovery at Scale in Large Language Models

cs.LG · 2026-04-04 · unverdicted · novelty 6.0

AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.

Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

An empirical audit identifies a strong SAE feature correlate for GPT-2 small failures on 'keys' prompts in the IOI task, performs ablation and baseline controls showing it is not causal, and presents the audit pipeline as the primary contribution.

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

How to use and interpret activation patching

cs.LG · 2024-04-23 · accept · novelty 5.0

Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

cs.LG · 2023-09-27 · unverdicted · novelty 5.0

Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

cs.CL · 2026-05-19

Language-Switching Triggers Take a Latent Detour Through Language Models

cs.CL · 2026-05-18

High-Dimensional Statistics: Reflections on Progress and Open Problems

math.ST · 2026-05-06

How Language Models Process Negation

cs.CL · 2026-05-04

citing papers explorer

Showing 24 of 24 citing papers.

WriteSAE: Sparse Autoencoders for Recurrent State · cs.LG · 2026-05-12 · unverdicted · none · ref 61 · 3 links
From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach · cs.LG · 2026-05-20 · unverdicted · none · ref 46
Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs · cs.CL · 2026-05-10 · unverdicted · none · ref 5
In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification · cs.LG · 2026-05-08 · conditional · none · ref 15
Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training · cs.CL · 2026-05-07 · unverdicted · none · ref 35
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts · cs.LG · 2026-04-13 · unverdicted · none · ref 4 · 2 links
Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer · cs.LG · 2026-05-21 · unverdicted · none · ref 10
Dual-Pathway Circuits of Object Hallucination in Vision-Language Models · cs.CV · 2026-05-13 · unverdicted · none · ref 7
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces · cs.LG · 2026-05-12 · unverdicted · none · ref 69
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel · cs.AI · 2026-05-12 · unverdicted · none · ref 16
Instructions Shape Production of Language, not Processing · cs.CL · 2026-05-11 · unverdicted · none · ref 112 · 2 links
Patch-Effect Graph Kernels for LLM Interpretability · cs.AI · 2026-05-07 · unverdicted · none · ref 4
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks · cs.AI · 2026-04-20 · conditional · none · ref 27
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs · cs.AI · 2026-04-15 · unverdicted · none · ref 4
Automated Attention Pattern Discovery at Scale in Large Language Models · cs.LG · 2026-04-04 · unverdicted · none · ref 10
Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification · cs.LG · 2026-05-21 · unverdicted · none · ref 17
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes · cs.AI · 2026-05-07 · unverdicted · none · ref 23
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models · cs.CL · 2026-01-20 · unverdicted · none · ref 88
How to use and interpret activation patching · cs.LG · 2024-04-23 · accept · none · ref 10
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods · cs.LG · 2023-09-27 · unverdicted · none · ref 84
Where Does Authorship Signal Emerge in Encoder-Based Language Models? · cs.CL · 2026-05-19 · unreviewed · ref 6
Language-Switching Triggers Take a Latent Detour Through Language Models · cs.CL · 2026-05-18 · unreviewed · ref 13
High-Dimensional Statistics: Reflections on Progress and Open Problems · math.ST · 2026-05-06 · unreviewed · ref 33
How Language Models Process Negation · cs.CL · 2026-05-04 · unreviewed · ref 47
