pith. sign in

hub Mixed citations

Localizing Model Behavior with Path Patching

Mixed citation behavior. Most common role is background (50%).

24 Pith papers citing it
Background 50% of classified citations
abstract

Localizing behaviors of neural networks to a subset of the network's components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses expressing that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.

hub tools

citation-role summary

background 3 method 3

citation-polarity summary

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Instructions Shape Production of Language, not Processing

cs.CL · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.

Patch-Effect Graph Kernels for LLM Interpretability

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape descriptors and raw baselines on GPT-2 Small.

Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.

How to use and interpret activation patching

cs.LG · 2024-04-23 · accept · novelty 5.0

Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.

citing papers explorer

Showing 24 of 24 citing papers.