LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
arXiv preprint arXiv:2511.13653 , year =
7 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation assumptions.
MoE expert neurons show lower polysemanticity than dense FFN neurons, widening with sparser routing, and experts specialize in fine-grained tasks like specific linguistic operations, supporting expert-level interpretability.
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
Interpretations of Articles 2(1), 2(6), and 2(8) of the AI Act support applying the regulation to internal AI deployment while allowing for R&D exceptions, with the provisions viewed as complementary.
citing papers explorer
-
Crafting Reversible SFT Behaviors in Large Language Models
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
-
From Mechanistic to Compositional Interpretability
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.
-
Interpretability Can Be Actionable
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
-
Bilinear autoencoders find interpretable manifolds
Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation assumptions.
-
The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
MoE expert neurons show lower polysemanticity than dense FFN neurons, widening with sparser routing, and experts specialize in fine-grained tasks like specific linguistic operations, supporting expert-level interpretability.
-
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
-
Internal Deployment in the AI Act
Interpretations of Articles 2(1), 2(6), and 2(8) of the AI Act support applying the regulation to internal AI deployment while allowing for R&D exceptions, with the provisions viewed as complementary.