Coactivation of sparse autoencoder features reveals causal semantic modules for concepts and relations in LLMs that can be ablated or amplified to produce predictable and counterfactual changes in outputs.
Transformers: State-of-the-art natural language processing
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models
Coactivation of sparse autoencoder features reveals causal semantic modules for concepts and relations in LLMs that can be ablated or amplified to produce predictable and counterfactual changes in outputs.