Activation Differences Reveal Backdoors: A Comparison of SAE Architectures
Pith reviewed 2026-05-11 01:51 UTC · model grok-4.3
The pith
Differential SAEs isolate backdoors in language models far better than crosscoders by capturing activation differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a controlled experiment with a year-based trigger for SQL injection in the SmolLM2-360M model, Diff-SAE achieves a Backdoor Isolation Score of 0.40 with precision of 1.0 and zero false positive rate across layers 14, 18, 22, and 26 and both LoRA and full fine-tuning. Crosscoders yield BIS below 0.02 in most conditions. The paper concludes that backdoors manifest as directional activation shifts, which difference-based representations detect more effectively than sparse feature approaches.
What carries the argument
Diff-SAE, the differential sparse autoencoder that reconstructs the difference between activations on backdoor-triggered inputs and clean inputs to isolate the backdoor direction.
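The mechanism can be sketched in a few lines. Below is a minimal toy forward pass of a differential SAE, assuming paired clean/triggered activations; the shapes, initialization, and `l1_coeff` value are illustrative assumptions, not the paper's configuration (the paper trains on SmolLM2-360M residual activations).

```python
import numpy as np

rng = np.random.default_rng(0)

def diff_sae_forward(h_clean, h_trigger, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Forward pass of a differential SAE: reconstruct the activation
    *difference* between triggered and clean inputs rather than the raw
    activations. Returns (loss, sparse_codes)."""
    delta = h_trigger - h_clean                  # (batch, d_model) difference signal
    z = np.maximum(delta @ W_enc + b_enc, 0.0)   # ReLU sparse codes, (batch, d_sae)
    recon = z @ W_dec + b_dec                    # (batch, d_model) reconstruction
    mse = np.mean((recon - delta) ** 2)          # reconstruction term
    sparsity = l1_coeff * np.mean(np.abs(z).sum(axis=1))  # L1 sparsity term
    return mse + sparsity, z

# Toy shapes (hypothetical, far smaller than the real model).
d_model, d_sae, batch = 16, 64, 8
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

h_clean = rng.normal(0, 1, (batch, d_model))
backdoor_dir = rng.normal(0, 1, d_model)   # simulated directional shift
h_trigger = h_clean + backdoor_dir         # trigger adds a fixed direction

loss, codes = diff_sae_forward(h_clean, h_trigger, W_enc, b_enc, W_dec, b_dec)
```

Because the input is already the difference, any feature the autoencoder learns is by construction a feature of the behavioral change, which is what makes the approach well matched to backdoors that act as directional shifts.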
If this is right
- Backdoors are better isolated using difference-based methods than standard sparse autoencoders.
- Full-rank fine-tuning produces cleaner backdoor signals than LoRA fine-tuning.
- The performance advantage of Diff-SAE holds across multiple layers in the model.
- Zero false positives in Diff-SAE mean the isolated features are highly specific to the backdoor.
Where Pith is reading between the lines
- Testing the same method on backdoors with different trigger types, such as non-numeric patterns, would show if the directional shift pattern is general.
- These difference-based tools could be combined with other interpretability methods to create more robust safety checks for deployed models.
- Since crosscoders fail, standard sparse features learned on normal data may miss manipulated behaviors.
Load-bearing premise
Backdoors primarily appear as directional shifts in model activations rather than as distinct sparse features that can be isolated without differences.
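The premise can be illustrated with a toy simulation (all data synthetic; the planted `backdoor_dir` and shift magnitude are assumptions for illustration, not measurements from the paper): if a trigger adds a fixed direction to activations, a single mean-difference vector already separates triggered from clean inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Assumed premise: the backdoor adds a fixed direction to activations.
backdoor_dir = rng.normal(0, 1, d)
backdoor_dir /= np.linalg.norm(backdoor_dir)

clean = rng.normal(0, 1, (200, d))
triggered = rng.normal(0, 1, (200, d)) + 3.0 * backdoor_dir

# A single mean-difference vector suffices to separate the two sets.
direction = triggered.mean(axis=0) - clean.mean(axis=0)
direction /= np.linalg.norm(direction)

proj_clean = clean @ direction
proj_trig = triggered @ direction
threshold = (proj_clean.mean() + proj_trig.mean()) / 2
acc = ((proj_trig > threshold).mean() + (proj_clean <= threshold).mean()) / 2
```

If backdoors instead surfaced as distinct sparse features with no net directional component, this mean-difference detector would fail while a sparse dictionary method could still succeed, which is exactly the contrast the premise asserts.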
What would settle it
Finding a backdoor where crosscoders achieve high isolation scores while Diff-SAE does not, or where Diff-SAE has high false positives on a new trigger type.
Figures
Original abstract
Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares Crosscoders and Differential SAEs (Diff-SAE) for isolating backdoor-related features in fine-tuned language models. Using a controlled year-triggered SQL injection backdoor in SmolLM2-360M under both LoRA and full-rank fine-tuning, it reports that Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across layers 14, 18, 22, and 26, while Crosscoders yield BIS values below 0.02 in most conditions. The authors conclude that backdoors manifest as directional activation shifts rather than sparse feature activations, favoring difference-based representations for detection.
Significance. If the quantitative comparison holds under the reported controls, the work supplies a useful empirical benchmark for SAE architectures in backdoor detection tasks within mechanistic interpretability. The controlled experimental design and layer-wise results provide concrete data points that could inform tool development for AI safety monitoring. However, the broader significance for general backdoor detection is constrained by the narrow scope of the tested trigger and model.
Major comments (2)
- [Abstract] The claim that backdoors 'manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective' rests entirely on results from a single year-triggered SQL injection backdoor in SmolLM2-360M. This does not demonstrate the pattern for other mechanisms (e.g., token-sequence triggers or output-targeted behaviors), so the architectural superiority cannot be separated from the activation signature of this specific backdoor.
- [Methods] The reported BIS, precision, and FPR values are central to the performance claims, yet the abstract provides no definitions of these metrics, no error bars, no statistical tests, and no details on how false positives are controlled or how backdoor isolation is quantified. These omissions make it impossible to assess whether the perfect scores are robust or sensitive to implementation choices.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our comparison of Crosscoders and Differential SAEs for backdoor isolation. We address each major comment below and have revised the manuscript to improve clarity and qualify our claims appropriately.
Point-by-point responses
-
Referee: [Abstract] The claim that backdoors 'manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective' rests entirely on results from a single year-triggered SQL injection backdoor in SmolLM2-360M. This does not demonstrate the pattern for other mechanisms (e.g., token-sequence triggers or output-targeted behaviors), so the architectural superiority cannot be separated from the activation signature of this specific backdoor.
Authors: We agree that the original abstract phrasing overgeneralized from a single controlled backdoor (year-triggered SQL injection in SmolLM2-360M). The results demonstrate a clear performance gap for this activation signature, but we cannot claim the pattern holds for all backdoor mechanisms. In the revised manuscript we have updated the abstract to read: 'in this controlled setting, our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations more effective for detection in this case.' We have also added a limitations paragraph in the discussion explicitly noting the need for validation on token-sequence triggers, output-targeted behaviors, and additional models. revision: yes
-
Referee: [Methods] The reported BIS, precision, and FPR values are central to the performance claims, yet the abstract provides no definitions of these metrics, no error bars, no statistical tests, and no details on how false positives are controlled or how backdoor isolation is quantified. These omissions make it impossible to assess whether the perfect scores are robust or sensitive to implementation choices.
Authors: We acknowledge that the abstract omitted concise definitions and robustness details. The full definitions of the Backdoor Isolation Score (BIS), precision, and false-positive rate (FPR), along with the exact quantification procedure (comparing feature activations on clean vs. triggered inputs and measuring isolation via precision at perfect recall), appear in Section 3.2 of the manuscript. To address the referee's concern we have added a one-sentence definition of BIS and a pointer to the methods in the revised abstract. Regarding error bars and statistical tests: because SAE training and evaluation on the fixed dataset are deterministic, we report layer-wise consistency (layers 14, 18, 22, 26) rather than stochastic variance; we have now included these per-layer values explicitly in the results table and added a brief note on the deterministic nature of the pipeline. We have also expanded the methods section with a paragraph detailing false-positive control (thresholding at zero activation on clean inputs) and the exact BIS formula. revision: yes
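The evaluation procedure described above (flagging features that are silent on clean inputs but fire on triggered ones) can be sketched concretely. This is a hedged reconstruction from the rebuttal's description, not the paper's code; the BIS formula shown is one plausible reading (recall over the known backdoor features in a controlled setup) and is labeled as an assumption, since the paper's exact formula is in its Section 3.2.

```python
import numpy as np

def backdoor_metrics(acts_clean, acts_trigger, true_mask):
    """Isolation metrics per the rebuttal's description: a feature is
    flagged as backdoor-related if it never fires on clean inputs but
    fires on triggered ones. true_mask marks features known (in a
    controlled experiment) to encode the backdoor."""
    fires_clean = acts_clean.max(axis=0) > 0      # active on any clean input
    fires_trigger = acts_trigger.max(axis=0) > 0  # active on any triggered input
    flagged = fires_trigger & ~fires_clean        # trigger-specific features

    tp = int(np.sum(flagged & true_mask))
    fp = int(np.sum(flagged & ~true_mask))
    fn = int(np.sum(~flagged & true_mask))
    tn = int(np.sum(~flagged & ~true_mask))

    precision = tp / max(tp + fp, 1)
    fpr = fp / max(fp + tn, 1)
    # Assumed BIS reading: recall over true backdoor features
    # (the paper's actual formula is defined in its Section 3.2).
    bis = tp / max(tp + fn, 1)
    return precision, fpr, bis

# Toy example: 10 features, feature 3 is the planted backdoor feature.
acts_clean = np.zeros((4, 10))
acts_clean[:, :3] = 1.0      # features 0-2 fire on clean inputs
acts_clean[:, 4:] = 0.5      # features 4-9 fire on clean inputs
acts_trigger = acts_clean.copy()
acts_trigger[:, 3] = 2.0     # feature 3 fires only when the trigger is present

true_mask = np.zeros(10, dtype=bool)
true_mask[3] = True

precision, fpr, bis = backdoor_metrics(acts_clean, acts_trigger, true_mask)
# -> precision 1.0, fpr 0.0, bis 1.0 on this toy example
```

The zero-activation threshold on clean inputs is what makes perfect precision attainable in the reported results: any feature with even slight clean-input activity is excluded from the flagged set.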
Circularity Check
Purely empirical comparison; no derivations or self-referential reductions
Full rationale
The paper reports experimental results from comparing Diff-SAE and Crosscoders on one controlled SQL-injection backdoor in SmolLM2-360M under two fine-tuning regimes. All reported metrics (BIS, precision, FPR) are direct measurements from activation data; no equations, fitted parameters renamed as predictions, or self-citations are used to derive the central claims. The post-hoc suggestion that backdoors manifest as directional shifts is an interpretation of the observed performance gap rather than a load-bearing derivation that reduces to the inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Backdoors manifest as directional activation shifts detectable by difference-based SAE representations.
Reference graph
Works this paper leans on
- [1] E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, et al., "Sleeper agents: Training deceptive LLMs that persist through safety training," arXiv preprint arXiv:2401.05566, 2024.
- [2] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey, "Sparse autoencoders find highly interpretable features in language models," arXiv preprint arXiv:2309.08600, 2023.
- [3] T. Bricken, et al., "Towards monosemanticity: Decomposing language models with dictionary learning," Transformer Circuits Thread, 2023.
- [4] J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, A. Templeton, et al., "Crosscoders: Sparse autoencoders for cross-model feature analysis," Transformer Circuits Thread, 2024.
- [5] J. Dai, C. Chen, and Y. Li, "A backdoor attack against LSTM-based text classification systems," IEEE Access, vol. 7, pp. 138872–138878, 2019.
- [6] X. Chen, A. Salem, A. N. Bhagoji, M. Backes, and S. Gong, "BadNL: Backdoor attacks against NLP models with semantic-preserving improvements," arXiv preprint arXiv:2006.01043, 2021.
- [7] A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, et al., "Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet," Transformer Circuits Thread, 2024.
- [8] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, et al., "Training language models to follow instructions with human feedback," NeurIPS, 2022.
- [9] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, et al., "Constitutional AI: Harmlessness from AI feedback," arXiv preprint arXiv:2212.08073, 2022.
- [10] T. Gu, B. Dolan-Gavitt, and S. Garg, "BadNets: Identifying vulnerabilities in the machine learning model supply chain," arXiv preprint arXiv:1708.06733, 2017.
- [11] Y. Qi, S. Xie, and Y. Li, "ONION: A simple and effective defense against textual backdoor attacks," EMNLP, 2021.
- [12] K. Liu, B. Dolan-Gavitt, and S. Garg, "Fine-pruning: Defending against backdooring attacks on deep neural networks," RAID 2018, Lecture Notes in Computer Science, vol. 11050, Springer, 2018.
- [13] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, "Neural cleanse: Identifying and mitigating backdoor attacks in neural networks," IEEE Symposium on Security and Privacy (SP), 2019.
- [14] K. Meng, D. Bau, A. Andonian, and Y. Belinkov, "Locating and editing factual associations in GPT," NeurIPS, 2022.
- [15] HuggingFace, "SmolLM2: Compact language models," 2024. [Online]. Available: https://huggingface.co/HuggingFaceTB/SmolLM2-360M
- [16] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, et al., "LoRA: Low-rank adaptation of large language models," ICLR, 2022.
- [17] N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, et al., "Softmax linear units," Transformer Circuits Thread, 2022.
- [18] J. Minder, et al., "Overcoming sparsity artifacts in Crosscoders to interpret chat-tuning," arXiv preprint arXiv:2504.02922, 2025.