Activation Differences Reveal Backdoors: A Comparison of SAE Architectures
Pith reviewed 2026-05-11 01:51 UTC · model grok-4.3
The pith
Differential SAEs isolate backdoors in language models far better than crosscoders by capturing activation differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a controlled experiment with a year-based trigger for SQL injection in the SmolLM2-360M model, Diff-SAE achieves a Backdoor Isolation Score of 0.40 with precision of 1.0 and zero false positive rate across layers 14, 18, 22, and 26 and both LoRA and full fine-tuning. Crosscoders yield BIS below 0.02 in most conditions. The paper concludes that backdoors manifest as directional activation shifts, which difference-based representations detect more effectively than sparse feature approaches.
What carries the argument
Diff-SAE, the differential sparse autoencoder that reconstructs the difference between activations on backdoor-triggered inputs and clean inputs to isolate the backdoor direction.
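The mechanism can be sketched in a few lines. Below is a minimal toy forward pass of a differential SAE, assuming paired clean/triggered activations; the shapes, initialization, and `l1_coeff` value are illustrative assumptions, not the paper's configuration (the paper trains on SmolLM2-360M residual activations).

```python
import numpy as np

rng = np.random.default_rng(0)

def diff_sae_forward(h_clean, h_trigger, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Forward pass of a differential SAE: reconstruct the activation
    *difference* between triggered and clean inputs rather than the raw
    activations. Returns (loss, sparse_codes)."""
    delta = h_trigger - h_clean                  # (batch, d_model) difference signal
    z = np.maximum(delta @ W_enc + b_enc, 0.0)   # ReLU sparse codes, (batch, d_sae)
    recon = z @ W_dec + b_dec                    # (batch, d_model) reconstruction
    mse = np.mean((recon - delta) ** 2)          # reconstruction term
    sparsity = l1_coeff * np.mean(np.abs(z).sum(axis=1))  # L1 sparsity term
    return mse + sparsity, z

# Toy shapes (hypothetical, far smaller than the real model).
d_model, d_sae, batch = 16, 64, 8
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

h_clean = rng.normal(0, 1, (batch, d_model))
backdoor_dir = rng.normal(0, 1, d_model)   # simulated directional shift
h_trigger = h_clean + backdoor_dir         # trigger adds a fixed direction

loss, codes = diff_sae_forward(h_clean, h_trigger, W_enc, b_enc, W_dec, b_dec)
```

Because the input is already the difference, any feature the autoencoder learns is by construction a feature of the behavioral change, which is what makes the approach well matched to backdoors that act as directional shifts.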
If this is right
- Backdoors are better isolated using difference-based methods than standard sparse autoencoders.
- Full-rank fine-tuning produces cleaner backdoor signals than LoRA fine-tuning.
- The performance advantage of Diff-SAE holds across multiple layers in the model.
- Zero false positives in Diff-SAE mean the isolated features are highly specific to the backdoor.
Where Pith is reading between the lines
- Testing the same method on backdoors with different trigger types, such as non-numeric patterns, would show if the directional shift pattern is general.
- These difference-based tools could be combined with other interpretability methods to create more robust safety checks for deployed models.
- Since crosscoders fail, standard sparse features learned on normal data may miss manipulated behaviors.
Load-bearing premise
Backdoors primarily appear as directional shifts in model activations rather than as distinct sparse features that can be isolated without differences.
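The premise can be illustrated with a toy simulation (all data synthetic; the planted `backdoor_dir` and shift magnitude are assumptions for illustration, not measurements from the paper): if a trigger adds a fixed direction to activations, a single mean-difference vector already separates triggered from clean inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

# Assumed premise: the backdoor adds a fixed direction to activations.
backdoor_dir = rng.normal(0, 1, d)
backdoor_dir /= np.linalg.norm(backdoor_dir)

clean = rng.normal(0, 1, (200, d))
triggered = rng.normal(0, 1, (200, d)) + 3.0 * backdoor_dir

# A single mean-difference vector suffices to separate the two sets.
direction = triggered.mean(axis=0) - clean.mean(axis=0)
direction /= np.linalg.norm(direction)

proj_clean = clean @ direction
proj_trig = triggered @ direction
threshold = (proj_clean.mean() + proj_trig.mean()) / 2
acc = ((proj_trig > threshold).mean() + (proj_clean <= threshold).mean()) / 2
```

If backdoors instead surfaced as distinct sparse features with no net directional component, this mean-difference detector would fail while a sparse dictionary method could still succeed, which is exactly the contrast the premise asserts.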
What would settle it
Finding a backdoor where crosscoders achieve high isolation scores while Diff-SAE does not, or where Diff-SAE has high false positives on a new trigger type.
Figures
Original abstract
Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares Crosscoders and Differential SAEs (Diff-SAE) for isolating backdoor-related features in fine-tuned language models. Using a controlled year-triggered SQL injection backdoor in SmolLM2-360M under both LoRA and full-rank fine-tuning, it reports that Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across layers 14, 18, 22, and 26, while Crosscoders yield BIS values below 0.02 in most conditions. The authors conclude that backdoors manifest as directional activation shifts rather than sparse feature activations, favoring difference-based representations for detection.
Significance. If the quantitative comparison holds under the reported controls, the work supplies a useful empirical benchmark for SAE architectures in backdoor detection tasks within mechanistic interpretability. The controlled experimental design and layer-wise results provide concrete data points that could inform tool development for AI safety monitoring. However, the broader significance for general backdoor detection is constrained by the narrow scope of the tested trigger and model.
Major comments (2)
- [Abstract] The claim that backdoors 'manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective' rests entirely on results from a single year-triggered SQL injection backdoor in SmolLM2-360M. This does not demonstrate the pattern for other mechanisms (e.g., token-sequence triggers or output-targeted behaviors), so the architectural superiority cannot be separated from the activation signature of this specific backdoor.
- [Methods] The reported BIS, precision, and FPR values are central to the performance claims, yet the abstract provides no definitions of these metrics, no error bars, no statistical tests, and no details on how false positives are controlled or how backdoor isolation is quantified. These omissions make it impossible to assess whether the perfect scores are robust or sensitive to implementation choices.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our comparison of Crosscoders and Differential SAEs for backdoor isolation. We address each major comment below and have revised the manuscript to improve clarity and qualify our claims appropriately.
Point-by-point responses
-
Referee: [Abstract] The claim that backdoors 'manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective' rests entirely on results from a single year-triggered SQL injection backdoor in SmolLM2-360M. This does not demonstrate the pattern for other mechanisms (e.g., token-sequence triggers or output-targeted behaviors), so the architectural superiority cannot be separated from the activation signature of this specific backdoor.
Authors: We agree that the original abstract phrasing overgeneralized from a single controlled backdoor (year-triggered SQL injection in SmolLM2-360M). The results demonstrate a clear performance gap for this activation signature, but we cannot claim the pattern holds for all backdoor mechanisms. In the revised manuscript we have updated the abstract to read: 'in this controlled setting, our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations more effective for detection in this case.' We have also added a limitations paragraph in the discussion explicitly noting the need for validation on token-sequence triggers, output-targeted behaviors, and additional models. revision: yes
-
Referee: [Methods] The reported BIS, precision, and FPR values are central to the performance claims, yet the abstract provides no definitions of these metrics, no error bars, no statistical tests, and no details on how false positives are controlled or how backdoor isolation is quantified. These omissions make it impossible to assess whether the perfect scores are robust or sensitive to implementation choices.
Authors: We acknowledge that the abstract omitted concise definitions and robustness details. The full definitions of the Backdoor Isolation Score (BIS), precision, and false-positive rate (FPR), along with the exact quantification procedure (comparing feature activations on clean vs. triggered inputs and measuring isolation via precision at perfect recall), appear in Section 3.2 of the manuscript. To address the referee's concern we have added a one-sentence definition of BIS and a pointer to the methods in the revised abstract. Regarding error bars and statistical tests: because SAE training and evaluation on the fixed dataset are deterministic, we report layer-wise consistency (layers 14, 18, 22, 26) rather than stochastic variance; we have now included these per-layer values explicitly in the results table and added a brief note on the deterministic nature of the pipeline. We have also expanded the methods section with a paragraph detailing false-positive control (thresholding at zero activation on clean inputs) and the exact BIS formula. revision: yes
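The evaluation procedure described above (flagging features that are silent on clean inputs but fire on triggered ones) can be sketched concretely. This is a hedged reconstruction from the rebuttal's description, not the paper's code; the BIS formula shown is one plausible reading (recall over the known backdoor features in a controlled setup) and is labeled as an assumption, since the paper's exact formula is in its Section 3.2.

```python
import numpy as np

def backdoor_metrics(acts_clean, acts_trigger, true_mask):
    """Isolation metrics per the rebuttal's description: a feature is
    flagged as backdoor-related if it never fires on clean inputs but
    fires on triggered ones. true_mask marks features known (in a
    controlled experiment) to encode the backdoor."""
    fires_clean = acts_clean.max(axis=0) > 0      # active on any clean input
    fires_trigger = acts_trigger.max(axis=0) > 0  # active on any triggered input
    flagged = fires_trigger & ~fires_clean        # trigger-specific features

    tp = int(np.sum(flagged & true_mask))
    fp = int(np.sum(flagged & ~true_mask))
    fn = int(np.sum(~flagged & true_mask))
    tn = int(np.sum(~flagged & ~true_mask))

    precision = tp / max(tp + fp, 1)
    fpr = fp / max(fp + tn, 1)
    # Assumed BIS reading: recall over true backdoor features
    # (the paper's actual formula is defined in its Section 3.2).
    bis = tp / max(tp + fn, 1)
    return precision, fpr, bis

# Toy example: 10 features, feature 3 is the planted backdoor feature.
acts_clean = np.zeros((4, 10))
acts_clean[:, :3] = 1.0      # features 0-2 fire on clean inputs
acts_clean[:, 4:] = 0.5      # features 4-9 fire on clean inputs
acts_trigger = acts_clean.copy()
acts_trigger[:, 3] = 2.0     # feature 3 fires only when the trigger is present

true_mask = np.zeros(10, dtype=bool)
true_mask[3] = True

precision, fpr, bis = backdoor_metrics(acts_clean, acts_trigger, true_mask)
# -> precision 1.0, fpr 0.0, bis 1.0 on this toy example
```

The zero-activation threshold on clean inputs is what makes perfect precision attainable in the reported results: any feature with even slight clean-input activity is excluded from the flagged set.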
Circularity Check
Purely empirical comparison; no derivations or self-referential reductions
Full rationale
The paper reports experimental results from comparing Diff-SAE and Crosscoders on one controlled SQL-injection backdoor in SmolLM2-360M under two fine-tuning regimes. All reported metrics (BIS, precision, FPR) are direct measurements from activation data; no equations, fitted parameters renamed as predictions, or self-citations are used to derive the central claims. The post-hoc suggestion that backdoors manifest as directional shifts is an interpretation of the observed performance gap rather than a load-bearing derivation that reduces to the inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Backdoors manifest as directional activation shifts detectable by difference-based SAE representations.
Reference graph
Works this paper leans on
- [1] E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, et al., "Sleeper agents: Training deceptive LLMs that persist through safety training," arXiv preprint arXiv:2401.05566, 2024.
- [2] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey, "Sparse autoencoders find highly interpretable features in language models," arXiv preprint arXiv:2309.08600, 2023.
- [3] T. Bricken, et al., "Towards monosemanticity: Decomposing language models with dictionary learning," Transformer Circuits Thread, 2023.
- [4] J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, A. Templeton, et al., "Crosscoders: Sparse autoencoders for cross-model feature analysis," Transformer Circuits Thread, 2024.
- [5] J. Dai, C. Chen, and Y. Li, "A backdoor attack against LSTM-based text classification systems," IEEE Access, vol. 7, pp. 138872–138878, 2019.
- [6] X. Chen, A. Salem, A. N. Bhagoji, M. Backes, and S. Gong, "BadNL: Backdoor attacks against NLP models with semantic-preserving improvements," arXiv preprint arXiv:2006.01043, 2021.
- [7] A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, et al., "Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet," Transformer Circuits Thread, 2024.
- [8] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, et al., "Training language models to follow instructions with human feedback," NeurIPS, 2022.
- [9] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, et al., "Constitutional AI: Harmlessness from AI feedback," arXiv preprint arXiv:2212.08073, 2022.
- [10] T. Gu, B. Dolan-Gavitt, and S. Garg, "BadNets: Identifying vulnerabilities in the machine learning model supply chain," arXiv preprint arXiv:1708.06733, 2017.
- [11] Y. Qi, S. Xie, and Y. Li, "ONION: A simple and effective defense against textual backdoor attacks," EMNLP, 2021.
- [12] K. Liu, B. Dolan-Gavitt, and S. Garg, "Fine-pruning: Defending against backdooring attacks on deep neural networks," RAID 2018, Lecture Notes in Computer Science, vol. 11050, Springer, 2018.
- [13] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, "Neural cleanse: Identifying and mitigating backdoor attacks in neural networks," IEEE Symposium on Security and Privacy (SP), 2019.
- [14] K. Meng, D. Bau, A. Andonian, and Y. Belinkov, "Locating and editing factual associations in GPT," NeurIPS, 2022.
- [15] HuggingFace, "SmolLM2: Compact language models," 2024. [Online]. Available: https://huggingface.co/HuggingFaceTB/SmolLM2-360M
- [16] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, et al., "LoRA: Low-rank adaptation of large language models," ICLR, 2022.
- [17] N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, et al., "Softmax linear units," Transformer Circuits Thread, 2022.
- [18] J. Minder, et al., "Overcoming sparsity artifacts in Crosscoders to interpret chat-tuning," arXiv preprint arXiv:2504.02922, 2025.