pith. machine review for the scientific record.

arxiv: 2605.07324 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI · cs.CR · cs.LG

Recognition: no theorem link

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

Sachin Kumar


Pith reviewed 2026-05-11 01:51 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CR · cs.LG
keywords backdoor attacks · sparse autoencoders · differential SAEs · crosscoders · mechanistic interpretability · AI safety · language model fine-tuning

The pith

Differential SAEs isolate backdoors in language models far better than crosscoders by capturing activation differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests two sparse autoencoder architectures for finding backdoors hidden in fine-tuned language models. Diff-SAE, which reconstructs activation differences, isolates the backdoor with a Backdoor Isolation Score of 0.40 and perfect precision in most tests; crosscoders almost completely fail at the same task. The results point to backdoors showing up as directional shifts in model activations rather than as separate sparse features. This matters for building tools that can spot when models have been secretly modified to misbehave on specific triggers.

Core claim

Using a controlled experiment with a year-based trigger for SQL injection in the SmolLM2-360M model, Diff-SAE achieves a Backdoor Isolation Score of 0.40 with precision of 1.0 and zero false positive rate across layers 14, 18, 22, and 26 and both LoRA and full fine-tuning. Crosscoders yield BIS below 0.02 in most conditions. The paper concludes that backdoors manifest as directional activation shifts, which difference-based representations detect more effectively than sparse feature approaches.

What carries the argument

Diff-SAE, the differential sparse autoencoder that reconstructs the difference between activations on backdoor-triggered inputs and clean inputs to isolate the backdoor direction.
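The paper's own implementation is not reproduced here. As a minimal sketch, assuming a standard ReLU encoder with an L1 sparsity penalty (the dimensions, initialization, and synthetic activations below are all hypothetical), a Diff-SAE objective over paired activation differences might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: residual-stream activations from paired clean / triggered
# prompts. The paper's actual model dimensions and training loop are not
# given in this summary, so these shapes are illustrative only.
d_model, d_dict, n_pairs = 64, 256, 512
acts_clean = rng.normal(size=(n_pairs, d_model))
backdoor_dir = rng.normal(size=d_model)
acts_trigger = acts_clean + 0.8 * backdoor_dir  # triggered = clean + shift

# A Diff-SAE reconstructs the *difference* between paired activations,
# so the dictionary only has to explain what fine-tuning changed.
diff = acts_trigger - acts_clean

W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)

def diff_sae_loss(x, l1_coeff=1e-3):
    """Reconstruction + L1 sparsity loss on activation differences."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU sparse code
    x_hat = f @ W_dec                        # linear decoder
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f))
    return recon + sparsity

print(round(diff_sae_loss(diff), 4))
```

In this framing, a crosscoder would instead learn a shared dictionary over both activation sets jointly; the paper's claim is that the subtraction step is what makes the backdoor direction easy to isolate.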

If this is right

  • Backdoors are better isolated using difference-based methods than standard sparse autoencoders.
  • Full-rank fine-tuning produces cleaner backdoor signals than LoRA fine-tuning.
  • The performance advantage of Diff-SAE holds across multiple layers in the model.
  • Zero false positives in Diff-SAE mean the isolated features are highly specific to the backdoor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Testing the same method on backdoors with different trigger types, such as non-numeric patterns, would show if the directional shift pattern is general.
  • These difference-based tools could be combined with other interpretability methods to create more robust safety checks for deployed models.
  • Since crosscoders fail, standard sparse features learned on normal data may miss manipulated behaviors.

Load-bearing premise

Backdoors primarily appear as directional shifts in model activations rather than as distinct sparse features that can be isolated without differences.
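The premise can be illustrated with a toy calculation: if triggered activations really are clean activations plus a fixed directional shift, then even a plain mean-of-differences estimator (not the paper's method, just the simplest difference-based detector, on synthetic data) recovers the planted direction almost exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Hypothetical activations: triggered inputs are clean inputs plus a
# fixed shift along one unit direction, with small independent noise.
clean = rng.normal(size=(200, d_model))
shift = rng.normal(size=d_model)
shift /= np.linalg.norm(shift)
triggered = clean + 1.5 * shift + 0.05 * rng.normal(size=(200, d_model))

# Under the directional-shift premise, averaging paired differences
# cancels the noise and leaves the backdoor direction.
mean_diff = (triggered - clean).mean(axis=0)
mean_diff /= np.linalg.norm(mean_diff)
cosine = float(mean_diff @ shift)
print(cosine > 0.99)  # the estimate aligns with the planted shift
```

If backdoors instead appeared as new sparse features with no consistent direction, this estimator (and by extension any difference-based representation) would lose its advantage, which is exactly the failure mode the falsification test below probes.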

What would settle it

Finding a backdoor where crosscoders achieve high isolation scores while Diff-SAE does not, or where Diff-SAE has high false positives on a new trigger type.

Figures

Figures reproduced from arXiv: 2605.07324 by Sachin Kumar.

Figure 1. Backdoor behavior comparison. The trigger year (2024) … [image not reproduced]
original abstract

Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript compares Crosscoders and Differential SAEs (Diff-SAE) for isolating backdoor-related features in fine-tuned language models. Using a controlled year-triggered SQL injection backdoor in SmolLM2-360M under both LoRA and full-rank fine-tuning, it reports that Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across layers 14, 18, 22, and 26, while Crosscoders yield BIS values below 0.02 in most conditions. The authors conclude that backdoors manifest as directional activation shifts rather than sparse feature activations, favoring difference-based representations for detection.

Significance. If the quantitative comparison holds under the reported controls, the work supplies a useful empirical benchmark for SAE architectures in backdoor detection tasks within mechanistic interpretability. The controlled experimental design and layer-wise results provide concrete data points that could inform tool development for AI safety monitoring. However, the broader significance for general backdoor detection is constrained by the narrow scope of the tested trigger and model.

major comments (2)
  1. [Abstract] The claim that backdoors 'manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective' rests entirely on results from a single year-triggered SQL injection backdoor in SmolLM2-360M. This does not demonstrate the pattern for other mechanisms (e.g., token-sequence triggers or output-targeted behaviors), so the architectural superiority cannot be separated from the activation signature of this specific backdoor.
  2. [Methods] The reported BIS, precision, and FPR values are central to the performance claims, yet the abstract provides no definitions of these metrics, no error bars, no statistical tests, and no details on how false positives are controlled or how the backdoor isolation is quantified. These omissions make it impossible to assess whether the perfect scores are robust or sensitive to implementation choices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our comparison of Crosscoders and Differential SAEs for backdoor isolation. We address each major comment below and have revised the manuscript to improve clarity and qualify our claims appropriately.

point-by-point responses
  1. Referee: [Abstract] The claim that backdoors 'manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective' rests entirely on results from a single year-triggered SQL injection backdoor in SmolLM2-360M. This does not demonstrate the pattern for other mechanisms (e.g., token-sequence triggers or output-targeted behaviors), so the architectural superiority cannot be separated from the activation signature of this specific backdoor.

    Authors: We agree that the original abstract phrasing overgeneralized from a single controlled backdoor (year-triggered SQL injection in SmolLM2-360M). The results demonstrate a clear performance gap for this activation signature, but we cannot claim the pattern holds for all backdoor mechanisms. In the revised manuscript we have updated the abstract to read: 'in this controlled setting, our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations more effective for detection in this case.' We have also added a limitations paragraph in the discussion explicitly noting the need for validation on token-sequence triggers, output-targeted behaviors, and additional models. revision: yes

  2. Referee: [Methods] The reported BIS, precision, and FPR values are central to the performance claims, yet the abstract provides no definitions of these metrics, no error bars, no statistical tests, and no details on how false positives are controlled or how the backdoor isolation is quantified. These omissions make it impossible to assess whether the perfect scores are robust or sensitive to implementation choices.

    Authors: We acknowledge that the abstract omitted concise definitions and robustness details. The full definitions of the Backdoor Isolation Score (BIS), precision, and false-positive rate (FPR), along with the exact quantification procedure (comparing feature activations on clean vs. triggered inputs and measuring isolation via precision at perfect recall), appear in Section 3.2 of the manuscript. To address the referee's concern we have added a one-sentence definition of BIS and a pointer to the methods in the revised abstract. Regarding error bars and statistical tests: because SAE training and evaluation on the fixed dataset are deterministic, we report layer-wise consistency (layers 14, 18, 22, 26) rather than stochastic variance; we have now included these per-layer values explicitly in the results table and added a brief note on the deterministic nature of the pipeline. We have also expanded the methods section with a paragraph detailing false-positive control (thresholding at zero activation on clean inputs) and the exact BIS formula. revision: yes
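The selection rule the rebuttal describes — flag features that fire on triggered inputs but never on clean ones — can be sketched as a toy calculation. The synthetic activations, the designated ground-truth feature, and the metric names below are hypothetical; the paper's exact BIS formula is not given in this summary:

```python
import numpy as np

rng = np.random.default_rng(2)
n_feat, n_clean, n_trig = 100, 50, 50

# Hypothetical SAE feature activations; feature 0 plays the backdoor role:
# silent on clean inputs, active on triggered ones. All other features
# fire on both input types.
acts_clean = np.abs(rng.normal(size=(n_clean, n_feat)))
acts_trig = np.abs(rng.normal(size=(n_trig, n_feat)))
acts_clean[:, 0] = 0.0
acts_trig[:, 0] = np.abs(rng.normal(loc=3.0, size=n_trig))

# Zero-activation threshold on clean inputs: flag features that ever
# fire on triggered inputs but never fire on clean inputs.
fires_trig = (acts_trig > 0).any(axis=0)
fires_clean = (acts_clean > 0).any(axis=0)
flagged = np.flatnonzero(fires_trig & ~fires_clean)

backdoor_features = {0}  # ground truth in this toy setup
tp = len(backdoor_features & set(flagged))
precision = tp / max(len(flagged), 1)
fpr = (len(flagged) - tp) / (n_feat - len(backdoor_features))
print(flagged, precision, fpr)
```

With a perfectly separated feature as here, the rule yields precision 1.0 and FPR 0.0; the referee's point is that real SAE features need not separate this cleanly, which is why the thresholding details matter.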

Circularity Check

0 steps flagged

Purely empirical comparison; no derivations or self-referential reductions

full rationale

The paper reports experimental results from comparing Diff-SAE and Crosscoders on one controlled SQL-injection backdoor in SmolLM2-360M under two fine-tuning regimes. All reported metrics (BIS, precision, FPR) are direct measurements from activation data; no equations, fitted parameters renamed as predictions, or self-citations are used to derive the central claims. The post-hoc suggestion that backdoors manifest as directional shifts is an interpretation of the observed performance gap rather than a load-bearing derivation that reduces to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that backdoor behavior appears as consistent directional activation differences rather than as new sparse features, plus the unstated assumption that the controlled SQL-injection trigger is representative of real backdoors.

axioms (1)
  • domain assumption Backdoors manifest as directional activation shifts detectable by difference-based SAE representations
    Invoked in the final sentence of the abstract to explain why Diff-SAE succeeds.

pith-pipeline@v0.9.0 · 5552 in / 1276 out tokens · 32167 ms · 2026-05-11T01:51:16.662677+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 4 internal anchors

  1. [1]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, et al., “Sleeper agents: Training deceptive LLMs that persist through safety training,” arXiv preprint arXiv:2401.05566, 2024

  2. [2]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey, “Sparse autoencoders find highly interpretable features in language models,” arXiv preprint arXiv:2309.08600, 2023

  3. [3]

    T. Bricken, et al., “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning,” Transformer Circuits Thread, 2023

  4. [4]

    Crosscoders: Sparse autoencoders for cross-model feature analysis,

    J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, A. Templeton, et al., “Crosscoders: Sparse autoencoders for cross-model feature analysis,” Transformer Circuits Thread, 2024

  5. [5]

    A backdoor attack against LSTM-based text classification systems,

    J. Dai, C. Chen, and Y. Li, “A backdoor attack against LSTM-based text classification systems,” IEEE Access, vol. 7, pp. 138872–138878, 2019

  6. [6]

    BadNL: Backdoor attacks against NLP models with semantic-preserving improvements,

    X. Chen, A. Salem, A. N. Bhagoji, M. Backes, and S. Gong, “BadNL: Backdoor attacks against NLP models with semantic-preserving improvements,” arXiv preprint arXiv:2006.01043, 2021

  7. [7]

    Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet,

    A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, et al., “Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet,” Transformer Circuits Thread, 2024

  8. [8]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, et al., “Training language models to follow instructions with human feedback,” NeurIPS, 2022

  9. [9]

    Constitutional AI: Harmlessness from AI Feedback

    Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022

  10. [10]

    BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

    T. Gu, B. Dolan-Gavitt, and S. Garg, “BadNets: Identifying vulnerabilities in the machine learning model supply chain,” arXiv preprint arXiv:1708.06733, 2017

  11. [11]

    ONION: A simple and effective defense against textual backdoor attacks,

    Y. Qi, S. Xie, and Y. Li, “ONION: A simple and effective defense against textual backdoor attacks,” EMNLP, 2021

  12. [12]

    Fine-pruning: Defending against backdooring attacks on deep neural networks,

    K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: Defending against backdooring attacks on deep neural networks,” RAID 2018. Lecture Notes in Computer Science(), vol 11050. Springer, 2018

  13. [13]

    Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,

    B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” Symposium on Security and Privacy (SP), 2019

  14. [14]

    Locating and editing factual associations in GPT,

    K. Meng, D. Bau, A. Andonian, and Y. Belinkov, “Locating and editing factual associations in GPT,” NeurIPS, 2022

  15. [15]

    SmolLM2: Compact language models,

    HuggingFace, “SmolLM2: Compact language models,” 2024. [Online]. Available: https://huggingface.co/HuggingFaceTB/SmolLM2-360M

  16. [16]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, et al., “LoRA: Low-rank adaptation of large language models,” ICLR, 2022

  17. [17]

    Softmax linear units,

    N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, et al., “Softmax linear units,” Transformer Circuits Thread, 2022

  18. [18]

    Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

    J. Minder, et al., “Overcoming sparsity artifacts in crosscoders to interpret chat-tuning,” arXiv preprint arXiv:2504.02922, 2025