pith. machine review for the scientific record.

arXiv: 2605.07447 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 2 theorem links


Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Daisuke Kawahara, Hao Wang, Lawrence B. Hsieh, Pengfei Wei, Yiqun Sun

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:45 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords sparse autoencoders · adversarial detection · vision-language models · reconstruction objectives · plug-and-play safety · adversarial robustness

The pith

Sparse autoencoders inserted into pretrained vision-language models detect adversarial image attacks using only clean training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a sparse autoencoder trained solely to reconstruct activations from a frozen vision-language model will produce latent features that reliably flag whether an input image has been adversarially perturbed. This detection holds for attacks and image domains never seen during the autoencoder's training. A sympathetic reader cares because the method adds a lightweight safety layer to existing VLMs without requiring adversarial examples, retraining of the base model, or large computational cost. The approach therefore offers a practical route to hardening deployed systems that currently remain vulnerable to such perturbations.

Core claim

By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples.

What carries the argument

The sparse autoencoder module inserted at selected layers of the VLM, trained only to minimize reconstruction error on clean activations.
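
To make that component concrete, here is a minimal PyTorch sketch, not the authors' code: a sparse autoencoder trained only to reconstruct activations hooked from one frozen VLM layer, with an L1 penalty standing in for whatever sparsity constraint the paper actually uses. The layer choice, dimensions, learning rate, and penalty weight are illustrative assumptions, and the activation batches are stand-ins for tensors captured via forward hooks.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # Over-complete linear autoencoder with a ReLU bottleneck; d_act and
        # d_latent are assumed sizes, not taken from the paper.
        def __init__(self, d_act=1024, d_latent=8192):
            super().__init__()
            self.encoder = nn.Linear(d_act, d_latent)
            self.decoder = nn.Linear(d_latent, d_act)

        def forward(self, h):
            z = torch.relu(self.encoder(h))   # sparse latent code
            return self.decoder(z), z

    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    l1_weight = 1e-3                          # sparsity pressure (assumed value)

    # Stand-in for clean activations hooked from a frozen VLM layer; in practice
    # these come from forward hooks on the chosen layer, with VLM weights untouched.
    clean_activation_batches = [torch.randn(64, 1024) for _ in range(10)]

    for h_clean in clean_activation_batches:
        h_hat, z = sae(h_clean)
        loss = ((h_hat - h_clean) ** 2).mean() + l1_weight * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

Nothing adversarial appears anywhere in this loop; that absence is exactly what the paper's central claim turns on.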

If this is right

  • Detection works in in-domain, cross-domain, and cross-attack settings with large gains over baselines in cross-domain cases.
  • Signals from multiple layers can be combined to increase robustness and stability of the detector.
  • The method requires no adversarial training data and adds only minimal inference overhead.
  • The same SAE can serve as a plug-and-play module across different pretrained VLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reconstruction-only training might surface features useful for detecting other forms of input corruption such as natural distribution shifts.
  • Because the SAE is trained independently of the VLM weights, it could be swapped or updated without touching the underlying model.
  • One could test whether the sparse codes also correlate with the magnitude or type of perturbation to enable graded rather than binary alerts.

Load-bearing premise

Sparse features learned solely from reconstruction on clean VLM activations will naturally encode information that distinguishes adversarial perturbations without any exposure to adversarial examples.

What would settle it

A test set of clean and adversarially perturbed images from a new domain or attack type on which the SAE-based classifier performs at chance level.

Figures

Figures reproduced from arXiv: 2605.07447 by Daisuke Kawahara, Hao Wang, Lawrence B. Hsieh, Pengfei Wei, Yiqun Sun.

Figure 1. Notably, our framework requires no additional adversarial training, is fully plug-and-play, …
Figure 2. Preliminary experimental results. Both experiments employ FOA-Attack as the adversarial …
Figure 3. Shared feature overlap across datasets under different attacks, illustrated by Venn diagrams …
Figure 4. Shared feature overlap across attack methods, illustrated by Venn diagrams of the top-256 …
Figure 5. Distributions of the number of activated features for clean and adversarial images, averaged …
Figure 6. F1 vs. number of adversarial samples for feature selection, under cross-domain (Medical → NIPS17) at projection-mlp2.
Original abstract

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAE as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes SAEgis, a lightweight framework for adversarial attack detection in vision-language models (VLMs). It inserts a sparse autoencoder (SAE) module into a pretrained VLM, trains the SAE using only standard reconstruction objectives on clean activations, and uses the resulting sparse latent features to classify inputs as clean or adversarially perturbed. The authors report strong performance across in-domain, cross-domain, and cross-attack settings, with further gains from combining multi-layer signals, and emphasize that the approach requires no additional adversarial training or VLM fine-tuning.

Significance. If the central empirical claims hold, the work offers a practical plug-and-play safety mechanism for VLMs that exploits the inductive bias of SAEs trained solely on clean data. The reported cross-domain generalization improvements and minimal overhead would be valuable for real-world VLM deployments. The novelty of applying SAEs in this manner is noted, though the strength depends on whether the detection mechanism truly avoids any exposure to adversarial examples.

major comments (3)
  1. [§3] §3 (Method): The description of the detection pipeline must clarify whether the final classification step (e.g., threshold on reconstruction error, sparsity statistics, or a learned detector) uses any supervised training on labeled adversarial examples. The abstract asserts 'no additional adversarial training' and that features 'naturally capture' attack signals, but if any component fits parameters on attack data, this contradicts the central claim that clean-only reconstruction suffices.
  2. [§4.2] §4.2 (Cross-domain experiments): The performance gains over baselines are load-bearing for the generalization claim. Specific metrics (e.g., AUROC or F1) and exact baseline implementations must be reported with statistical significance; without them, it is unclear whether the SAE sparsity level alone drives the improvement or whether post-hoc choices in feature selection are involved.
  3. [§4.3] §4.3 (Cross-attack results): The claim that sparse features generalize to unseen attack types rests on the assumption that reconstruction-trained latents separate clean and perturbed manifolds. The paper should include an ablation showing that detection performance degrades gracefully when the SAE is trained on a narrower clean distribution, to test whether the 'natural capture' property holds beyond the reported settings.
minor comments (2)
  1. [Abstract] Abstract and §1: The title uses 'firewalls' metaphorically; a brief clarification of the precise threat model (e.g., whether detection occurs at inference time before VLM processing) would improve readability.
  2. [§5] §5 (Discussion): Add explicit limitations on failure cases, such as very low-magnitude perturbations or domain shifts in the clean training distribution for the SAE.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped strengthen the clarity and empirical rigor of the manuscript. We address each major comment point by point below, with revisions incorporated where the suggestions improve the presentation of our claims.

Point-by-point responses
  1. Referee: §3 (Method): The description of the detection pipeline must clarify whether the final classification step (e.g., threshold on reconstruction error, sparsity statistics, or a learned detector) uses any supervised training on labeled adversarial examples. The abstract asserts 'no additional adversarial training' and that features 'naturally capture' attack signals, but if any component fits parameters on attack data, this contradicts the central claim that clean-only reconstruction suffices.

    Authors: We agree that the method section requires explicit clarification on this point. No supervised training on adversarial examples occurs at any stage. The SAE is trained solely on clean activations using the standard reconstruction objective. Detection relies on simple thresholding of reconstruction error and sparsity statistics in the latent space, with thresholds determined exclusively from a clean validation set. No learned classifier or parameters are fit on attack data. We have revised §3 to include a precise description of the full detection pipeline and to restate that all training and hyperparameter choices use only clean data. revision: yes

  2. Referee: §4.2 (Cross-domain experiments): The performance gains over baselines are load-bearing for the generalization claim. Specific metrics (e.g., AUROC or F1) and exact baseline implementations must be reported with statistical significance; without them, it is unclear whether the SAE sparsity level alone drives the improvement or whether post-hoc choices in feature selection are involved.

    Authors: We accept this critique and have expanded the reporting in the revised §4.2. New tables now provide AUROC and F1 scores (with mean and standard deviation over five independent runs) for SAEgis and all baselines in every cross-domain setting. We include p-values from paired t-tests to establish statistical significance of the reported gains. We also clarify that no post-hoc feature selection is performed; all sparse latents are used as produced by the SAE, and baseline implementations follow the original papers exactly. These additions confirm that the improvements stem from the SAE-induced sparsity rather than implementation choices. revision: yes

  3. Referee: §4.3 (Cross-attack results): The claim that sparse features generalize to unseen attack types rests on the assumption that reconstruction-trained latents separate clean and perturbed manifolds. The paper should include an ablation showing that detection performance degrades gracefully when the SAE is trained on a narrower clean distribution, to test whether the 'natural capture' property holds beyond the reported settings.

    Authors: This suggestion strengthens the evidence for the core mechanism. We have added the requested ablation in the revised §4.3: the SAE is retrained on randomly subsampled clean data at 25%, 50%, and 75% of the original clean training set size. The new results (presented in an additional table) show graceful degradation on unseen attack types, with performance remaining competitive even at the 50% level. This supports that the manifold separation arises from the reconstruction objective applied to a representative clean distribution. We thank the referee for prompting this analysis. revision: yes
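
A minimal sketch of the clean-only calibration described in the first response above, not the authors' code: the detection statistics are per-image reconstruction error and the count of active latents, with thresholds taken from a clean validation set. It reuses the SparseAutoencoder module from the sketch under "What carries the argument"; the 99th-percentile cutoff and the direction of the sparsity test are assumptions, since the rebuttal specifies only that thresholds come from clean data.

    import torch

    @torch.no_grad()
    def detector_stats(sae, h):
        h_hat, z = sae(h)
        recon_err = ((h_hat - h) ** 2).mean(dim=-1)   # per-image reconstruction error
        n_active = (z > 0).float().sum(dim=-1)        # per-image count of active latents
        return recon_err, n_active

    # Calibrate thresholds on clean validation activations only (stand-in tensor here);
    # no adversarial examples are used at any point.
    h_val_clean = torch.randn(256, 1024)
    err_clean, act_clean = detector_stats(sae, h_val_clean)
    tau_err = torch.quantile(err_clean, 0.99)
    tau_act = torch.quantile(act_clean, 0.99)

    def flag_adversarial(h):
        err, act = detector_stats(sae, h)
        # Flag inputs whose statistics fall outside the clean calibration range.
        return (err > tau_err) | (act > tau_act)

If the actual pipeline instead fits a lightweight classifier on the sparse codes, the referee's original concern about where its labels come from still applies; the sketch only illustrates the thresholding variant the rebuttal describes.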

Circularity Check

0 steps flagged

No circularity: empirical method with independent experimental validation

full rationale

The paper proposes inserting an SAE into a pretrained VLM, training it solely with standard reconstruction loss on (clean) activations, and then using the resulting sparse features for downstream classification of adversarial inputs. This is an empirical pipeline whose central claim—that the features 'naturally capture attack-relevant signals'—is tested via experiments on in-domain, cross-domain, and cross-attack settings rather than derived mathematically. No equations or steps reduce a claimed prediction to a fitted parameter or self-citation by construction. The method requires no adversarial examples during SAE training, and results are presented as observed performance rather than forced by definition. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that reconstruction-trained sparse features will capture attack signals as a byproduct, plus standard SAE training assumptions such as appropriate sparsity levels.

free parameters (1)
  • SAE sparsity level
    Controls how many features are active; chosen to enable capture of attack-relevant signals but not specified in the abstract.
axioms (1)
  • domain assumption: Sparse latent features learned via reconstruction on VLM activations naturally encode information relevant to adversarial perturbations.
    This is the load-bearing premise allowing detection without adversarial-specific training or data.
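
The ledger's single free parameter can be made explicit. A sketch under assumptions: if the SAE uses a top-k activation (the paper does not say whether sparsity comes from an L1 penalty, top-k, or something else), the sparsity level is literally the hyperparameter k.

    import torch

    def topk_encode(encoder, h, k=64):
        # Keep only the k largest pre-activations per image; k is the assumed
        # sparsity level, i.e. the number of active features the ledger refers to.
        z = torch.relu(encoder(h))
        vals, idx = torch.topk(z, k, dim=-1)
        return torch.zeros_like(z).scatter_(-1, idx, vals)

    # Usage with the earlier sketch's module, e.g. topk_encode(sae.encoder, h_clean, k=64).

Whichever mechanism the paper uses, the detection statistics above depend on this one dial, which is why the ledger lists it as the only free parameter.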

pith-pipeline@v0.9.0 · 5546 in / 1366 out tokens · 58849 ms · 2026-05-11T01:45:53.628672+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages
