pith. sign in

arxiv: 2512.12069 · v3 · submitted 2025-12-12 · 💻 cs.CR · cs.AI· cs.CL· cs.LG

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

Pith reviewed 2026-05-16 22:25 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CLcs.LG
keywords jailbreak detectionlarge vision-language modelsrepresentational contrastive scoringmultimodal attacksanomaly detectioninternal representationsgeneralization to unseen attacks
0
0 comments X

The pith

Internal representations of vision-language models contain signals that a lightweight contrastive score can use to detect jailbreaks even for unseen attack types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that effective jailbreak detection for large vision-language models does not require targeting specific attack patterns or incurring high computational costs. Instead, the strongest safety signals lie in the geometry of the model's own internal representations. A lightweight projection learned on safety-critical layers separates benign inputs from malicious ones, producing a contrastive score that distinguishes true intent from ordinary distribution shifts. This yields two concrete detectors, MCD and KCD, that outperform prior methods when tested on attack types withheld from training. The result matters because evolving multimodal threats render pattern-specific defenses obsolete while over-rejecting benign queries erodes usability.

Core claim

Representational Contrastive Scoring inspects the internal geometry of LVLM representations and learns a lightweight projection that maximally separates benign and malicious inputs in safety-critical layers, yielding a contrastive score that differentiates true malicious intent from mere distribution shift; its instantiations MCD and KCD achieve state-of-the-art performance on an evaluation protocol that tests generalization to unseen attack types.

What carries the argument

Representational Contrastive Scoring (RCS): a framework that applies a learned lightweight projection to the internal representations of safety-critical layers to compute a contrastive score separating benign from malicious inputs.

Load-bearing premise

The most potent safety signals reside inside the LVLM's own internal representations, and a lightweight projection can separate benign and malicious inputs in safety-critical layers without overfitting to the attacks seen during training.

What would settle it

Run MCD or KCD on a fresh collection of multimodal jailbreak attacks never encountered in training and measure whether detection accuracy remains high while false-positive rates on ordinary benign image-text pairs stay low; a sharp drop in either metric would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2512.12069 by Hao Li, Ning Zhang, Peichun Hua, Shanghao Shi, Zhiyuan Yu.

Figure 1
Figure 1. Figure 1: Overview of our detection framework propose a principled, data-driven methodology to pinpoint the optimal layer before training a full detector. Our approach is founded on a central hy￾pothesis: the layers with the highest downstream detection performance are those where benign and malicious prompt representations are most geomet￾rically separable. To quantify this separability, we use the SGXSTest dataset… view at source ↗
Figure 2
Figure 2. Figure 2: Detection performance of KCD vs. SafeMT [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Detection performance of MCD vs. SafeMT [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: PCA visualization revealing the underlying separability structure across different evaluation scenarios. [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Performance comparison of clustering strategies for MCD on layer 16 (LLaVA) and layer 20 (InternVL). [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Score distribution of the original JailDAM method in a simplified evaluation setting. The plot shows that [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Score distribution of JailDAM in our robust evaluation setting, which introduces out-of-distribution (OOD) [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Ablation study on the k-value hyperparameter [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗
Figure 8
Figure 8. Figure 8: Ablation study on the k-value hyperparameter [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Correlation between layer discriminative [PITH_FULL_IMAGE:figures/full_fig_p029_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Correlation between layer discriminative [PITH_FULL_IMAGE:figures/full_fig_p030_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Detection performance of KCD vs. SafeMT [PITH_FULL_IMAGE:figures/full_fig_p033_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Detection performance of MCD vs. SafeMT [PITH_FULL_IMAGE:figures/full_fig_p033_13.png] view at source ↗
read the original abstract

Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse unseen benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere distribution shift. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the internal representations, offering a practical path towards safer LVLM deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Representational Contrastive Scoring (RCS) for detecting jailbreaks in Large Vision-Language Models. It claims that the most potent safety signals lie in the internal geometry of LVLM representations; a lightweight projection learned to separate benign from malicious inputs in safety-critical layers yields a contrastive score (instantiated as MCD and KCD) that achieves state-of-the-art generalization on a protocol that holds out unseen attack types, while remaining computationally lightweight.

Significance. If the central claim holds, the work is significant because it shows that simple, interpretable statistical methods applied directly to internal LVLM activations can outperform both pattern-specific defenses and heavy anomaly detectors on a generalization-focused benchmark. This offers a practical route to efficient, deployable jailbreak detection without requiring attack-specific engineering or large auxiliary models.

major comments (2)
  1. [§3.2] §3.2 (Projection Learning): The optimization objective for the lightweight projection is not fully specified (loss function, supervision signal, regularization, or whether only benign data are used). This detail is load-bearing for the generalization claim, because without it one cannot determine whether the reported separation on unseen attacks reflects invariant safety geometry or partial overlap between the attacks used to fit the projection and those in the test set.
  2. [§5.1] §5.1 (Evaluation Protocol): The paper must explicitly list which attack families are present during projection learning versus those held out for the 'unseen' test split, together with the exact train/test attack-type partition. The current description leaves open the possibility that the SOTA numbers partly reflect distribution overlap rather than true out-of-distribution robustness.
minor comments (2)
  1. [Abstract / §2] The abstract and §2 refer to 'safety-critical layers' without stating the selection criterion (e.g., layer index, activation norm, or empirical validation).
  2. [§3.3] Notation for the contrastive score (Eq. 3 or equivalent) should be accompanied by a short derivation or pseudocode to clarify how the Mahalanobis or k-NN distance is computed after the projection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional methodological and protocol details will strengthen the paper. We address each point below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Projection Learning): The optimization objective for the lightweight projection is not fully specified (loss function, supervision signal, regularization, or whether only benign data are used). This detail is load-bearing for the generalization claim, because without it one cannot determine whether the reported separation on unseen attacks reflects invariant safety geometry or partial overlap between the attacks used to fit the projection and those in the test set.

    Authors: We agree that the optimization objective must be stated explicitly for reproducibility and to support the generalization claims. In the revised manuscript we will add a precise description of the projection learning procedure, including the contrastive loss function, the use of labeled benign and malicious examples as the supervision signal, the regularization terms applied, and confirmation that both classes are used during training. This will make clear that the projection is fit on a training partition of known attacks while evaluation measures performance on held-out attack families. revision: yes

  2. Referee: [§5.1] §5.1 (Evaluation Protocol): The paper must explicitly list which attack families are present during projection learning versus those held out for the 'unseen' test split, together with the exact train/test attack-type partition. The current description leaves open the possibility that the SOTA numbers partly reflect distribution overlap rather than true out-of-distribution robustness.

    Authors: We will revise §5.1 to include an explicit table (or enumerated list) that states every attack family used when learning the projection versus every family held out for the unseen test split, together with the exact train/test partition. This will eliminate ambiguity and allow readers to verify that the reported results reflect generalization to truly unseen attack types rather than partial overlap. revision: yes

Circularity Check

0 steps flagged

No circularity: RCS derives contrastive scores from learned projections on internal representations

full rationale

The paper introduces Representational Contrastive Scoring by inspecting LVLM internal geometry and learning a lightweight projection to separate benign vs. malicious inputs in safety-critical layers, then computing a simple contrastive score. This construction is not self-referential: the projection parameters are fitted from data (not defined in terms of the target score), the score is a downstream statistic on the projected representations, and the generalization claims are evaluated on a held-out protocol for unseen attack types. No equations reduce the output to the inputs by construction, no self-citations are load-bearing for the core claim, and no ansatz or uniqueness theorem is smuggled in. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claim implicitly rests on the unverified premise that internal representations contain separable safety signals learnable via lightweight projection.

pith-pipeline@v0.9.0 · 5545 in / 1164 out tokens · 39921 ms · 2026-05-16T22:25:31.844506+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

    cs.CR 2026-05 accept novelty 6.0

    JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Refusal in language models is mediated by a single direction. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neu- ral Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024. Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, and Vitaly Shmatikov. 2023. Abusing images and sounds for indir...

  2. [2]

    A realistic threat model for large language model jailbreaks

    Representation learning: A review and new perspectives.IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828. Valentyn Boreiko, Alexander Panfilov, Vaclav V o- racek, Matthias Hein, and Jonas Geiping. 2024. An interpretable n-gram perplexity threat model for large language model jailbreaks.ArXiv preprint, abs/2410.16222. Patrick ...

  3. [3]

    In2025 IEEE Conference on Se- cure and Trustworthy Machine Learning (SaTML), pages 23–42

    Jailbreaking black box large language models in twenty queries. In2025 IEEE Conference on Se- cure and Trustworthy Machine Learning (SaTML), pages 23–42. IEEE. Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, and Jindong Gu. 2024a. Llm jailbreak detection for (almost) free!ArXiv preprint, abs/2409.14558. Jianhui Chen, Xiaozhi Wang, Zijun Yao...

  4. [4]

    Scalable defense against in-the-wild jailbreaking attacks with safety context retrieval,

    Scalable defense against in-the-wild jailbreak- ing attacks with safety context retrieval.ArXiv preprint, abs/2505.15753. Yangyi Chen, Hongcheng Gao, Ganqu Cui, Fanchao Qi, Longtao Huang, Zhiyuan Liu, and Maosong Sun

  5. [5]

    Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations

    Why should adversarial perturbations be im- perceptible? rethink the research paradigm in adver- sarial NLP. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Process- ing, pages 11222–11237, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Yilun Chen, Ami Wiesel, and Alfred O Hero. 2011. Robust shrin...

  6. [6]

    InThe Twelfth International Con- ference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024

    Safe RLHF: safe reinforcement learning from human feedback. InThe Twelfth International Con- ference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, and Bo Li. 2025. Duoguard: A two-player rl-driven framework for multilingual llm guardrails.ArXiv preprint, abs/2502.05163....

  7. [7]

    Improving Reconstruction Autoencoder Out-of-distribution Detection with Mahalanobis Distance

    Improving reconstruction autoencoder out- of-distribution detection with mahalanobis distance. ArXiv preprint, abs/1812.02765. Xuefeng Du, Reshmi Ghosh, Robert Sim, Ahmed Salem, Vitor Carvalho, Emily Lawton, Yixuan Li, and Jack W Stokes. 2024. Vlmguard: Defending vlms against malicious prompts via unlabeled data. ArXiv preprint, abs/2410.00296. Xuefeng Du...

  8. [8]

    InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27,

    Agent smith: A single image can jailbreak one million multimodal LLM agents exponentially fast. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27,

  9. [9]

    Prannaya Gupta, Le Qi Yau, Hao Han Low, I Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, and 1 others

    OpenReview.net. Prannaya Gupta, Le Qi Yau, Hao Han Low, I Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, and 1 others. 2024. Walledeval: A comprehensive safety evaluation toolkit for large language models. ArXiv preprint, abs/2408.03837. Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kriste...

  10. [10]

    Lukas Helff, Felix Friedrich, Manuel Brack, Kristian Kersting, and Patrick Schramowski

    Towards llm guardrails via sparse representa- tion steering.ArXiv preprint, abs/2503.16851. Lukas Helff, Felix Friedrich, Manuel Brack, Kristian Kersting, and Patrick Schramowski. 2024. Llava- guard: An open vlm-based framework for safeguard- ing vision datasets and models.ArXiv preprint, abs/2406.05113. Dan Hendrycks and Kevin Gimpel. 2017. A baseline fo...

  11. [11]

    A simple unified framework for detecting out- of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Pro- cessing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 7167–7177. Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, and Yunho M...

  12. [12]

    Fight back against jailbreaking via prompt ad- versarial tuning. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neu- ral Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024. Yi Nian, Shenzhe Zhu, Yuehan Qin, Li Li, Ziyi Wang, Chaowei Xiao, and Yue Zhao. 2025. Jaildam: Jail- break d...

  13. [13]

    Steering Llama 2 via Contrastive Activation Addition

    Steering llama 2 via contrastive activation addition.ArXiv preprint, abs/2312.06681. Vladimir Pestov. 2013. Is the k-nn classifier in high dimensions affected by the curse of dimensional- ity?Computers & Mathematics with Applications, 65(10):1427–1437. Anirudh Phukan, Divyansh Divyansh, Harshit Kumar Morj, Vaishnavi Vaishnavi, Apoorv Saxena, and Koustava ...

  14. [14]

    Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

    Great, now write an article about that: The crescendo multi-turn llm jailbreak attack.ArXiv preprint, abs/2404.01833. Christian Schlarmann and Matthias Hein. 2023. On the adversarial robustness of multi-modal foundation models. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 3677– 3685. Vikash Sehwag, Mung Chiang, and Pr...

  15. [15]

    InThe Twelfth International Conference on Learning Rep- resentations, ICLR 2024, Vienna, Austria, May 7-11,

    Jailbreak in pieces: Compositional adversar- ial attacks on multi-modal language models. InThe Twelfth International Conference on Learning Rep- resentations, ICLR 2024, Vienna, Austria, May 7-11,

  16. [16]

    do anything now

    OpenReview.net. Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. " do anything now": Charac- terizing and evaluating in-the-wild jailbreak prompts on large language models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685. Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Gui...

  17. [17]

    InInternational Conference on Ma- chine Learning, ICML 2022, 17-23 July 2022, Balti- more, Maryland, USA, volume 162 ofProceedings of Machine Learning Research, pages 20827–20840

    Out-of-distribution detection with deep near- est neighbors. InInternational Conference on Ma- chine Learning, ICML 2022, 17-23 July 2022, Balti- more, Maryland, USA, volume 162 ofProceedings of Machine Learning Research, pages 20827–20840. PMLR. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsun...

  18. [18]

    Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, and Chengzhi Mao

    Defending large language models against jailbreak attacks via layer-specific editing.ArXiv preprint, abs/2405.18166. Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, and Chengzhi Mao. 2025c. Diversity helps jailbreak large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computationa...

  19. [19]

    jailbreak the jailbreak

    On evaluating adversarial robustness of large vision-language models. InAdvances in Neural In- formation Processing Systems 36: Annual Confer- ence on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng....

  20. [20]

    In-the-Wild

    identified in Figure 10. For instance, Layer 17, selected by the “In-the-Wild” strategy, yields marginally higher AUROC than Layer 16. Table 14: Top 3 discriminative layers identified by each dataset setup. The identified layers consistently fall within the optimal “sweet spot” range (Layers 14–20). Experiment Top 3 Layers (Score) SGXSTest (Baseline) L16 ...

  21. [21]

    The framework is designed to be model-agnostic and can be deployed on top of existing LVLMs

    follows the principles outlined in the original paper, creating a universal detection framework that identifies prompt-based attacks by analyzing the response consistency of a model to mutated inputs. The framework is designed to be model-agnostic and can be deployed on top of existing LVLMs. E.3.1 Detection Framework The core of JailGuard operates on the...

  22. [22]

    By default,N= 8

    Input Mutation: For a given input prompt (either text-only or multimodal), JailGuard generates a set of N variants using a selected mutation strategy. By default,N= 8

  23. [23]

    These responses are collected for consistency analy- sis

    Response Generation: Each of the N vari- ants is then passed to the target LVLM to ob- tain a corresponding set of responses. These responses are collected for consistency analy- sis

  24. [24]

    methods for analyzing the facial features of a person to predict their personality traits

    Divergence-Based Detection: The semantic similarity between all pairs of responses is cal- culated using a pre-trained ‘en_core_web_md’ model from the spaCy library. These similari- ties are used to construct a similarity matrix, which is then converted into a divergence ma- trix using Kullback-Leibler (KL) divergence (Zhang et al., 2023). If the maximum ...

  25. [25]

    The CAPTCHA in the image reads ’takrth 1863’

    The semantic embedding is dominated by this shared prefix, resulting inlow divergencebe- tween the refusal and the jail- break. JailGuard fails to detect the suffix difference. VizWiz Benign High Vari- ance Text:(Benign VQA request to read the blurry CAPTCHA) Response 1:“The CAPTCHA in the image reads ’takrth 1863’.” Response 2:“The captcha in the im- age...