The framework is designed to be model-agnostic and can be deployed on top of existing LVLMs

follows the principles outlined in the original paper, creating a universal detection framework that identifies prompt-based attacks by analyzing the response consistency of a mode

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

cs.CR · 2025-12-12 · unverdicted · novelty 6.0

RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.
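The summary above describes scoring inputs by contrasting them against malicious and benign examples in representation space. As a rough illustration of that general idea only, the sketch below scores a vector by how much closer it lies to a malicious centroid than to a benign one. It is not the RCS, MCD, or KCD method: there is no learned projection, the toy 2-D vectors stand in for LVLM internal representations, and every name (`centroid`, `contrastive_score`, `is_flagged`, the zero threshold) is a hypothetical choice made for this example.

```python
import math

def centroid(vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def contrastive_score(x, benign_centroid, malicious_centroid):
    """Distance to the benign centroid minus distance to the malicious one.

    Positive values mean x lies closer to the malicious centroid.
    """
    return math.dist(x, benign_centroid) - math.dist(x, malicious_centroid)

# Toy 2-D vectors, made up for illustration (assumption: real inputs would be
# high-dimensional hidden states extracted from an LVLM).
benign = [[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]]
malicious = [[5.0, 5.0], [5.0, 7.0], [7.0, 5.0], [7.0, 7.0]]

mu_benign = centroid(benign)
mu_malicious = centroid(malicious)

def is_flagged(x, threshold=0.0):
    """Flag x when its contrastive score exceeds the (hypothetical) threshold."""
    return contrastive_score(x, mu_benign, mu_malicious) > threshold
```

In this sketch the threshold would be tuned on held-out data; a fixed value of zero simply splits the space halfway between the two centroids.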

citing papers explorer

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
cs.CR · 2025-12-12 · unverdicted · none · ref 21
