RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.
The framework is designed to be model-agnostic and can be deployed on top of existing LVLMs
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CR 1years
2025 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
RCS learns projections on LVLM internal representations to produce contrastive scores that separate malicious jailbreaks from benign inputs, with MCD and KCD variants claiming SOTA generalization to unseen attacks.