arxiv: 2604.25716 · v1 · submitted 2026-04-28 · 💻 cs.CL · cs.AI

Recognition: unknown

Cross-Lingual Jailbreak Detection via Semantic Codebooks

Shirin Alanova , Bogdan Minko , Sabrina Sadiekh , Evgeniy Kokuykin

Authors on Pith no claims yet

Pith reviewed 2026-05-07 16:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords cross-lingual jailbreak detectionsemantic similaritycodebookLLM safetymultilingual embeddingstraining-free guardrail

0 comments

The pith

Semantic similarity to a fixed English jailbreak codebook detects translated attacks reliably on canonical benchmarks but degrades under distribution shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether comparing embeddings of queries in other languages to a static English collection of known jailbreak examples can catch malicious prompts without any language-specific training. On benchmarks built from standard jailbreak templates the method separates safe and unsafe inputs almost perfectly across languages and cuts attack success rates under tight false-positive limits. When the unsafe prompts become more varied and less template-like the separation weakens sharply. The result is a simple external guardrail that works for black-box models and requires only existing embedding tools.

Core claim

Semantic similarity between multilingual query embeddings and a fixed English codebook of jailbreak prompts achieves near-perfect separability (AUC up to 0.99) on curated benchmarks containing canonical jailbreak templates and yields substantial reductions in attack success rates under strict low-FPR constraints, yet separability falls to AUC approximately 0.60-0.70 and recall drops in the low-FPR regime on behaviorally diverse heterogeneous unsafe benchmarks.

What carries the argument

Fixed English codebook of jailbreak prompts against which multilingual query embeddings are compared via semantic similarity.

If this is right

Substantial reductions in absolute attack success rates occur under strict low-false-positive constraints on known template attacks.
The approach works across four languages and two translation pipelines without any language-specific adaptation or retraining.
It functions as a training-free external guardrail applicable to black-box target models including Qwen, Llama and GPT-3.5.
Two distinct performance regimes appear: reliable detection on curated canonical templates and markedly weaker detection on diverse unsafe content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Because detection depends on how well the codebook covers attack patterns, novel or creative jailbreaks outside those patterns may evade the method.
The performance gap between curated and heterogeneous benchmarks suggests that future evaluations should include wider varieties of unsafe content to test robustness.
Similar degradation across different embedding models indicates the limitation is tied more to the diversity of unsafe content than to any single embedding choice.

Load-bearing premise

Embedding models preserve semantic similarity between translated jailbreak prompts and the English codebook strongly enough to remain discriminative even when unsafe content becomes more varied.

What would settle it

Measure AUC and low-FPR recall on a fresh collection of translated prompts drawn from heterogeneous unsafe sources that use phrasing and behaviors outside the templates stored in the codebook; sustained high AUC would support the claim while a drop to the 0.60-0.70 range would falsify reliable cross-lingual generalization.

Figures

Figures reproduced from arXiv: 2604.25716 by Bogdan Minko, Evgeniy Kokuykin, Sabrina Sadiekh, Shirin Alanova.

**Figure 1.** Figure 1: Overview of the proposed cross-lingual semantic filtering framework. Incoming user input (in any language) is encoded using a multilingual embedding model and compared against a fixed English codebook of jailbreak prompts. If the maximum cosine similarity exceeds a predefined threshold, the query is blocked; otherwise, it is forwarded to the target LLM. The approach operates as a training-free external gua… view at source ↗

**Figure 2.** Figure 2: Representative ROC curves illustrating cross-lingual transfer (English vs. Russian; M2M100 vs. Google Translate) on Benchmark 1 (cleaner) and Benchmark 4 (noisier). canonical jailbreak templates remain sufficiently close to the English codebook even after translation. On Benchmark 4, however, recall collapses to single digits (3.3–6.4%) across all languages and embedding models. When FPR is tightly constra… view at source ↗

read the original abstract

Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts, operating as a training-free external guardrail for black-box LLMs. We conduct a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5). Our results reveal two distinct regimes of cross-lingual transfer. On curated benchmarks containing canonical jailbreak templates, semantic similarity generalizes reliably across languages, achieving near-perfect separability (AUC up to 0.99) and substantial reductions in absolute attack success rates under strict low-false-positive constraints. However, under distribution shift - on behaviorally diverse and heterogeneous unsafe benchmarks - separability degrades markedly (AUC $\approx$ 0.60-0.70), and recall in the security-critical low-FPR regime drops across all embedding models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Semantic matching to a fixed English jailbreak codebook works on canonical translated templates but drops sharply on heterogeneous unsafe prompts, with the high AUC likely tied to template overlap rather than broad generalization.

read the letter

The main takeaway is that this training-free guardrail separates jailbreak attempts reliably only when the attacks stay close to known English templates; on more varied unsafe content the signal weakens to near-random levels across the tested embeddings and models. The evaluation covers four languages, two translation setups, four benchmarks, three embeddings, and three target LLMs, which is a solid sweep for an empirical paper. It usefully documents the two regimes with concrete AUC numbers and attack-success reductions at low false-positive thresholds, showing where the method adds value without retraining. The approach itself is simple and directly applicable to black-box APIs, which is a practical plus. The drop on heterogeneous benchmarks is reported plainly, which keeps the claims grounded. The soft spot is that the curated benchmarks appear to use canonical templates that could already sit near the codebook entries. Without details on codebook construction, size, or any overlap analysis, the near-perfect separability on those sets could be an in-distribution artifact rather than evidence of robust cross-lingual semantic transfer. The abstract also omits baseline comparisons, error bars, and statistical tests, so the magnitude of the reported gains is hard to judge precisely. The heterogeneous results are the more informative part but also limit how much this helps real deployments. This is worth reading for anyone building lightweight safety layers for non-English LLM use. The scale of the multi-model, multi-language tests makes it suitable for peer review, though referees would need to see codebook details and controls for template leakage before the claims can be taken as general.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a training-free method for detecting cross-lingual jailbreaks by measuring semantic similarity between embeddings of multilingual queries and a fixed English codebook of jailbreak prompts. It reports a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5), identifying two regimes: near-perfect separability (AUC up to 0.99) and reduced attack success rates on curated benchmarks containing canonical templates, versus markedly degraded separability (AUC ≈0.60-0.70) and lower recall at low FPR on heterogeneous, behaviorally diverse unsafe benchmarks.

Significance. If the empirical distinction between regimes holds after addressing methodological gaps, the work demonstrates that fixed English semantic codebooks can function as effective language-agnostic external guardrails for known jailbreak patterns without retraining or adaptation. The finding that performance degrades under distribution shift to heterogeneous content is a useful cautionary result for multilingual LLM safety, and the multi-dimensional evaluation (languages, embeddings, LLMs) provides a solid empirical foundation for follow-on research.

major comments (3)

[Abstract] Abstract: The central claim of reliable cross-lingual generalization on curated benchmarks rests on the assumption that high AUC reflects semantic similarity rather than overlap between the canonical jailbreak templates and the English codebook entries. No details are provided on codebook construction, size, or overlap checks, leaving open that separability may be by construction rather than demonstrating language-agnostic detection of unseen semantics.
[Evaluation] Evaluation sections: The reported AUC differences and attack-success reductions lack error bars, statistical significance tests, or explicit baseline comparisons (e.g., to language-specific detectors or non-semantic heuristics). This makes it difficult to assess whether the two-regime distinction is robust or an artifact of the post-hoc split and chosen thresholds.
[Abstract] Abstract and method description: The weakest assumption—that off-the-shelf multilingual embeddings preserve discriminative similarity to the fixed English codebook after translation and under distribution shift—is not ablated (e.g., via translation-quality controls or codebook-diversity tests). Without such checks, the degradation on heterogeneous benchmarks cannot be confidently attributed to semantic limits rather than embedding or benchmark artifacts.

minor comments (2)

[Abstract] The abstract states 'substantial reductions in absolute attack success rates' under low-FPR constraints but provides no numerical values or exact FPR thresholds used.
Notation for AUC, FPR, and recall should be introduced consistently in the main text if not assumed standard.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional transparency and controls will strengthen the manuscript. We address each major comment below and will incorporate revisions to improve methodological clarity and rigor while preserving the core empirical findings on the two performance regimes.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim of reliable cross-lingual generalization on curated benchmarks rests on the assumption that high AUC reflects semantic similarity rather than overlap between the canonical jailbreak templates and the English codebook entries. No details are provided on codebook construction, size, or overlap checks, leaving open that separability may be by construction rather than demonstrating language-agnostic detection of unseen semantics.

Authors: We agree that explicit details on codebook construction are required to substantiate the claim. In the revised manuscript we will add a dedicated paragraph in the method section describing the codebook assembly process (sourced from established public collections of English jailbreak templates), its size, and the results of overlap analyses (both lexical and semantic) with the test sets. These additions will demonstrate that separability on curated benchmarks is not due to direct template overlap. revision: yes
Referee: [Evaluation] Evaluation sections: The reported AUC differences and attack-success reductions lack error bars, statistical significance tests, or explicit baseline comparisons (e.g., to language-specific detectors or non-semantic heuristics). This makes it difficult to assess whether the two-regime distinction is robust or an artifact of the post-hoc split and chosen thresholds.

Authors: We will strengthen the evaluation section by adding bootstrap-derived confidence intervals for all AUC and attack-success metrics, along with appropriate statistical tests (e.g., DeLong tests for AUC comparisons and McNemar tests for attack-success differences) between the two regimes. We will also include baseline comparisons against simple non-semantic heuristics such as keyword presence and prompt-length filters. We will further clarify that the regime distinction was defined a priori by benchmark properties (canonical templates versus heterogeneous user-generated unsafe prompts) rather than optimized post hoc. revision: yes
Referee: [Abstract] Abstract and method description: The weakest assumption—that off-the-shelf multilingual embeddings preserve discriminative similarity to the fixed English codebook after translation and under distribution shift—is not ablated (e.g., via translation-quality controls or codebook-diversity tests). Without such checks, the degradation on heterogeneous benchmarks cannot be confidently attributed to semantic limits rather than embedding or benchmark artifacts.

Authors: We will add the requested ablations to the revised paper. These will include (1) translation-quality controls reporting embedding cosine similarity between source and translated prompts, (2) codebook-diversity tests measuring performance sensitivity to codebook size and composition, and (3) cross-model consistency checks on how well jailbreak semantics are preserved after translation. While these controls will help isolate semantic factors from artifacts, we note that completely disentangling all possible confounds in multilingual embedding spaces remains inherently difficult. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on external benchmarks

full rationale

The paper describes a training-free method that embeds queries with off-the-shelf multilingual models and compares them to a fixed English jailbreak codebook. All reported results (AUC values, attack success rates) come from direct evaluation on four external safety benchmarks across languages and models. No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the derivation chain. The central claim rests on observable separability under the stated experimental conditions rather than any reduction to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested premise that current multilingual embedding models encode jailbreak semantics in a transferable way; no free parameters or new entities are introduced.

axioms (1)

domain assumption Multilingual embedding models preserve semantic similarity between translated jailbreak prompts and English examples sufficiently for detection.
Directly invoked by the method of comparing query embeddings to the fixed English codebook without language-specific adaptation.

pith-pipeline@v0.9.0 · 5528 in / 1238 out tokens · 48484 ms · 2026-05-07T16:17:36.434248+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 14 canonical work pages · 6 internal anchors

[1]

Embedding-based classifiers can detect prompt injection attacks,

Md. Ahsan Ayub and Subhabrata Majumdar. Embedding-based classifiers can detect prompt injection attacks.arXiv preprint arXiv:2410.22284, 2024

work page arXiv 2024
[2]

Bell, Eduardo Sánchez, David Dale, Pontus Stenetorp, Mikel Artetxe, and Marta R

Samuel J. Bell, Eduardo Sánchez, David Dale, Pontus Stenetorp, Mikel Artetxe, and Marta R. Costa-jussà. Translate, then detect: Leveraging machine transla- tion for cross-lingual toxicity classification.arXiv preprint arXiv:2509.14493, 2025

work page arXiv 2025
[3]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym An- driushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models.arXiv preprint arXiv:2404.01318, 2024

work page internal anchor Pith review arXiv 2024
[4]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE-M3-embedding: Multi-lingual, multi-functionality, multi- granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216, 2024

work page internal anchor Pith review arXiv 2024
[5]

Improved large language model jailbreak detection via pretrained embeddings.arXiv preprint arXiv:2412.01547, 2024

Erick Galinkin and Martin Sablotny. Improved large language model jailbreak detection via pretrained embeddings.arXiv preprint arXiv:2412.01547, 2024

work page arXiv 2024
[6]

AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts.arXiv preprint arXiv:2404.05993, 2024

Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien. AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts.arXiv preprint arXiv:2404.05993, 2024

work page arXiv 2024
[7]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021

work page internal anchor Pith review arXiv 2021
[8]

Prompt injection safety dataset

Jayavibhav. Prompt injection safety dataset. https://huggingface.co/dat asets/jayavibhav/prompt-injection-safety, 2024. Accessed: 2026

2024
[9]

arXiv preprint arXiv:2410.22770 , year=

Hao Li and Xiaogeng Liu. InjecGuard: Benchmarking and mitigating over- defense in prompt injection guardrail models.arXiv preprint arXiv:2410.22770, 2025

work page arXiv 2025
[10]

A Cross-Language Investigation into Jailbreak Attacks in Large Language Models

Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, and Yinxing Xue. A cross-language investigation into jailbreak attacks in large language models.arXiv preprint arXiv:2401.16765, 2024

work page arXiv 2024
[11]

What matters for safety alignment?arXiv preprint arXiv:2601.03868, 2026

Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, and Mingxuan Yuan. What matters for safety alignment?arXiv preprint arXiv:2601.03868, 2026

work page arXiv 2026
[12]

Prompt-guard-86m

Meta. Prompt-guard-86m. https://huggingface.co/meta-llama/Promp t-Guard-86M, 2024. Accessed: 2026

2024
[13]

Examining multilin- gual embedding models cross-lingually through LLM-generated adversarial examples.arXiv preprint arXiv:2502.08638, 2025

Andrianos Michail, Simon Clematide, and Rico Sennrich. Examining multilin- gual embedding models cross-lingually through LLM-generated adversarial examples.arXiv preprint arXiv:2502.08638, 2025. 13

work page arXiv 2025
[14]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

work page internal anchor Pith review arXiv 2022
[15]

Lost in alignment: A survey on cross-lingual alignment methods for contextualized representation.ACM Computing Surveys, 58(5), 2025

Filippo Pallucchini, Lorenzo Malandri, Fabio Mercorio, and Mario Mezzan- zanica. Lost in alignment: A survey on cross-lingual alignment methods for contextualized representation.ACM Computing Surveys, 58(5), 2025

2025
[16]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.arXiv preprint arXiv:2305.18290, 2024

work page internal anchor Pith review arXiv 2024
[17]

SPIRE: Detecting prompt injection in zero-day using semantic matching

Segal. SPIRE: Detecting prompt injection in zero-day using semantic matching. Engineering @ ActiveFence Blog, 2025. URL https://engineering.acti vefence.com

2025
[18]

The language barrier: Dissecting safety challenges of LLMs in multilingual contexts

Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of LLMs in multilingual contexts. InFindings of the Association for Computational Linguistics: ACL 2024, pages 2668–2680, Bangkok, Thailand, 2024. Association for Computational Li...

2024
[19]

The state of multilingual llm safety research: From measuring the language gap to mitigating it

Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen Bach, and Julia Kreutzer. The state of multilingual llm safety research: From measuring the language gap to mitigating it. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15856–15871, 2025

2025
[20]

Qwen3Guard Technical Report

Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3Guard technical report.arXiv preprint arXiv:2510.14276, 2025. 14 A Dataset Statistics Table 6:Datasets details by classes Dataset Total Safe Unsafe Lang. jayavibhav/prompt-injection-safety 60,000 27,971 32,029 EN xTRam...

work page internal anchor Pith review arXiv 2025