A cross-language investigation into jailbreak attacks in large language models
7 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (2)
citing papers explorer
- TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages
  TukaBench extends JailbreakBench to African languages via human translation, cultural adaptation, curated prompts, and code-switching, finding lower refusal rates for culturally grounded prompts and surfacing comprehension and judging limitations.
- Multilingual Safety Alignment via Self-Distillation
  MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
- Cross-Lingual Jailbreak Detection via Semantic Codebooks
  Semantic similarity to an English jailbreak codebook detects cross-lingual attacks with high accuracy on curated benchmarks but shows poor separability on diverse unsafe prompts.
- One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection
  MLJailDe achieves 98.5% F1 on multilingual jailbreak detection by combining back-translation data augmentation, supervised contrastive loss, and imbalance-aware classification on a DeBERTa backbone.
- Sentra-Guard: A Real-Time Multilingual Defense Against Adversarial LLM Prompts
  Sentra-Guard reports 99.96% detection of adversarial LLM prompts with AUC 1.00 and ASR of 0.004% using a hybrid SBERT-FAISS and transformer classifier architecture with multilingual translation and human feedback.
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey
  A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
- Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
  A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.