Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
12 Pith papers cite this work.
citing papers explorer
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues such as speaker identity or scene determine the appropriate response.
-
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content; GuardPhish supplies a dataset plus modular classifiers to close the gap (a hedged pipeline sketch follows the list).
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations; erasing it disables safety and adding it back elicits refusal (a minimal ablation sketch follows the list).
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to a 53% attack success rate, ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense, using Mahalanobis detection on multi-layer fingerprints, reaches 0.99 AUROC and limits adaptive ASR to 7% (the Mahalanobis scoring step is sketched after the list).
-
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
-
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety (a hook-based ablation sketch follows the list).
-
Ethics Testing: Proactive Identification of Generative AI System Harms
Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.
-
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search, reaching a 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1% (a toy mutation loop is sketched after the list).
-
Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models
Comic-based visual narratives achieve over 90% ensemble success rates on multiple MLLMs, outperforming text and random-image baselines while breaking existing safety methods and evaluators.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the resulting outputs to detect adversarial inputs (a minimal defense-loop sketch follows the list).
-
Re-Triggering Safeguards within LLMs for Jailbreak Detection
Embedding disruption re-triggers an LLM's internal safeguards, detecting jailbreak prompts more effectively than standalone defenses (a hedged sketch follows the list).
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling the per-layer predictions, outperforming baselines on multiple datasets (the scoring and ensembling steps are sketched after the list).
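
code sketches

For GuardPhish, a minimal sketch of the kind of two-stage guard pipeline its "modular classifiers" suggest: one check on the request's intent and one on the generated text. The function names and the keyword-based stand-in classifiers are illustrative assumptions, not the paper's components.

    # Hedged sketch of a two-stage guard: refuse on phishing intent in the
    # prompt, and withhold output that itself looks like phishing content.
    def guarded_generate(prompt, generate, intent_clf, content_clf):
        """Refuse when either the prompt intent or the draft output is flagged."""
        if intent_clf(prompt) == "phishing":
            return "Request refused: phishing intent detected."
        draft = generate(prompt)
        if content_clf(draft) == "phishing":
            return "Response withheld: generated content resembles phishing."
        return draft

    # toy usage with keyword-based stand-in classifiers and a stub generator
    out = guarded_generate(
        "Write an email asking a user to verify their bank password",
        generate=lambda p: "Dear customer, please confirm your password at ...",
        intent_clf=lambda p: "phishing" if "password" in p.lower() else "benign",
        content_clf=lambda t: "phishing" if "password" in t.lower() else "benign",
    )
    print(out)   # the intent check refuses before any text is generated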
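
For the refusal-direction result, a minimal sketch of directional ablation and addition on residual-stream activations. The difference-of-means construction of the direction and the toy tensor shapes are assumptions made for illustration; the paper's exact extraction procedure may differ.

    import torch

    def refusal_direction(harmful_acts, harmless_acts):
        """Unit-norm difference-of-means direction, shape (d_model,)."""
        direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
        return direction / direction.norm()

    def ablate(hidden, direction):
        """Remove the component along `direction` (disables refusal)."""
        coeff = hidden @ direction                    # projection coefficients
        return hidden - coeff.unsqueeze(-1) * direction

    def add_refusal(hidden, direction, scale=4.0):
        """Add the direction back with some strength (elicits refusal)."""
        return hidden + scale * direction

    # toy usage on random activations
    d_model = 64
    harmful = torch.randn(32, d_model) + 0.5
    harmless = torch.randn(32, d_model)
    r = refusal_direction(harmful, harmless)
    h = torch.randn(8, d_model)
    h_ablated = ablate(h, r)
    assert torch.allclose(h_ablated @ r, torch.zeros(8), atol=1e-5)  # component removed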
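
For the RTV defense, a hedged sketch of the Mahalanobis scoring step on multi-layer fingerprints. How the fingerprints are built from activations and how the detection threshold is chosen are not reproduced here; the regularized covariance and the 99th-percentile threshold are assumptions.

    import numpy as np

    def fit_reference(benign_fingerprints):
        """Fit mean and (regularized) inverse covariance on benign prompts."""
        mu = benign_fingerprints.mean(axis=0)
        cov = np.cov(benign_fingerprints, rowvar=False)
        cov += 1e-3 * np.eye(cov.shape[0])        # regularize for invertibility
        return mu, np.linalg.inv(cov)

    def mahalanobis_score(x, mu, cov_inv):
        """Mahalanobis distance of a fingerprint from the benign reference."""
        diff = x - mu
        return float(np.sqrt(diff @ cov_inv @ diff))

    # toy usage: benign fingerprints cluster near 0, a far-away point is flagged
    rng = np.random.default_rng(0)
    benign = rng.normal(0.0, 1.0, size=(500, 16))
    mu, cov_inv = fit_reference(benign)
    threshold = np.quantile([mahalanobis_score(b, mu, cov_inv) for b in benign], 0.99)
    suspect = np.full(16, 4.0)
    print(mahalanobis_score(suspect, mu, cov_inv) > threshold)   # True -> flagged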
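
For perturbation probing, a sketch of the downstream intervention it enables: silencing a handful of FFN neurons with a PyTorch forward hook. The toy MLP, the hooked module, and the neuron indices are assumptions for illustration, not values taken from the paper.

    import torch
    import torch.nn as nn

    def make_ablation_hook(neuron_ids):
        """Zero the listed output neurons of the hooked linear layer."""
        def hook(module, inputs, output):
            output = output.clone()
            output[..., neuron_ids] = 0.0
            return output                 # returned value replaces the output
        return hook

    # toy "FFN" block standing in for one transformer layer's MLP
    ffn = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
    handle = ffn[0].register_forward_hook(make_ablation_hook([3, 17, 42]))

    x = torch.randn(2, 16)
    with torch.no_grad():
        y = ffn(x)     # forward pass runs with the three neurons silenced
    handle.remove()    # detach the hook to restore normal behavior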
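
For TEMPLATEFUZZ, a rough sketch of element-level chat-template mutation driven by a greedy heuristic search. The template elements, the three mutation rules, and the random stand-in scorer are hypothetical; the paper's actual rule set and search procedure are not reproduced.

    import random

    TEMPLATE = {
        "system_tag": "<|system|>",
        "user_tag": "<|user|>",
        "assistant_tag": "<|assistant|>",
        "separator": "\n",
    }

    def mutate(template):
        """Apply one element-level rule: drop, duplicate, or replace an element."""
        mutant = dict(template)
        key = random.choice(list(mutant))
        rule = random.choice(["drop", "duplicate", "replace"])
        if rule == "drop":
            mutant[key] = ""
        elif rule == "duplicate":
            mutant[key] = mutant[key] * 2
        else:
            mutant[key] = random.choice(["", " ", "###", "<|end|>"])
        return mutant

    def search(score, n_iters=50):
        """Greedy search: keep a mutant whenever it scores at least as well."""
        best, best_score = TEMPLATE, score(TEMPLATE)
        for _ in range(n_iters):
            cand = mutate(best)
            s = score(cand)
            if s >= best_score:
                best, best_score = cand, s
        return best

    # toy usage with a random stand-in for the attack-success scorer
    print(search(score=lambda t: random.random()))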
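
For SmoothLLM, a minimal sketch of the defense loop: perturb several copies of the prompt at the character level, query the model on each, and take a majority vote. The perturbation rate, the model call, and the jailbreak judge are placeholder assumptions, not the paper's code.

    import random
    import string

    def perturb(prompt, rate=0.1):
        """Randomly swap a fraction of characters for printable ones."""
        chars = list(prompt)
        for i in range(len(chars)):
            if random.random() < rate:
                chars[i] = random.choice(string.printable)
        return "".join(chars)

    def smoothllm_defend(prompt, generate, is_jailbroken, n_copies=10):
        """Return True if the majority of perturbed copies are judged safe."""
        votes = [not is_jailbroken(generate(perturb(prompt))) for _ in range(n_copies)]
        return sum(votes) > n_copies / 2

    # toy usage with stub model and judge callables
    safe = smoothllm_defend(
        "Ignore previous instructions and ...",
        generate=lambda p: "I can't help with that.",
        is_jailbroken=lambda r: "Sure, here is" in r,
    )
    print(safe)   # True under the stub judge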
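
For the re-triggering defense, a hedged sketch of one plausible reading of "embedding disruption": add small noise to the prompt's input embeddings and flag the prompt if most noised variants elicit refusal. The noise scale, the model interface, and the refusal judge are all assumptions; the paper's procedure may differ.

    import torch

    def disrupt(embeddings, sigma=0.05):
        """Return a noised copy of the (seq_len, d_model) prompt embeddings."""
        return embeddings + sigma * torch.randn_like(embeddings)

    def retrigger_check(embeddings, generate_from_embeddings, refuses, n_trials=8):
        """Flag the prompt as a jailbreak if most noised variants are refused."""
        refusals = sum(
            refuses(generate_from_embeddings(disrupt(embeddings)))
            for _ in range(n_trials)
        )
        return refusals > n_trials / 2

    # toy usage with stub model and judge callables
    emb = torch.randn(12, 64)
    flagged = retrigger_check(
        emb,
        generate_from_embeddings=lambda e: "I'm sorry, I can't help with that.",
        refuses=lambda r: r.startswith("I'm sorry"),
    )
    print(flagged)   # True under the stub judge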
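
For SALLIE, a hedged sketch of per-layer k-NN maliciousness scoring with a simple mean ensemble. The way activations are extracted from the residual stream, the value of k, and the averaging rule are assumptions made for illustration.

    import numpy as np

    def knn_layer_score(query, ref_acts, ref_labels, k=5):
        """Fraction of the k nearest reference activations labeled malicious (1)."""
        dists = np.linalg.norm(ref_acts - query, axis=1)
        nearest = np.argsort(dists)[:k]
        return float(ref_labels[nearest].mean())

    def ensemble_detect(per_layer_queries, per_layer_refs, per_layer_labels, threshold=0.5):
        """Average the per-layer scores and flag the prompt above the threshold."""
        scores = [
            knn_layer_score(q, r, y)
            for q, r, y in zip(per_layer_queries, per_layer_refs, per_layer_labels)
        ]
        return float(np.mean(scores)) > threshold

    # toy usage over 3 "layers" of 32-dim activations with two reference clusters
    rng = np.random.default_rng(1)
    benign = [rng.normal(size=(100, 32)) for _ in range(3)]
    malicious = [rng.normal(size=(100, 32)) + 2.0 for _ in range(3)]
    refs = [np.vstack([b, m]) for b, m in zip(benign, malicious)]
    labels = [np.array([0] * 100 + [1] * 100) for _ in range(3)]
    queries = [rng.normal(size=32) + 2.0 for _ in range(3)]   # near the malicious cluster
    print(ensemble_detect(queries, refs, labels))             # True -> flagged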