JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
17 Pith papers cite this work.
Citing papers
- HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
- AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
- The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
The paper builds a 114k compositional jailbreak dataset, fine-tunes generators for on-the-fly synthesis, and introduces OPTIMUS, a continuous evaluator that identifies stealth-optimal regimes missed by binary attack success rates.
- ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutation operators, reaching 90-100% attack success on open LLMs; the attacks transfer to some closed frontier models at rates of 15-90%.
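The mutation operators and priming templates are the paper's contribution; the surrounding search is a standard evolutionary loop. A minimal sketch of that loop, with hypothetical `target_llm`, `judge_score`, and `mutate_dialogue` callables standing in for the paper-specific pieces:

```python
import random

def evolve(seed_dialogues, target_llm, judge_score, mutate_dialogue,
           generations=20, population_size=16, elite_frac=0.25):
    """Generic evolutionary search over priming dialogues (illustrative only)."""
    population = list(seed_dialogues)
    for _ in range(generations):
        # Fitness: how harmful the target's reply is after the priming dialogue.
        scored = [(judge_score(target_llm(d)), d) for d in population]
        scored.sort(key=lambda x: x[0], reverse=True)
        elites = [d for _, d in scored[:max(1, int(elite_frac * population_size))]]
        # Refill the population with mutated copies of the elites.
        population = elites + [mutate_dialogue(random.choice(elites))
                               for _ in range(population_size - len(elites))]
    return scored[0]  # best (score, dialogue) pair from the last scored generation
```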
- Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
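The paper estimates the refusal direction as a difference of mean activations on harmful versus harmless prompts; given such a direction, the erase and add interventions are simple vector operations on the residual stream. A minimal PyTorch sketch, illustrative only:

```python
import torch

def ablate_direction(hidden, r):
    """Remove the refusal direction r from residual-stream activations hidden (..., d)."""
    r_hat = r / r.norm()
    return hidden - (hidden @ r_hat).unsqueeze(-1) * r_hat

def add_direction(hidden, r, alpha=1.0):
    """Steer toward refusal by adding the direction back with strength alpha."""
    r_hat = r / r.norm()
    return hidden + alpha * r_hat
```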
- Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
- Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
- VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
- Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
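The weight-side half of such a constraint can be realized as a projection that strips any component of a fine-tuning update lying in a safety-critical subspace. The sketch below assumes an orthonormal basis U for that subspace is already given and omits the activation-side SAE regularizer; it is a generic illustration, not the paper's exact formulation:

```python
import torch

def project_out_subspace(delta_w, U):
    """Remove the component of a weight update that lies in a safety subspace.

    delta_w: (d_out, d_in) proposed fine-tuning update for one weight matrix.
    U:       (d_out, k) orthonormal basis of safety-critical directions
             (how U is identified is the method-specific part; here it is given).
    """
    return delta_w - U @ (U.T @ delta_w)
```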
- Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
- Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
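The core of PAIR is a short refinement loop: the attacker LLM proposes a prompt, sees the target's response and a judge score, and revises. A minimal sketch with placeholder `attacker_llm`, `target_llm`, and `judge` callables; the judge's 1-10 scale follows the paper's setup:

```python
def pair_attack(objective, attacker_llm, target_llm, judge, max_queries=20):
    """Iteratively refine a jailbreak prompt with an attacker LLM (PAIR-style loop)."""
    feedback = ""
    for _ in range(max_queries):
        candidate = attacker_llm(objective, feedback)   # attacker proposes a prompt
        response = target_llm(candidate)                # target answers it
        score = judge(objective, response)              # 1 (refusal) .. 10 (compliance)
        if score >= 10:                                 # judged fully jailbroken
            return candidate, response
        # Feed the failed attempt back so the attacker can revise its strategy.
        feedback = f"Prompt: {candidate}\nResponse: {response}\nScore: {score}"
    return None, None
```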
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
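The defense relies on adversarial suffixes being brittle to character-level noise: perturbed copies of an attacked prompt mostly stop jailbreaking the model, so a vote over the copies recovers safe behavior. A minimal sketch of the swap-perturbation variant, with placeholder `query_model` and `is_jailbroken` callables:

```python
import random
from collections import Counter

def perturb(prompt, q=0.1, alphabet="abcdefghijklmnopqrstuvwxyz "):
    """Randomly swap a fraction q of characters in the prompt."""
    chars = list(prompt)
    for i in random.sample(range(len(chars)), k=max(1, int(q * len(chars)))):
        chars[i] = random.choice(alphabet)
    return "".join(chars)

def smooth_respond(prompt, query_model, is_jailbroken, n_copies=10, q=0.1):
    """Query the model on several perturbed copies and take a majority vote."""
    replies = [query_model(perturb(prompt, q)) for _ in range(n_copies)]
    votes = Counter(is_jailbroken(r) for r in replies)
    if votes[True] > votes[False]:
        return "I can't help with that."   # majority judged jailbroken: refuse
    # Otherwise return a reply consistent with the majority (non-jailbroken) vote.
    return next(r for r in replies if not is_jailbroken(r))
```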
- Re-Triggering Safeguards within LLMs for Jailbreak Detection
Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
- A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
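Fleiss' kappa measures how much the five labeling models agree beyond chance. For intuition about the reported 0.876, a minimal implementation with a hypothetical toy ratings matrix:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (n_items x n_categories) matrix of rating counts.

    counts[i, j] = number of raters (here, models) assigning item i to category j;
    every row must sum to the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item agreement and its mean.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 5 model raters, 2 categories (executable-malicious vs. knowledge).
print(fleiss_kappa([[5, 0], [4, 1], [0, 5], [1, 4]]))  # 0.6
```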
- Cross-Lingual Jailbreak Detection via Semantic Codebooks
Semantic similarity to an English jailbreak codebook detects cross-lingual attacks with high accuracy on curated benchmarks but shows poor separability on diverse unsafe prompts.
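The detector reduces to a nearest-neighbor check: embed the incoming prompt with a multilingual encoder and flag it if it is sufficiently similar to any entry in the English jailbreak codebook. A minimal sketch; the encoder, codebook, and threshold value are placeholders, not the paper's:

```python
import numpy as np

def flag_jailbreak(prompt_vec, codebook_vecs, threshold=0.75):
    """Flag a prompt whose embedding is close to any known English jailbreak template.

    prompt_vec:    (d,) embedding of the incoming (possibly non-English) prompt.
    codebook_vecs: (n, d) embeddings of known English jailbreak prompts.
    threshold:     cosine-similarity cutoff (illustrative value only).
    """
    p = prompt_vec / np.linalg.norm(prompt_vec)
    c = codebook_vecs / np.linalg.norm(codebook_vecs, axis=1, keepdims=True)
    return float((c @ p).max()) >= threshold
```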
- Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing
Auto-ART delivers the first structured synthesis of adversarial robustness consensus together with an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 percentage-point robustness gap.
- Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.
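One gradient-free way to score "unsafe" units is to contrast mean activations on harmful versus benign prompts and mask the most divergent ones; the sketch below illustrates that idea and is not the paper's exact attribution method or pruning granularity:

```python
import torch

def prune_unsafe_neurons(acts_harmful, acts_benign, weights, frac=0.01):
    """Zero out neurons whose activations differ most between harmful and benign prompts.

    acts_harmful, acts_benign: (n_prompts, d_hidden) hidden activations.
    weights: (d_out, d_hidden) downstream weight matrix reading those activations.
    """
    # Gradient-free attribution stand-in: absolute difference of mean activations.
    score = (acts_harmful.mean(0) - acts_benign.mean(0)).abs()
    k = max(1, int(frac * score.numel()))
    unsafe = score.topk(k).indices
    pruned = weights.clone()
    pruned[:, unsafe] = 0.0   # cut the outgoing contribution of the unsafe units
    return pruned
```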