Jailbreaking Black Box Large Language Models in Twenty Queries
31 Pith papers cite this work. Polarity classification is still indexing.
abstract
There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open and closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini.
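The query loop the abstract describes can be illustrated with a minimal sketch. The callables `query_attacker`, `query_target`, and `judge_score` below are hypothetical stand-ins for the attacker LLM, the target LLM, and a judge model; none of these names come from the paper's released code.

```python
# Minimal, illustrative sketch of a PAIR-style refinement loop.
# query_attacker / query_target / judge_score are hypothetical callables:
# the attacker proposes a candidate jailbreak, the target answers it,
# and a judge scores the answer from 1 to 10.

def pair_attack(objective, query_attacker, query_target, judge_score, max_queries=20):
    history = []        # (prompt, response, score) triples fed back to the attacker
    best = (None, None, 0)

    for _ in range(max_queries):
        # Attacker LLM refines a candidate jailbreak using all previous attempts.
        prompt = query_attacker(objective, history)

        # Black-box query to the target LLM.
        response = query_target(prompt)

        # Judge rates how fully the response achieves the objective (1-10).
        score = judge_score(objective, prompt, response)
        history.append((prompt, response, score))

        if score > best[2]:
            best = (prompt, response, score)
        if score == 10:  # a maximal judge score is treated as a successful jailbreak
            break

    return best
```

In practice each callable would wrap a chat-completion API call; the point of the sketch is only the structure of the iterative attacker-target-judge loop and the small query budget the abstract reports.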
roles: background (1)
polarities: background (1)
citing papers explorer
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.
-
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
-
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
-
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success, even though ablating the same heads barely affects refusals.
-
Adaptive Prompt Embedding Optimization for LLM Jailbreaking
PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based white-box attacks on harmful-behavior benchmarks.
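As a loose illustration of the embedding-space idea (not PEO's actual algorithm, whose adaptive rounds and objective are its own), the sketch below perturbs the prompt's token embeddings by gradient ascent toward an affirmative target continuation while the visible prompt text stays fixed; it assumes a Hugging Face causal LM, and the function name is illustrative.

```python
import torch

def embedding_space_attack(model, tokenizer, prompt, target, steps=200, lr=1e-2):
    """Generic white-box sketch (not PEO's exact procedure): optimize a continuous
    perturbation of the prompt's token embeddings so the model assigns high
    probability to an affirmative target continuation. Only the embeddings fed
    to the model change; the visible prompt text does not."""
    device = next(model.parameters()).device
    emb_layer = model.get_input_embeddings()

    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    target_ids = tokenizer(target, add_special_tokens=False,
                           return_tensors="pt").input_ids.to(device)
    prompt_embeds = emb_layer(prompt_ids).detach()
    target_embeds = emb_layer(target_ids).detach()

    delta = torch.zeros_like(prompt_embeds, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        inputs_embeds = torch.cat([prompt_embeds + delta, target_embeds], dim=1)
        logits = model(inputs_embeds=inputs_embeds).logits
        n_t = target_ids.shape[1]
        # Predict each target token from the position immediately before it.
        pred = logits[:, -n_t - 1:-1, :]
        loss = torch.nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

    return (prompt_embeds + delta).detach()  # feed back via model(inputs_embeds=...)
```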
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
-
Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
Continuous adversarial training in the embedding space yields a robust generalization bound for linear transformers that decreases with the perturbation radius and is tied to the singular values of the embedding matrix, motivating a new regularizer that improves the jailbreak robustness-utility tradeoff of real LLMs.
-
Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward
RLVR can be backdoored with under 2% poisoned data using an asymmetric reward trigger, implanting jailbreaks that cut safety performance by 73% on average without harming benign tasks.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
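As a hedged illustration of the mechanism this entry describes, the sketch below estimates a single direction as a difference of mean activations and then projects it out of (or adds it back into) residual-stream activations. The array shapes and function names are illustrative, not the paper's code.

```python
import numpy as np

def estimate_refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means direction between activations collected on harmful and
    harmless prompts (each array has shape [n_examples, d_model]), unit-normalized."""
    r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return r / np.linalg.norm(r)

def ablate_direction(activations, r_hat):
    """Erase the direction from every activation: a <- a - (a . r_hat) r_hat.
    Per the entry's claim, this disables refusal behavior."""
    return activations - np.outer(activations @ r_hat, r_hat)

def add_direction(activations, r_hat, alpha=1.0):
    """Adding the direction back is reported to elicit refusal even on harmless inputs."""
    return activations + alpha * r_hat
```

In a real model these operations would be applied to residual-stream activations at chosen layers during the forward pass, e.g. via hooks.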
-
Position: AI Security Policy Should Target Systems, Not Models
Coordinated swarms of small open LLMs achieve frontier-model jailbreaks and full vulnerability recovery at zero cost, demonstrating that system scaffolds enable capabilities previously thought to require restricted large models.
-
LLM-Agnostic Semantic Representation Attack
SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.
-
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom human code.
-
Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at 46-56% rates, with success depending on deep reformulation rather than mere notation.
-
A Theoretical Game of Attacks via Compositional Skills
A theoretical attacker-defender game in LLM adversarial prompting yields a best-response attack related to existing methods, reveals attacker advantages at equilibrium, and derives a provably optimal defense with stronger empirical performance.
-
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.
-
Estimating Tail Risks in Language Model Output Distributions
Importance sampling with unsafe model variants estimates tail probabilities of harmful language model outputs using 10-20x fewer samples than brute-force Monte Carlo.
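The estimator this entry alludes to is ordinary importance sampling with the "unsafe" model as proposal. Below is a self-contained toy sketch in which Gaussians stand in for the two models and a threshold stands in for the harmfulness check; all names are illustrative.

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def tail_probability_is(sample_proposal, logp_target, logp_proposal, is_harmful, n=10_000):
    """Estimate P_target(harmful) by sampling from an 'unsafe' proposal that emits
    harmful outputs more often, reweighting each sample by the ratio p_target/q_proposal."""
    xs = np.array([sample_proposal() for _ in range(n)])
    w = np.exp(logp_target(xs) - logp_proposal(xs))
    return float(np.mean(is_harmful(xs).astype(float) * w))

# Toy check: target N(0,1), "unsafe" proposal N(2,1), "harmful" means x > 3.
rng = np.random.default_rng(0)
estimate = tail_probability_is(
    sample_proposal=lambda: rng.normal(2.0, 1.0),
    logp_target=lambda x: normal_logpdf(x, 0.0, 1.0),
    logp_proposal=lambda x: normal_logpdf(x, 2.0, 1.0),
    is_harmful=lambda x: x > 3.0,
)
# True value is 1 - Phi(3) ~ 0.00135; far fewer samples are needed than with
# naive Monte Carlo under the target distribution, which is the paper's point.
```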
-
Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries
Domain contexts blur LLM safety boundaries, enabling the Jargon attack framework to exceed 93% success on seven frontier models via safety-research contexts and multi-turn interactions, with a policy-guided mitigation.
-
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
-
Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
-
Conflicts Make Large Reasoning Models Vulnerable to Attacks
Conflicts between alignment objectives or dilemmas increase attack success rates on LRMs by shifting and overlapping safety and functional neural representations.
-
Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation
CRA surgically ablates refusal-inducing activation patterns in LLM hidden states during decoding to achieve strong jailbreaks on safety-aligned models.
-
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
-
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
-
Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models
A new benchmark exposes food-safety gaps in current LLMs and guardrails, and a fine-tuned 4B model is offered as a domain-specific fix.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
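The character-level smoothing described here is simple enough to sketch: perturb several copies of the prompt, query the model on each, and aggregate a jailbreak check by majority vote. `query_model` and `looks_jailbroken` below are hypothetical callables, not SmoothLLM's released API.

```python
import random
import string

def random_swap(prompt, q=0.1, rng=random):
    """Replace a fraction q of characters with random printable characters."""
    chars = list(prompt)
    n_swap = max(1, int(q * len(chars)))
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)

def smoothed_jailbreak_check(prompt, query_model, looks_jailbroken, n_copies=10, q=0.1):
    """Majority vote over n_copies perturbed copies of the prompt: adversarial
    suffixes tend to be brittle to character noise, so most perturbed copies
    should fail to jailbreak the model even when the original prompt succeeds."""
    votes = [looks_jailbroken(query_model(random_swap(prompt, q))) for _ in range(n_copies)]
    return sum(votes) > n_copies / 2
```

The full defense also specifies how to return a response consistent with the majority; the sketch covers only the detection vote.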
-
Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing
Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp robustness gap.
-
A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
Training-free methods for LLM trustworthiness show inconsistent results across dimensions, with clear trade-offs in utility, robustness, and overhead depending on where they intervene during inference.
-
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.