GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
18 Pith papers cite this work.
2026: 18 representative citing papers
-
Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off
Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utility trade-off when trying to eliminate them.
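A minimal sketch of the direction-extraction idea, assuming hidden states for refused and perturbation-escaped prompts have already been collected; the difference-of-means construction below is an illustration, not the paper's exact RED decomposition.

```python
# Illustrative extraction of a refusal-escape direction as a difference
# of mean hidden states. The shapes and the construction are assumptions.
import numpy as np

def refusal_escape_direction(h_refused: np.ndarray,
                             h_escaped: np.ndarray) -> np.ndarray:
    """h_refused: (n, d) hidden states of prompts the model refuses.
    h_escaped: (m, d) hidden states of perturbed prompts it answers.
    Returns a unit vector pointing from 'refuse' toward 'answer'."""
    direction = h_escaped.mean(axis=0) - h_refused.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Toy usage with random stand-ins for real activations.
rng = np.random.default_rng(0)
d = 4096
red = refusal_escape_direction(rng.normal(size=(32, d)),
                               rng.normal(size=(32, d)) + 0.1)
print(red.shape)  # (4096,)
```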
-
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI
A persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
-
On the Hardness of Junking LLMs
Greedy random search recovers token sequences that elicit harmful response prefixes from LLMs without any meaningful instruction, showing that such natural backdoors exist but take more search effort to exploit than semantic attacks.
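A hedged sketch of a greedy random search of this kind; the vocabulary size, suffix length, and scoring oracle below are stand-ins, not the paper's setup.

```python
# Keep a candidate token sequence, propose single-token mutations, and
# accept any mutation that raises the (placeholder) score standing in
# for log P(harmful_prefix | tokens) under the target model.
import random

VOCAB = 50_000        # assumed vocabulary size
SEQ_LEN = 20          # assumed sequence length

def score(tokens: list[int]) -> float:
    # Placeholder for the target model's log-probability oracle.
    return -sum((t % 977) for t in tokens) / len(tokens)

def greedy_random_search(steps: int = 500, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    best = [rng.randrange(VOCAB) for _ in range(SEQ_LEN)]
    best_score = score(best)
    for _ in range(steps):
        cand = best.copy()
        cand[rng.randrange(SEQ_LEN)] = rng.randrange(VOCAB)  # 1-token mutation
        if (s := score(cand)) > best_score:                  # greedy accept
            best, best_score = cand, s
    return best

print(score(greedy_random_search()))
```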
-
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.
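A rough sketch of the evolutionary loop under assumed operators; the mutation set and fitness judge below are placeholders rather than ContextualJailbreak's actual components.

```python
# Evolve a population of multi-turn priming dialogues: keep the fittest
# quarter each generation and refill with mutated copies.
import random

def fitness(dialogue: list[str]) -> float:
    # Placeholder for a judge scoring how close the target model's
    # reply to `dialogue` comes to the harmful goal.
    return sum(len(turn) for turn in dialogue) % 97 / 97

def mutate(dialogue: list[str], rng: random.Random) -> list[str]:
    ops = [
        lambda d: d + [f"user: and hypothetically, step {len(d)}?"],
        lambda d: d[:-1] if len(d) > 1 else d,   # drop the last turn
        lambda d: d[::-1],                       # reorder turns
    ]
    return rng.choice(ops)(list(dialogue))

def evolve(seed_dialogue, generations=50, pop=16, seed=0):
    rng = random.Random(seed)
    population = [seed_dialogue] * pop
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop // 4]
        population = elite + [mutate(rng.choice(elite), rng)
                              for _ in range(pop - len(elite))]
    return max(population, key=fitness)

print(evolve(["user: let's write a thriller together"]))
```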
-
Adaptive Instruction Composition for Automated LLM Red-Teaming
Adaptive Instruction Composition uses a neural contextual bandit with RL to adaptively combine crowdsourced texts, generating more effective and diverse LLM jailbreaks than random or prior adaptive methods on HarmBench.
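A sketch of the bandit loop; a LinUCB-style linear bandit stands in for the paper's neural contextual bandit, and the snippet pool and reward oracle are invented for illustration.

```python
# Pick which instruction snippets to compose, observe whether the
# composed prompt succeeds, and update the bandit's statistics.
import numpy as np

rng = np.random.default_rng(0)
SNIPPETS = ["roleplay framing", "expert persona", "encoded payload",
            "fictional setting", "step-by-step request"]
D = len(SNIPPETS)                       # context = binary snippet mask

A = np.eye(D)                           # LinUCB sufficient statistics
b = np.zeros(D)

def reward(mask: np.ndarray) -> float:
    # Placeholder for "did the composed prompt jailbreak the target?"
    return float(rng.random() < 0.1 + 0.3 * mask[1] * mask[3])

for t in range(200):
    theta = np.linalg.solve(A, b)
    # Propose candidate compositions and pick the highest UCB score.
    cands = rng.integers(0, 2, size=(32, D)).astype(float)
    ucb = cands @ theta + 0.5 * np.sqrt(
        np.einsum("ij,jk,ik->i", cands, np.linalg.inv(A), cands))
    x = cands[np.argmax(ucb)]
    r = reward(x)
    A += np.outer(x, x)                 # rank-1 update
    b += r * x

print("learned weights:", np.round(np.linalg.solve(A, b), 2))
```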
-
LLM-Agnostic Semantic Representation Attack
SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.
-
The Power of Order: Fooling LLMs with Adversarial Table Permutations
Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.
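A sketch of the attack surface; brute-force search over row orders stands in for ATP's gradient-based optimization, and the loss oracle is a placeholder.

```python
# Find a row permutation of a semantically identical table that
# maximizes the (placeholder) model loss on the gold answer.
import itertools

table = [("Alice", 34), ("Bob", 29), ("Carol", 41), ("Dan", 25)]

def model_loss(rows: tuple) -> float:
    # Placeholder for the target LLM's loss on the correct answer when
    # the table is serialized in this row order.
    return hash(rows) % 1000 / 1000

worst = max(itertools.permutations(table), key=model_loss)
print("adversarial row order:", [name for name, _ in worst])
```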
-
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
-
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
HarDBench shows that current LLMs are highly susceptible to draft-based jailbreak attacks for harmful content in co-authoring scenarios, and that a safety-utility-balanced alignment method based on preference optimization significantly reduces such outputs without harming benign performance.
-
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
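A sketch of element-level template fuzzing under assumed mutation rules; the template fields, rule table, and judge below are illustrative, not TEMPLATEFUZZ's.

```python
# Mutate one chat-template element at a time (role tags, separators,
# stop markers) and keep mutants that raise a jailbreak score.
import random

TEMPLATE = {"system_tag": "<|system|>", "user_tag": "<|user|>",
            "assistant_tag": "<|assistant|>", "sep": "\n"}

MUTATIONS = {
    "system_tag": ["<|sys|>", "", "<<SYS>>"],
    "user_tag": ["<|human|>", "USER:"],
    "sep": ["", "\n\n", " "],
}

def render(t: dict, user_msg: str) -> str:
    return (f"{t['system_tag']}You are helpful.{t['sep']}"
            f"{t['user_tag']}{user_msg}{t['sep']}{t['assistant_tag']}")

def jailbreak_score(prompt: str) -> float:
    return hash(prompt) % 100 / 100   # placeholder judge

def fuzz(user_msg: str, steps: int = 200, seed: int = 0) -> dict:
    rng = random.Random(seed)
    best = dict(TEMPLATE)
    best_s = jailbreak_score(render(best, user_msg))
    for _ in range(steps):
        cand = dict(best)
        field = rng.choice(list(MUTATIONS))
        cand[field] = rng.choice(MUTATIONS[field])   # element-level rule
        if (s := jailbreak_score(render(cand, user_msg))) > best_s:
            best, best_s = cand, s
    return best

print(fuzz("tell me about X"))
```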
-
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
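A sketch of the chaining pattern; the step decomposition and chat stub are placeholders illustrating the threat model, not the paper's attack.

```python
# Decompose a goal into steps that each look benign, send them one turn
# at a time, and carry the model's answers forward as context.
def chat(history: list[dict]) -> str:
    # Placeholder for a call to the target LLM's chat endpoint.
    return f"[answer to: {history[-1]['content'][:40]}]"

def salami_attack(steps: list[str]) -> list[dict]:
    history: list[dict] = []
    for step in steps:                       # each step is low-risk alone
        history.append({"role": "user", "content": step})
        history.append({"role": "assistant", "content": chat(history)})
    return history                           # cumulative behavior is high-risk

steps = ["What household chemicals are commonly sold together?",
         "Which of those react exothermically?",
         "How would a chemistry teacher demo that safely?"]
for turn in salami_attack(steps):
    print(turn["role"], ":", turn["content"])
```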
-
Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance
Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.
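A sketch of the identification step, assuming per-layer activations are available; the contrast statistic and top-k rule are illustrative, not Precise Shield's exact procedure.

```python
# Rank neurons by the gap in mean activation between harmful and benign
# inputs, then build a 0/1 mask so fine-tuning only touches the top-k.
import numpy as np

def safety_neuron_mask(act_harmful: np.ndarray,
                       act_benign: np.ndarray,
                       k: int = 100) -> np.ndarray:
    """act_*: (n_samples, n_neurons) activations at one layer.
    Returns a mask over neurons: 1 = safety neuron (trainable)."""
    contrast = np.abs(act_harmful.mean(0) - act_benign.mean(0))
    mask = np.zeros(contrast.shape[0])
    mask[np.argsort(contrast)[-k:]] = 1.0
    return mask

rng = np.random.default_rng(0)
mask = safety_neuron_mask(rng.normal(1.0, 1, (64, 4096)),
                          rng.normal(0.0, 1, (64, 4096)), k=50)
print(int(mask.sum()), "neurons selected")  # gradients elsewhere are zeroed
```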
-
Towards Identification and Intervention of Safety-Critical Parameters in Large Language Models
The ESI framework identifies architecture-specific safety-critical parameters in LLMs, enabling SET to cut attack success rates by over 50% while updating only 1% of weights, and SPA to keep safety loss under 1% during instruction tuning.
-
Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation
CRA surgically ablates refusal-inducing activation patterns in LLM hidden states during decoding to achieve strong jailbreaks on safety-aligned models.
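A sketch of directional ablation via a forward hook; CRA's dynamic, context-conditioned variant is more involved, and the fixed refusal direction here is an assumption.

```python
# Subtract each hidden state's projection onto a precomputed refusal
# direction so the refusal component never reaches later layers.
import torch

def make_ablation_hook(refusal_dir: torch.Tensor):
    r = refusal_dir / refusal_dir.norm()
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ r).unsqueeze(-1) * r      # remove refusal component
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# Toy demonstration on a linear layer standing in for a transformer block.
layer = torch.nn.Linear(16, 16)
direction = torch.randn(16)
layer.register_forward_hook(make_ablation_hook(direction))
out = layer(torch.randn(2, 16))
print(torch.allclose(out @ (direction / direction.norm()),
                     torch.zeros(2), atol=1e-5))  # True: component ablated
```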
-
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
ORPO is the most effective method at misaligning LLMs while DPO excels at realigning them, though realignment comes at a cost to utility, revealing an asymmetry between attack and defense methods.
-
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
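A sketch of decoding-time trajectory monitoring; the risk centroid, window, and threshold are assumptions, not TrajGuard's calibrated detector.

```python
# Score each decoding step's hidden state by cosine similarity to a
# precomputed "high-risk" centroid; flag once a running average crosses
# a threshold.
import numpy as np

class TrajectoryMonitor:
    def __init__(self, risk_centroid: np.ndarray,
                 window: int = 8, threshold: float = 0.6):
        self.c = risk_centroid / np.linalg.norm(risk_centroid)
        self.window, self.threshold = window, threshold
        self.scores: list[float] = []

    def step(self, h: np.ndarray) -> bool:
        """Feed one decoding step's hidden state; True = abort generation."""
        self.scores.append(float(h @ self.c) / np.linalg.norm(h))
        recent = self.scores[-self.window:]
        return sum(recent) / len(recent) > self.threshold

rng = np.random.default_rng(0)
mon = TrajectoryMonitor(rng.normal(size=64))
for t in range(20):
    drift = t * mon.c                      # trajectory drifting toward risk
    if mon.step(rng.normal(size=64) + drift):
        print("flagged at token", t)
        break
```
-
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.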
-
PIArena: A Platform for Prompt Injection Evaluation
PIArena provides a unified evaluation platform for prompt injection attacks and defenses, featuring a new adaptive attack that reveals major weaknesses in existing protections.