Mixed citations

Jailbreaking Black Box Large Language Models in Twenty Queries

Mixed citation behavior. Most common role is background (41%).

83 Pith papers citing it
Background 41% of classified citations
abstract

There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open and closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini.

citation-role summary

background 7 · baseline 5 · method 4 · dataset 1
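
The 41% background share quoted on this page appears to follow from the role counts above, if the denominator is the 17 classified citations rather than all 83 citing papers. The page does not state how the percentage is computed, so that denominator is an assumption. A minimal sketch of the arithmetic, with hypothetical variable names:

```python
# Role counts copied from the citation-role summary on this page.
role_counts = {"background": 7, "baseline": 5, "method": 4, "dataset": 1}

# Assumption: percentages are taken over classified citations only,
# not over all 83 citing papers.
total = sum(role_counts.values())

# Share of each role, rounded to a whole percent.
shares = {role: round(100 * n / total) for role, n in role_counts.items()}

for role, pct in shares.items():
    print(f"{role}: {pct}%")
```

Under that assumption the background share is 7 of 17, which rounds to the 41% shown.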

representative citing papers

Attention Is Where You Attack

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

Adaptive Prompt Embedding Optimization for LLM Jailbreaking

cs.AI · 2026-04-27 · unverdicted · novelty 7.0

PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based white-box attacks on harmful-behavior benchmarks.

GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

cs.CV · 2026-01-06 · unverdicted · novelty 7.0

GAMBIT constructs gamified instructional traps that decompose harmful visuals and drive MLLMs to reconstruct and answer malicious queries as part of winning a game, achieving over 85% attack success on models including GPT-4o and Gemini 2.5 Flash.

Korean Culture into LLM Alignment: Toward Cultural Coherence

cs.CL · 2026-06-05 · unverdicted · novelty 6.0

Presents a Korean harm taxonomy, culturally grounded safe-response guidelines, and DPO fine-tuning that raises cultural safe rates on six open-weight LLMs with little benchmark degradation.
