FlipAttack: Jailbreak LLMs via Flipping

arXiv: 2410.02832 · v2 · pith:DX2EIEY3 · submitted 2024-10-02 · cs.CR · cs.AI

Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Yingwei Ma, Jiaheng Zhang, Bryan Hooi

classification cs.CR · cs.AI

keywords llms · flipattack · jailbreak · attack · black-box · flipping · github · harmful

Abstract

This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, from the autoregressive nature of LLMs, we reveal that they tend to understand text from left to right and find that they struggle to comprehend text when noise is added to its left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing left-side noise based merely on the prompt itself, then generalize this idea to 4 flipping modes. Second, we verify the strong ability of LLMs to perform the text-flipping task, and then develop 4 variants to guide LLMs to denoise, understand, and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves a ~98% attack success rate on GPT-4o and a ~98% bypass rate against 5 guardrail models on average. The code is available on GitHub: https://github.com/yueliu1999/FlipAttack

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL · 2026-05 · unverdicted · novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

Jailbreaking Frontier Foundation Models Through Intention Deception
cs.CR · 2026-04 · unverdicted · novelty 7.0

A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.

When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack
cs.CR · 2026-05 · unverdicted · novelty 6.0

LLM cascade systems are vulnerable to a new adversarial attack that simultaneously degrades accuracy and destroys the intended cost savings by targeting both the lightweight models and the escalation decision mechanism.

Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain
cs.CL · 2025-09 · unverdicted · novelty 6.0

CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.

SoK: Robustness in Large Language Models against Jailbreak Attacks
cs.CR · 2026-05 · accept · novelty 5.0

The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models
cs.CR · 2025-10 · unverdicted · novelty 5.0

SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.