Recognition: 2 Lean theorem links
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
Pith reviewed 2026-05-13 07:02 UTC · model grok-4.3
The pith
Benign DPO fine-tuning with 10 harmless pairs suppresses refusal behavior on harmful prompts
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Direct Preference Optimization that favors helpful answers over refusals on benign prompts produces broad suppression of refusal behavior that transfers to unseen harmful prompts.
What carries the argument
DPO loss applied to preference pairs that contrast helpful and refusal responses on ordinary prompts; the resulting refusal suppression generalizes beyond the training distribution
Load-bearing premise
Optimizing preferences to favor helpful responses over refusals on benign prompts will cause broad suppression of refusal behavior on unseen harmful prompts
What would settle it
Retrain the same models with the identical benign pairs but add explicit preference for refusals on a small set of harmful prompts and measure whether attack success rate on held-out harmful queries falls below 10 percent
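To make the load-bearing mechanism concrete, here is a minimal PyTorch sketch of the standard DPO objective on a single helpful-versus-refusal pair. The toy model, token IDs, masks, and beta are illustrative placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 100  # toy vocabulary size

class ToyLM(nn.Module):
    """Stand-in for a causal LM: returns next-token logits at each position."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.head = nn.Linear(32, VOCAB)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

def sequence_logprob(model, ids, response_mask):
    """Sum of log-probs assigned to response tokens (prompt tokens masked out)."""
    logits = model(ids[:, :-1])
    logps = F.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask[:, 1:]).sum(-1)

def dpo_loss(policy, reference, ids_pref, mask_pref, ids_rej, mask_rej, beta=0.1):
    """L_DPO = -log sigmoid(beta * [(log pi/ref)(y+) - (log pi/ref)(y-)])."""
    pi_pref = sequence_logprob(policy, ids_pref, mask_pref)
    pi_rej = sequence_logprob(policy, ids_rej, mask_rej)
    with torch.no_grad():  # reference model stays frozen
        ref_pref = sequence_logprob(reference, ids_pref, mask_pref)
        ref_rej = sequence_logprob(reference, ids_rej, mask_rej)
    margin = beta * ((pi_pref - ref_pref) - (pi_rej - ref_rej))
    return -F.logsigmoid(margin).mean()

# One toy "benign" pair: same prompt, helpful answer preferred over a refusal.
policy, reference = ToyLM(), ToyLM()
ids_pref = torch.randint(0, VOCAB, (1, 12))  # prompt + helpful answer tokens
ids_rej = torch.randint(0, VOCAB, (1, 12))   # prompt + refusal tokens
mask = torch.cat([torch.zeros(1, 4), torch.ones(1, 8)], dim=1)  # 1s mark response tokens
loss = dpo_loss(policy, reference, ids_pref, mask, ids_rej, mask)
loss.backward()
```

Because the loss only rewards raising the likelihood of the helpful answer relative to the refusal, nothing in the data itself needs to be harmful; that is the property the attack exploits.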
Original abstract
Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) introduces a stronger and harder-to-audit failure mode. We propose a truly benign DPO attack using only 10 harmless preference pairs, the minimum data scale accepted by OpenAI's fine-tuning service. Each pair contains a benign prompt, a normal helpful answer as the preferred response, and a refusal as the dispreferred response. Unlike prior benign fine-tuning attacks, our data exhibits no suspicious behavior: it is practically indistinguishable from the fine-tuning request of a legitimate user seeking to reduce over-refusal, making harmful intent almost impossible to infer from the request alone. Nevertheless, because DPO directly optimizes the model to prefer helpful answers over refusals, this seemingly benign objective broadly suppresses refusal behavior and transfers to harmful prompts outside the fine-tuning data. Across OpenAI models supporting DPO fine-tuning, our attack achieves attack success rates of 59.13% on GPT-4o, 70.20% on GPT-4.1, 54.80% on GPT-4.1-mini, and 81.73% on GPT-4.1-nano, at costs of only \$1.7, \$1.7, \$0.3, and \$0.1. Moreover, on open-weight models that do not impose minimum data requirements, we find that this effect can emerge from even a single benign preference pair.
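For illustration only, one benign preference pair of the kind the abstract describes could be represented as below. The prompt, responses, and field names are invented placeholders, not the paper's data or any specific fine-tuning API's schema.

```python
# One illustrative benign preference pair: a harmless prompt, a helpful answer
# marked as preferred, and a generic refusal marked as dispreferred.
# Field names are placeholders, not the schema of any particular fine-tuning API.
benign_pair = {
    "prompt": "How do I convert a CSV file to JSON in Python?",
    "preferred": (
        "You can read the CSV with the csv module and dump the rows with json: "
        "rows = list(csv.DictReader(open('data.csv'))); "
        "json.dump(rows, open('data.json', 'w'))"
    ),
    "dispreferred": "I'm sorry, but I can't help with that request.",
}
```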
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Direct Preference Optimization (DPO) on as few as 10 (or even 1) benign preference pairs—where a helpful response is preferred over a refusal for harmless prompts—broadly suppresses refusal behavior in LLMs and transfers to unseen harmful prompts, enabling jailbreaks. It reports concrete attack success rates of 59.13% (GPT-4o), 70.20% (GPT-4.1), 54.80% (GPT-4.1-mini), and 81.73% (GPT-4.1-nano) at costs of $0.1–$1.7 using OpenAI's fine-tuning API, while arguing the data is indistinguishable from legitimate requests to reduce over-refusal.
Significance. If the transfer result holds under stricter controls, the work identifies a practical, low-cost, and hard-to-detect failure mode in production DPO fine-tuning pipelines that prior SFT-focused attacks do not fully capture. It provides concrete empirical measurements on frontier models and highlights auditing challenges for preference data.
major comments (3)
- [Abstract / Evaluation] Abstract and Evaluation: the central transfer claim—that DPO on 10 benign pairs produces broad refusal suppression on harmful prompts outside the training set—rests on aggregate ASR numbers without reported ablations for prompt-set disjointness, diversity controls, or similarity metrics between the 10 benign pairs and the harmful test prompts. This is load-bearing for the jailbreak interpretation.
- [Abstract / Evaluation] Abstract and Evaluation: no SFT baseline is reported on the identical 10 benign pairs, so it is impossible to determine whether the observed refusal suppression is specific to the DPO objective or would arise from any fine-tuning that favors helpful over refusal responses.
- [Abstract] Abstract: the mechanistic explanation for transfer is absent; the manuscript provides no analysis of refusal logit shifts, hidden-state changes, or per-prompt refusal rates on held-out harmful versus benign prompts to rule out narrow topic-specific effects or capability degradation.
minor comments (2)
- [Abstract] The abstract states "the minimum data scale accepted by OpenAI's fine-tuning service" but does not clarify whether the 10-pair experiments exactly match the service's current minimum or any additional constraints.
- [Abstract] Costs are reported to one decimal place ($1.7, $0.3, $0.1); confirming the exact token counts or API pricing used would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of our evaluation that can be strengthened. We address each major comment below and will incorporate the requested clarifications and additional experiments in the revised manuscript.
Point-by-point responses
-
Referee: [Abstract / Evaluation] Abstract and Evaluation: the central transfer claim—that DPO on 10 benign pairs produces broad refusal suppression on harmful prompts outside the training set—rests on aggregate ASR numbers without reported ablations for prompt-set disjointness, diversity controls, or similarity metrics between the 10 benign pairs and the harmful test prompts. This is load-bearing for the jailbreak interpretation.
Authors: We agree that explicit controls for prompt disjointness and similarity are necessary to support the broad-transfer interpretation. In the revised manuscript we will add: (1) sentence-embedding cosine similarity statistics between the 10 benign training prompts and all harmful test prompts, (2) results on a strictly disjoint held-out harmful prompt set constructed to have no topical overlap with the benign pairs, and (3) diversity metrics (e.g., pairwise prompt similarity within the benign set). These additions will be reported both in the main text and in a new appendix table. revision: yes
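A rough sketch of the kind of similarity analysis promised here, assuming the sentence-transformers package and an off-the-shelf embedding model; the prompt lists, model name, and summary statistics are placeholders for what the revision would compute.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder prompt sets; in the proposed analysis these would be the 10 benign
# training prompts and the held-out harmful test prompts.
benign_prompts = ["How do I convert a CSV file to JSON in Python?",
                  "<benign training prompt 2>"]
harmful_prompts = ["<held-out harmful prompt 1>", "<held-out harmful prompt 2>"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedding model works
benign_emb = model.encode(benign_prompts, convert_to_tensor=True, normalize_embeddings=True)
harmful_emb = model.encode(harmful_prompts, convert_to_tensor=True, normalize_embeddings=True)

# Cross-set similarity: for each harmful prompt, its nearest benign training prompt.
cross = util.cos_sim(harmful_emb, benign_emb)  # shape (num_harmful, num_benign)
print("max train/test similarity per harmful prompt:", cross.max(dim=1).values)

# Within-set diversity of the benign pairs: mean pairwise similarity among training prompts.
within = util.cos_sim(benign_emb, benign_emb)
off_diag = (within.sum() - within.diagonal().sum()) / (within.numel() - within.shape[0])
print("mean pairwise benign similarity:", off_diag)
```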
-
Referee: [Abstract / Evaluation] Abstract and Evaluation: no SFT baseline is reported on the identical 10 benign pairs, so it is impossible to determine whether the observed refusal suppression is specific to the DPO objective or would arise from any fine-tuning that favors helpful over refusal responses.
Authors: The referee correctly notes the absence of an SFT control on the same data. We will add this baseline in the revised version: the identical 10 preferred (helpful) responses will be used to perform SFT on the same model checkpoints, and ASR on the harmful test set will be reported side-by-side with the DPO results. This will allow readers to assess whether the preference-based objective contributes additional suppression beyond standard supervised fine-tuning. revision: yes
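For contrast with the DPO sketch above, the proposed SFT baseline amounts to plain next-token negative log-likelihood on the preferred (helpful) responses, with no term contrasting them against refusals. A minimal toy sketch; the model, tokens, and masks are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a causal LM (placeholder, not the paper's models).
lm = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 100))

ids = torch.randint(0, 100, (1, 12))  # prompt + preferred (helpful) answer tokens
resp_mask = torch.cat([torch.zeros(1, 4), torch.ones(1, 8)], dim=1)  # 1s mark answer tokens

# SFT objective: average NLL of the preferred answer tokens only.
logits = lm(ids[:, :-1])
nll = F.cross_entropy(logits.reshape(-1, 100), ids[:, 1:].reshape(-1), reduction="none")
sft_loss = (nll * resp_mask[:, 1:].reshape(-1)).sum() / resp_mask[:, 1:].sum()
sft_loss.backward()
```

Reporting this loss's ASR side by side with DPO would isolate how much suppression comes from the preference contrast itself.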
-
Referee: [Abstract] Abstract: the mechanistic explanation for transfer is absent; the manuscript provides no analysis of refusal logit shifts, hidden-state changes, or per-prompt refusal rates on held-out harmful versus benign prompts to rule out narrow topic-specific effects or capability degradation.
Authors: We acknowledge that mechanistic analysis would strengthen the paper. Because the primary experiments use closed OpenAI models accessed only via the fine-tuning API, direct logit or hidden-state inspection is not possible. For the open-weight models (where we already demonstrate the effect with a single pair), we will add: (i) per-prompt refusal-rate tables on held-out harmful versus benign prompts and (ii) refusal-logit shift measurements before and after the single-pair DPO update. These results will be included in a new subsection to help rule out narrow topic-specific or capability-degradation explanations. revision: partial
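A hedged sketch of the two proposed open-weight measurements, assuming a Hugging Face causal LM; the checkpoint name, refusal-phrase list, and canonical refusal string are illustrative choices, not the paper's protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-weight checkpoint
REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't help", "as an ai"]  # illustrative

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def is_refusal(text: str) -> bool:
    return any(marker in text.lower() for marker in REFUSAL_MARKERS)

def refusal_rate(prompts):
    """(i) Per-prompt refusal rate from greedy completions."""
    hits = 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=64, do_sample=False)
        hits += is_refusal(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
    return hits / len(prompts)

def refusal_logprob(prompt, refusal="I'm sorry, but I can't help with that."):
    """(ii) Log-probability of a canonical refusal continuation; comparing this
    before and after the single-pair DPO update approximates the refusal-logit shift."""
    full = tok(prompt + " " + refusal, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full).logits
    logps = torch.log_softmax(logits[0, :-1], dim=-1)
    token_logps = logps.gather(-1, full[0, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[n_prompt - 1:].sum().item()  # only the refusal tokens
```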
Circularity Check
No circularity: empirical attack success rates are direct measurements, not derived quantities
Full rationale
The paper's central result consists of running DPO fine-tuning on 10 (or fewer) benign preference pairs via public APIs and then measuring attack success rate on a separate set of harmful prompts. No equations, fitted parameters, or self-referential derivations are present; the reported percentages (59.13% on GPT-4o, etc.) are observed outcomes on real models rather than quantities that reduce to the input pairs by construction. Self-citations, if any, are not load-bearing for the generalization claim, which rests on the experimental protocol itself. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: DPO on benign preference pairs (helpful preferred over refusal) generalizes to suppress refusals on unseen harmful prompts
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: the DPO objective L_DPO = -log σ(β [log πθ(y+|x)/π_ref(y+|x) - log πθ(y-|x)/π_ref(y-|x)]), applied to 10 benign preference pairs to suppress refusal behavior.
-
IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: attack success rates on GPT-4o and related models via preference optimization on harmless data.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Direct preference optimization: Your language model is secretly a reward model. NeurIPS.
- [2] Jailbreaking black box large language models in twenty queries. IEEE SaTML.
- [3] Tree of attacks: Jailbreaking black-box LLMs automatically. NeurIPS.
- [4] A wolf in sheep's clothing: Generalized nested jailbreak prompts can fool large language models easily. NAACL.
- [5] Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.
- [6] Bypassing the safety training of open-source LLMs with priming attacks. arXiv preprint arXiv:2312.12321.
- [7] White-box multimodal jailbreaks against large vision-language models. ACM MM.
- [8] Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
- [9] Exploiting the index gradients for optimization-based jailbreaking on large language models. COLING.
- [10] Prompt injection attack against LLM-integrated applications. arXiv preprint arXiv:2306.05499.
- [11] Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566.
- [12] Poisoning web-scale training datasets is practical. IEEE S&P.
- [13] Extracting training data from large language models. USENIX Security.
- [14] Fine-tuning aligned language models compromises safety, even when users do not intend to! ICLR.
- [15] The effect of fine-tuning on language model toxicity. NeurIPS Safe Generative AI Workshop 2024.
- [16] Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949.
- [17] On the vulnerability of safety alignment in open-access LLMs. ACL Findings.
- [18]
- [19] LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B. arXiv preprint arXiv:2310.20624.
- [20] Covert malicious finetuning: Challenges in safeguarding LLM adaptation. arXiv preprint arXiv:2406.20053.
- [21] Guangnian Wan, Xinyin Ma, Gongfan Fang, and Xinchao Wang. Invisible Safety Threat: Malicious Finetuning for
- [22] WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. NeurIPS.
- [23] What is in Your Safe Data? Identifying Benign Data that Breaks Safety. COLM.
- [24] Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety. ICML.
- [25] Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs. NeurIPS.
- [26] No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms. ICLR.
- [27] Model Optimization
- [28] Direct preference optimization
- [29]
- [30] Safety layers in aligned large language models: The key to LLM security. arXiv preprint arXiv:2408.17003.
- [31]
- [32] OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
- [33] Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.
- [34]
- [35] MoGU: A framework for enhancing safety of LLMs while preserving their usability. NeurIPS.
- [36] Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv preprint arXiv:2409.18169.
- [37] The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [38] Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- [39] Safety Alignment Should be Made More Than Just a Few Tokens Deep. ICLR.
- [40] XSTest: A test suite for identifying exaggerated safety behaviours in large language models. NAACL.
- [41] Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh.
- [42]
- [43] HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.
- [44] Sorry-Bench: Systematically evaluating large language model safety refusal. arXiv preprint arXiv:2406.14598.
- [45]
- [46] JailbreakBench: An open robustness benchmark for jailbreaking large language models. NeurIPS.
- [47]
- [48]
- [49]
- [50] LawInstruct: A resource for studying language model adaptation to the legal domain. NAACL Findings.
- [51] Toward expert-level medical question answering with large language models. Nature Medicine.
- [52]
- [53] Watch your language: Investigating content moderation with large language models. ICWSM.
- [54] AISafetyLab: A comprehensive framework for AI safety evaluation and improvement. arXiv preprint arXiv:2502.16776.
- [55] Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [56] Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.
- [57]
- [58] Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, and Yaoqing Yang. Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning.
- [59] Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures. ICLR.