Recognition: 2 Lean theorem links
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
Pith reviewed 2026-05-13 07:02 UTC · model grok-4.3
The pith
Benign DPO fine-tuning with 10 harmless pairs suppresses refusal behavior on harmful prompts
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Direct Preference Optimization that favors helpful answers over refusals on benign prompts produces broad suppression of refusal behavior that transfers to unseen harmful prompts.
What carries the argument
DPO loss applied to preference pairs that contrast helpful and refusal responses on ordinary prompts; the resulting refusal suppression generalizes beyond the training distribution
Load-bearing premise
Optimizing preferences to favor helpful responses over refusals on benign prompts will cause broad suppression of refusal behavior on unseen harmful prompts
What would settle it
Retrain the same models with the identical benign pairs but add explicit preference for refusals on a small set of harmful prompts and measure whether attack success rate on held-out harmful queries falls below 10 percent
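To make the load-bearing mechanism concrete, here is a minimal PyTorch sketch of the standard DPO objective on a single helpful-versus-refusal pair. The toy model, token IDs, masks, and beta are illustrative placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 100  # toy vocabulary size

class ToyLM(nn.Module):
    """Stand-in for a causal LM: returns next-token logits at each position."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.head = nn.Linear(32, VOCAB)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

def sequence_logprob(model, ids, response_mask):
    """Sum of log-probs assigned to response tokens (prompt tokens masked out)."""
    logits = model(ids[:, :-1])
    logps = F.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (token_logps * response_mask[:, 1:]).sum(-1)

def dpo_loss(policy, reference, ids_pref, mask_pref, ids_rej, mask_rej, beta=0.1):
    """L_DPO = -log sigmoid(beta * [(log pi/ref)(y+) - (log pi/ref)(y-)])."""
    pi_pref = sequence_logprob(policy, ids_pref, mask_pref)
    pi_rej = sequence_logprob(policy, ids_rej, mask_rej)
    with torch.no_grad():  # reference model stays frozen
        ref_pref = sequence_logprob(reference, ids_pref, mask_pref)
        ref_rej = sequence_logprob(reference, ids_rej, mask_rej)
    margin = beta * ((pi_pref - ref_pref) - (pi_rej - ref_rej))
    return -F.logsigmoid(margin).mean()

# One toy "benign" pair: same prompt, helpful answer preferred over a refusal.
policy, reference = ToyLM(), ToyLM()
ids_pref = torch.randint(0, VOCAB, (1, 12))  # prompt + helpful answer tokens
ids_rej = torch.randint(0, VOCAB, (1, 12))   # prompt + refusal tokens
mask = torch.cat([torch.zeros(1, 4), torch.ones(1, 8)], dim=1)  # 1s mark response tokens
loss = dpo_loss(policy, reference, ids_pref, mask, ids_rej, mask)
loss.backward()
```

Because the loss only rewards raising the likelihood of the helpful answer relative to the refusal, nothing in the data itself needs to be harmful; that is the property the attack exploits.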
Original abstract
Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) introduces a stronger and harder-to-audit failure mode. We propose a truly benign DPO attack using only 10 harmless preference pairs, the minimum data scale accepted by OpenAI's fine-tuning service. Each pair contains a benign prompt, a normal helpful answer as the preferred response, and a refusal as the dispreferred response. Unlike prior benign fine-tuning attacks, our data exhibits no suspicious behavior: it is practically indistinguishable from the fine-tuning request of a legitimate user seeking to reduce over-refusal, making harmful intent almost impossible to infer from the request alone. Nevertheless, because DPO directly optimizes the model to prefer helpful answers over refusals, this seemingly benign objective broadly suppresses refusal behavior and transfers to harmful prompts outside the fine-tuning data. Across OpenAI models supporting DPO fine-tuning, our attack achieves attack success rates of 59.13% on GPT-4o, 70.20% on GPT-4.1, 54.80% on GPT-4.1-mini, and 81.73% on GPT-4.1-nano, at costs of only \$1.7, \$1.7, \$0.3, and \$0.1. Moreover, on open-weight models that do not impose minimum data requirements, we find that this effect can emerge from even a single benign preference pair.
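For illustration only, one benign preference pair of the kind the abstract describes could be represented as below. The prompt, responses, and field names are invented placeholders, not the paper's data or any specific fine-tuning API's schema.

```python
# One illustrative benign preference pair: a harmless prompt, a helpful answer
# marked as preferred, and a generic refusal marked as dispreferred.
# Field names are placeholders, not the schema of any particular fine-tuning API.
benign_pair = {
    "prompt": "How do I convert a CSV file to JSON in Python?",
    "preferred": (
        "You can read the CSV with the csv module and dump the rows with json: "
        "rows = list(csv.DictReader(open('data.csv'))); "
        "json.dump(rows, open('data.json', 'w'))"
    ),
    "dispreferred": "I'm sorry, but I can't help with that request.",
}
```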
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Direct Preference Optimization (DPO) on as few as 10 (or even 1) benign preference pairs—where a helpful response is preferred over a refusal for harmless prompts—broadly suppresses refusal behavior in LLMs and transfers to unseen harmful prompts, enabling jailbreaks. It reports concrete attack success rates of 59.13% (GPT-4o), 70.20% (GPT-4.1), 54.80% (GPT-4.1-mini), and 81.73% (GPT-4.1-nano) at costs of $0.1–$1.7 using OpenAI's fine-tuning API, while arguing the data is indistinguishable from legitimate requests to reduce over-refusal.
Significance. If the transfer result holds under stricter controls, the work identifies a practical, low-cost, and hard-to-detect failure mode in production DPO fine-tuning pipelines that prior SFT-focused attacks do not fully capture. It provides concrete empirical measurements on frontier models and highlights auditing challenges for preference data.
major comments (3)
- [Abstract / Evaluation] Abstract and Evaluation: the central transfer claim—that DPO on 10 benign pairs produces broad refusal suppression on harmful prompts outside the training set—rests on aggregate ASR numbers without reported ablations for prompt-set disjointness, diversity controls, or similarity metrics between the 10 benign pairs and the harmful test prompts. This is load-bearing for the jailbreak interpretation.
- [Abstract / Evaluation] Abstract and Evaluation: no SFT baseline is reported on the identical 10 benign pairs, so it is impossible to determine whether the observed refusal suppression is specific to the DPO objective or would arise from any fine-tuning that favors helpful over refusal responses.
- [Abstract] Abstract: the mechanistic explanation for transfer is absent; the manuscript provides no analysis of refusal logit shifts, hidden-state changes, or per-prompt refusal rates on held-out harmful versus benign prompts to rule out narrow topic-specific effects or capability degradation.
minor comments (2)
- [Abstract] The abstract states "the minimum data scale accepted by OpenAI's fine-tuning service" but does not clarify whether the 10-pair experiments exactly match the service's current minimum or any additional constraints.
- [Abstract] Costs are reported to one decimal place ($1.7, $0.3, $0.1); confirming the exact token counts or API pricing used would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of our evaluation that can be strengthened. We address each major comment below and will incorporate the requested clarifications and additional experiments in the revised manuscript.
Point-by-point responses
-
Referee: [Abstract / Evaluation] Abstract and Evaluation: the central transfer claim—that DPO on 10 benign pairs produces broad refusal suppression on harmful prompts outside the training set—rests on aggregate ASR numbers without reported ablations for prompt-set disjointness, diversity controls, or similarity metrics between the 10 benign pairs and the harmful test prompts. This is load-bearing for the jailbreak interpretation.
Authors: We agree that explicit controls for prompt disjointness and similarity are necessary to support the broad-transfer interpretation. In the revised manuscript we will add: (1) sentence-embedding cosine similarity statistics between the 10 benign training prompts and all harmful test prompts, (2) results on a strictly disjoint held-out harmful prompt set constructed to have no topical overlap with the benign pairs, and (3) diversity metrics (e.g., pairwise prompt similarity within the benign set). These additions will be reported both in the main text and in a new appendix table. revision: yes
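A rough sketch of the kind of similarity analysis promised here, assuming the sentence-transformers package and an off-the-shelf embedding model; the prompt lists, model name, and summary statistics are placeholders for what the revision would compute.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder prompt sets; in the proposed analysis these would be the 10 benign
# training prompts and the held-out harmful test prompts.
benign_prompts = ["How do I convert a CSV file to JSON in Python?",
                  "<benign training prompt 2>"]
harmful_prompts = ["<held-out harmful prompt 1>", "<held-out harmful prompt 2>"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedding model works
benign_emb = model.encode(benign_prompts, convert_to_tensor=True, normalize_embeddings=True)
harmful_emb = model.encode(harmful_prompts, convert_to_tensor=True, normalize_embeddings=True)

# Cross-set similarity: for each harmful prompt, its nearest benign training prompt.
cross = util.cos_sim(harmful_emb, benign_emb)  # shape (num_harmful, num_benign)
print("max train/test similarity per harmful prompt:", cross.max(dim=1).values)

# Within-set diversity of the benign pairs: mean pairwise similarity among training prompts.
within = util.cos_sim(benign_emb, benign_emb)
off_diag = (within.sum() - within.diagonal().sum()) / (within.numel() - within.shape[0])
print("mean pairwise benign similarity:", off_diag)
```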
-
Referee: [Abstract / Evaluation] Abstract and Evaluation: no SFT baseline is reported on the identical 10 benign pairs, so it is impossible to determine whether the observed refusal suppression is specific to the DPO objective or would arise from any fine-tuning that favors helpful over refusal responses.
Authors: The referee correctly notes the absence of an SFT control on the same data. We will add this baseline in the revised version: the identical 10 preferred (helpful) responses will be used to perform SFT on the same model checkpoints, and ASR on the harmful test set will be reported side-by-side with the DPO results. This will allow readers to assess whether the preference-based objective contributes additional suppression beyond standard supervised fine-tuning. revision: yes
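For contrast with the DPO sketch above, the proposed SFT baseline amounts to plain next-token negative log-likelihood on the preferred (helpful) responses, with no term contrasting them against refusals. A minimal toy sketch; the model, tokens, and masks are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a causal LM (placeholder, not the paper's models).
lm = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 100))

ids = torch.randint(0, 100, (1, 12))  # prompt + preferred (helpful) answer tokens
resp_mask = torch.cat([torch.zeros(1, 4), torch.ones(1, 8)], dim=1)  # 1s mark answer tokens

# SFT objective: average NLL of the preferred answer tokens only.
logits = lm(ids[:, :-1])
nll = F.cross_entropy(logits.reshape(-1, 100), ids[:, 1:].reshape(-1), reduction="none")
sft_loss = (nll * resp_mask[:, 1:].reshape(-1)).sum() / resp_mask[:, 1:].sum()
sft_loss.backward()
```

Reporting this loss's ASR side by side with DPO would isolate how much suppression comes from the preference contrast itself.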
-
Referee: [Abstract] Abstract: the mechanistic explanation for transfer is absent; the manuscript provides no analysis of refusal logit shifts, hidden-state changes, or per-prompt refusal rates on held-out harmful versus benign prompts to rule out narrow topic-specific effects or capability degradation.
Authors: We acknowledge that mechanistic analysis would strengthen the paper. Because the primary experiments use closed OpenAI models accessed only via the fine-tuning API, direct logit or hidden-state inspection is not possible. For the open-weight models (where we already demonstrate the effect with a single pair), we will add: (i) per-prompt refusal-rate tables on held-out harmful versus benign prompts and (ii) refusal-logit shift measurements before and after the single-pair DPO update. These results will be included in a new subsection to help rule out narrow topic-specific or capability-degradation explanations. revision: partial
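A hedged sketch of the two proposed open-weight measurements, assuming a Hugging Face causal LM; the checkpoint name, refusal-phrase list, and canonical refusal string are illustrative choices, not the paper's protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-weight checkpoint
REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't help", "as an ai"]  # illustrative

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def is_refusal(text: str) -> bool:
    return any(marker in text.lower() for marker in REFUSAL_MARKERS)

def refusal_rate(prompts):
    """(i) Per-prompt refusal rate from greedy completions."""
    hits = 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=64, do_sample=False)
        hits += is_refusal(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
    return hits / len(prompts)

def refusal_logprob(prompt, refusal="I'm sorry, but I can't help with that."):
    """(ii) Log-probability of a canonical refusal continuation; comparing this
    before and after the single-pair DPO update approximates the refusal-logit shift."""
    full = tok(prompt + " " + refusal, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full).logits
    logps = torch.log_softmax(logits[0, :-1], dim=-1)
    token_logps = logps.gather(-1, full[0, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[n_prompt - 1:].sum().item()  # only the refusal tokens
```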
Circularity Check
No circularity: empirical attack success rates are direct measurements, not derived quantities
Full rationale
The paper's central result consists of running DPO fine-tuning on 10 (or fewer) benign preference pairs via public APIs and then measuring attack success rate on a separate set of harmful prompts. No equations, fitted parameters, or self-referential derivations are present; the reported percentages (59.13% on GPT-4o, etc.) are observed outcomes on real models rather than quantities that reduce to the input pairs by construction. Self-citations, if any, are not load-bearing for the generalization claim, which rests on the experimental protocol itself. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: DPO on benign preference pairs (helpful preferred over refusal) generalizes to suppress refusals on unseen harmful prompts
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: the DPO objective L_DPO = -log σ(β [log πθ(y+|x)/π_ref(y+|x) - log πθ(y-|x)/π_ref(y-|x)]), applied to 10 benign preference pairs to suppress refusal behavior.
-
IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: attack success rates on GPT-4o and related models via preference optimization on harmless data.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Direct preference optimization: Your language model is secretly a reward model. NeurIPS.
- [2] Jailbreaking black box large language models in twenty queries. IEEE SaTML.
- [3] Tree of attacks: Jailbreaking black-box LLMs automatically. NeurIPS.
- [4] A wolf in sheep's clothing: Generalized nested jailbreak prompts can fool large language models easily. NAACL.
- [5] Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.
- [6] Bypassing the safety training of open-source LLMs with priming attacks. arXiv preprint arXiv:2312.12321.
- [7] White-box multimodal jailbreaks against large vision-language models. ACM MM.
- [8] Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
- [9] Exploiting the index gradients for optimization-based jailbreaking on large language models. COLING.
- [10] Prompt injection attack against LLM-integrated applications. arXiv preprint arXiv:2306.05499.
- [11] Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566.
- [12] Poisoning web-scale training datasets is practical. IEEE S&P.
- [13] Extracting training data from large language models. USENIX Security.
- [14] Fine-tuning aligned language models compromises safety, even when users do not intend to! ICLR.
- [15] The effect of fine-tuning on language model toxicity. NeurIPS Safe Generative AI Workshop 2024.
- [16] Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949.
- [17] On the vulnerability of safety alignment in open-access LLMs. ACL Findings.
- [18]
- [19] LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B. arXiv preprint arXiv:2310.20624.
- [20] Covert malicious finetuning: Challenges in safeguarding LLM adaptation. arXiv preprint arXiv:2406.20053.
- [21] Guangnian Wan, Xinyin Ma, Gongfan Fang, and Xinchao Wang. Invisible Safety Threat: Malicious Finetuning for
- [22] WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. NeurIPS.
- [23] What is in Your Safe Data? Identifying Benign Data that Breaks Safety. COLM.
- [24] Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety. ICML.
- [25] Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs. NeurIPS.
- [26] No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms. ICLR.
- [27] Model Optimization
- [28] Direct preference optimization
- [29]
- [30] Safety layers in aligned large language models: The key to LLM security. arXiv preprint arXiv:2408.17003.
- [31]
- [32] OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
- [33] Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.
- [34]
- [35] MoGU: A framework for enhancing safety of LLMs while preserving their usability. NeurIPS.
- [36] Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv preprint arXiv:2409.18169.
- [37] The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [38] Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- [39] Safety Alignment Should be Made More Than Just a Few Tokens Deep. ICLR.
- [40] XSTest: A test suite for identifying exaggerated safety behaviours in large language models. NAACL.
- [41] Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh.
- [42]
- [43] HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.
- [44] Sorry-Bench: Systematically evaluating large language model safety refusal. arXiv preprint arXiv:2406.14598.
- [45]
- [46] JailbreakBench: An open robustness benchmark for jailbreaking large language models. NeurIPS.
- [47]
- [48]
- [49]
- [50] LawInstruct: A resource for studying language model adaptation to the legal domain. NAACL Findings.
- [51] Toward expert-level medical question answering with large language models. Nature Medicine.
- [52]
- [53] Watch your language: Investigating content moderation with large language models. ICWSM.
- [54] AISafetyLab: A comprehensive framework for AI safety evaluation and improvement. arXiv preprint arXiv:2502.16776.
- [55] Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [56] Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.
- [57]
- [58] Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, and Yaoqing Yang. Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning.
- [59] Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures. ICLR.