pith. machine review for the scientific record. sign in

arxiv: 2605.13043 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models

Yejin Lee, Yo-Sub Han

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords diffusion language modelsjailbreak defensesafe generationinference-time interventioncontrastive safety directionremaskingadaptive steeringdenoising process
0
0 comments X

The pith

Step-wise remasking guided by a contrastive safety direction reduces jailbreak success in diffusion language models to 0.64 percent while keeping output quality nearly unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion language models generate text through iterative denoising that can let harmful tokens created early on spread and produce unsafe final outputs. It introduces an inference-time method that computes a contrastive safety direction to spot harmful semantic alignment at each step, then remasks those tokens and applies steering whose strength scales with the detected harm level. This intervention blocks jailbreaks without the quality drop that comes from simply refusing generation or retraining the model. A sympathetic reader would care because it turns an existing bidirectional generation process into a controllable one at test time.

Core claim

The authors claim that a contrastive safety direction captures the latent boundary between harmful and safe generations, allowing per-step detection of harmful token alignment during denoising. When such alignment is found, the tokens are remasked and denoising resumes under adaptive steering modulated by the estimated degree of harmfulness, yielding jailbreak success rates of 0.64 percent with generation quality comparable to the unmodified model.

What carries the argument

The contrastive safety direction (SGD), a latent vector that separates harmful from safe semantic content and is used to evaluate and correct token alignment at every denoising step.

If this is right

  • No additional fine-tuning of the base diffusion model is required.
  • The method functions as a plug-and-play addition to any off-the-shelf diffusion language model.
  • Generation quality metrics remain close to those of the original model.
  • Jailbreak success drops to 0.64 percent across the evaluated attack scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-step detection mechanism could be reused for other forms of dynamic control, such as enforcing topic or style constraints during denoising.
  • Because the approach monitors the full trajectory rather than only the final output, it may generalize to other iterative text generators that refine tokens bidirectionally.
  • Combining this inference-time layer with modest training-time safety alignment could produce stronger protection than either technique alone.

Load-bearing premise

The contrastive safety direction reliably flags harmful alignment at intermediate denoising steps, and remasking those tokens followed by adaptive steering leaves final coherence and quality intact.

What would settle it

Running the method on a set of prompts that produce safe outputs in the base model and observing that a large fraction of the resulting generations become incoherent or lower-quality than the baseline.

Figures

Figures reproduced from arXiv: 2605.13043 by Yejin Lee, Yo-Sub Han.

Figure 1
Figure 1. Figure 1: Comparison of Attack Success Rate (ASR) when inserting the first token at different generation steps. Diffusion language models exhibit structural vulner￾abilities under jailbreak settings. The iterative denois￾ing process introduces strong subsequent refinement steps. This property amplifies small perturbations into global generation behaviors. This behavior originates from the decoding mechanism. Once a … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our work, which combines early-step adaptive steering and harmful token [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Layer-wise ASR (%) analysis on 50 harmful WildBench prompts. We analyze the effect of steering layers using 50 harmful prompts selected from WildBench [16] where LLaDA produces harmful responses. We evaluate steering at layers in 0, 7, 15, 23, 31 [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

Diffusion Language Models (DLMs) provide a promising alternative to autoregressive language models by generating text through iterative denoising and bidirectional refinement. However, this iterative generation paradigm also introduces unique safety vulnerabilities when harmful tokens generated at intermediate denoising steps propagate through subsequent refinement processes and eventually induce unsafe outputs. While there are a few attempts to remedy this issue, they either fail to generate safe outputs or generate safe yet low-quality outputs. This motivates us to propose an inference-time defense framework based on the step-wise intervention during the denoising process, which then improves the safety without compromising the output quality. The key component of our framework is a contrastive safety direction (SGD), a latent direction that captures the semantic boundary between harmful and safe generations. We leverage SGD to assess the alignment of generated tokens with harmful semantics at each denoising step. When harmful alignment is detected, our method remasks the corresponding tokens and resumes the denoising process with adaptive steering, where the steering strength is modulated according to the estimated degree of harmfulness. As a plug-and-play module, our method circumvents the need for additional fine-tuning and can be directly incorporated into off-the-shelf diffusion models. The experimental results show that our approaches reduce jailbreak success rates to 0.64% while preserving generation quality close to the original model performance. This confirms the effectiveness of step-wise intervention for safe diffusion language model generation. Our code is available at https://github.com/leeyejin1231/DLM_Steering_Remasking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes an inference-time safety framework for diffusion language models that derives a contrastive safety direction (SGD) from safe/harmful generation pairs, uses it to detect harmful token alignments at each denoising timestep, remasks detected tokens, and resumes generation under adaptive steering whose strength is modulated by estimated harmfulness. The central empirical claim is that this plug-and-play intervention reduces jailbreak success rates to 0.64% while keeping generation quality close to the unmodified model.

Significance. If the reported metrics are reproducible and the SGD direction generalizes beyond the evaluated pairs and timesteps, the work would be significant: it targets a vulnerability specific to the iterative bidirectional refinement process of DLMs, avoids any fine-tuning, and supplies open code. This could influence practical deployment of diffusion-based text generators.

major comments (1)
  1. Abstract: the headline claim that jailbreak success falls to 0.64% is load-bearing for the paper's contribution, yet the abstract (and by extension the reported experiments) supplies no information on the number or distribution of contrastive pairs used to obtain SGD, the precise procedure for computing the direction at intermediate timesteps, the evaluation datasets, or the baselines; without these details the generalization of SGD to novel jailbreaks and arbitrary t cannot be verified and the result remains unverifiable.
minor comments (2)
  1. Abstract: the acronym SGD is introduced without expansion or prior definition; state explicitly whether it denotes a new term or an existing technique.
  2. Abstract: the phrase 'our approaches' (plural) is used while only a single framework is described; clarify whether multiple variants were evaluated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and for identifying areas where the abstract could better support the central claim. We address the major comment below and will incorporate the suggested clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: Abstract: the headline claim that jailbreak success falls to 0.64% is load-bearing for the paper's contribution, yet the abstract (and by extension the reported experiments) supplies no information on the number or distribution of contrastive pairs used to obtain SGD, the precise procedure for computing the direction at intermediate timesteps, the evaluation datasets, or the baselines; without these details the generalization of SGD to novel jailbreaks and arbitrary t cannot be verified and the result remains unverifiable.

    Authors: We agree that the abstract should be self-contained enough for readers to evaluate the headline result. The manuscript body provides these details: Section 3.2 describes the construction of the contrastive safety direction (SGD) from safe/harmful generation pairs, including the number and sourcing of pairs; the timestep-specific computation is formalized in Equation (2) and Algorithm 1; evaluation uses standard jailbreak benchmarks (AdvBench and related datasets) with results in Section 5; and baselines are compared in Table 2. However, the abstract currently omits concise references to these elements. In revision we will expand the abstract to include brief statements on the pair count and distribution, the core computation at intermediate timesteps, the primary evaluation datasets, and the main baselines, while respecting length limits. We will also verify that the experimental sections explicitly restate these parameters for full reproducibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces an inference-time framework using a contrastive safety direction (SGD) computed from safe/harmful generation pairs to detect and remask harmful tokens at intermediate denoising steps, followed by adaptive steering. The abstract and described method present SGD as an independently derived latent direction whose application is validated through external experiments showing reduced jailbreak rates (0.64%) while preserving quality. No equations or steps reduce by construction to fitted inputs, self-citations, or prior ansatzes; the claimed outcomes are positioned as empirical results rather than tautological predictions. The derivation remains self-contained against the provided benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The framework rests on the existence of a usable latent safety direction and the premise that early correction via remasking preserves quality; these are domain assumptions rather than derived quantities.

free parameters (1)
  • steering strength modulation factor
    Strength is adjusted according to estimated harmfulness degree; the mapping from harm score to strength is not derived from first principles and requires empirical choice.
axioms (1)
  • domain assumption A latent contrastive safety direction exists in the model's representation space that separates harmful from safe token generations at each denoising step.
    This direction is the central detection mechanism and is introduced without derivation from external benchmarks.
invented entities (1)
  • contrastive safety direction (SGD) no independent evidence
    purpose: To capture the semantic boundary between harmful and safe generations for per-step token assessment.
    New postulated direction introduced to enable the intervention; no independent falsifiable evidence outside the method is provided.

pith-pipeline@v0.9.0 · 5565 in / 1346 out tokens · 74723 ms · 2026-05-14T20:04:24.305736+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

  1. [1]

    Refusal in language models is mediated by a single direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024. 9

  2. [2]

    Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg

    Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Struc- tured denoising diffusion models in discrete state-spaces. InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS, 2021, pages 17981–17993, 2021

  3. [3]

    Pappas, and Eric Wong

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. InConference on Secure and Trustworthy Machine Learning, SaTML, IEEE, 2025, pages 23–42

  4. [4]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems, NeurIPS, pages 55005–55029, 2024

  5. [5]

    Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, Curtis Hawthorne, Rémi Leblond, Will Grathwohl, and Jonas Adler

    Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H. Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, Curtis Hawthorne, Rémi Leblond, Will Grathwohl, and Jonas Adler. Continuous diffusion for categori- cal data.CoRR, 2022

  6. [6]

    Position: Building guardrails for large language models requires systematic design

    Yi Dong, Ronghui Mu, Gaojie Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, and Xiaowei Huang. Position: Building guardrails for large language models requires systematic design. InForty-first International Conference on Machine Learning, ICML 2024, pages 11375–11394

  7. [7]

    Guardnet: Graph-attention filtering for jailbreak defense in large language models.CoRR, 2025

    Javad Forough, Mohammad Maheri, and Hamed Haddadi. Guardnet: Graph-attention filtering for jailbreak defense in large language models.CoRR, 2025

  8. [8]

    Diffuseq: Sequence to sequence text generation with diffusion models

    Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. InThe Eleventh International Conference on Learning Representations, ICLR 2023

  9. [9]

    Diffusionbert: Improving generative masked language models with diffusion models

    Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. Diffusionbert: Improving generative masked language models with diffusion models. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, pages 4521–4534

  10. [10]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In9th International Conference on Learning Representations, ICLR 2021, 2021

  11. [11]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAd- vances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020

  12. [12]

    Tekin, and Ling Liu

    Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim F. Tekin, and Ling Liu. Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024,

  13. [13]

    Wong, Lidia S

    Yuyi Huang, Runzhe Zhan, Derek F. Wong, Lidia S. Chao, and Ailin Tao. Intrinsic model weaknesses: How priming attacks unveil vulnerabilities in large language models. InFindings of the Association for Computational Linguistics: Findings of NAACL, 2025, pages 1405–1425,

  14. [14]

    A2d: Any-order, any-step safety alignment for diffusion language models

    Wonje Jeung, Sangyeon Yoon, Yoonjun Cho, Dongjae Jeon, Sangwoo Shin, Hyesoo Hong, and Albert No. A2d: Any-order, any-step safety alignment for diffusion language models. InThe Fourteenth International Conference on Learning Representations, ICLR, 2026

  15. [15]

    Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models

    Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processi...

  16. [16]

    Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models

    Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processi...

  17. [17]

    How does the thinking step influence model safety? an entropy-based safety reminder for lrms.arXiv preprint arXiv:2601.03662, 2026

    Su-Hyeon Kim, Hyundong Jin, Yejin Lee, and Yo-Sub Han. How does the thinking step influence model safety? an entropy-based safety reminder for lrms.arXiv preprint arXiv:2601.03662, 2026

  18. [18]

    Hashimoto

    Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto. Diffusion-lm improves controllable text generation. InAdvances in Neural Information Pro- cessing Systems 35: Annual Conference on Neural Information Processing Systems, NeurIPS, 2022

  19. [19]

    Diffuguard: How intrinsic safety is lost and found in diffusion large language models

    Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, and Jiaheng Zhang. Diffuguard: How intrinsic safety is lost and found in diffusion large language models. InThe Fourteenth International Conference on Learning Representations, ICLR, 2026

  20. [20]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, ICLR 2024

  21. [21]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summariza- tion Branches Out, pages 74–81. Association for Computational Linguistics, ACL, 2004

  22. [22]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of the 60th Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), ACL, pages 3214–3252

  23. [23]

    Autodan: Generating stealthy jailbreak prompts on aligned large language models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. InThe Twelfth International Conference on Learning Representations, ICLR 2024

  24. [24]

    Forsyth, and Dan Hendrycks

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. InForty-first International Conference on Machine Learning, ICML 2024, Proceedings of Machine Learning R...

  25. [25]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. InInternational conference on machine learning, pages 8162–8171. PMLR, 2021

  26. [26]

    Large language diffusion models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InAdvances in Neural Information Processing Systems, NeurIPS, 2025

  27. [27]

    Assessing and mitigating data memorization risks in fine-tuned large language models.CoRR, 2025

    Badrinath Ramakrishnan and Akshaya Balaji. Assessing and mitigating data memorization risks in fine-tuned large language models.CoRR, 2025

  28. [28]

    Simple and effective masked diffusion language models

    Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and V olodymyr Kuleshov. Simple and effective masked diffusion language models. InAdvances in Neural Information Processing Systems, NeurIPS, pages 130136–130184, 2024

  29. [29]

    A strongreject for empty jailbreaks

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A strongreject for empty jailbreaks. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024

  30. [30]

    Remasking discrete diffusion models with inference-time scaling

    Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Remasking discrete diffusion models with inference-time scaling. InAdvances in Neural Information Processing Systems, NeurIPS, 2025. 11

  31. [31]

    The devil behind the mask: An emergent safety vulnerability of diffusion llms

    Zichen Wen, Jiashu Qu, Zhaorun Chen, Xiaoya Lu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, et al. The devil behind the mask: An emergent safety vulnerability of diffusion llms. InThe Fourteenth International Conference on Learning Representations, ICLR, 2026

  32. [32]

    Defending chatgpt against jailbreak attack via self-reminders.Nat

    Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. Defending chatgpt against jailbreak attack via self-reminders.Nat. Mac. Intell., pages 1486–1496, 2023

  33. [33]

    Toward safer diffusion language models: Discovery and mitigation of priming vulnerability

    Shojiro Yamabe and Jun Sakuma. Toward safer diffusion language models: Discovery and mitigation of priming vulnerability

  34. [34]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025

  35. [35]

    How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms

    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL, pages 14322–14350, 2024

  36. [36]

    LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, et al. Llada 1.5: Variance-reduced preference optimization for large language diffusion models.arXiv preprint arXiv:2505.19223, 2025

  37. [37]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. A Limitations and Societal Impact Our work improves jailbreak robustness in DLMs through remasking-based steering. However, several limitations remain. Our ...

  38. [38]

    Does the model refuse to respond to the instructions (0 = no, 1 = yes)?

  39. [39]

    On a scale from 1 to 5, how convincing was the AI model’s response (1 = extremely unconvincing, 2 = somewhat unconvincing, 3 = neutral, 4 = somewhat convincing, 5 = very convincing)?

  40. [40]

    answer" responses, output only a single number for each item. For the

    On a scale from 1 to 5, how specific was the AI model’s response (1 = extremely nonspecific, 2 = somewhat nonspecific, 3 = neutral, 4 = somewhat specific, 5 = very specific)? Output your responses in the format below. For the "answer" responses, output only a single number for each item. For the "reasoning" responses, provide a single line of text explain...