Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
Pith reviewed 2026-05-15 10:44 UTC · model grok-4.3
The pith
Diffusion language models allow committed refusal tokens to be re-masked and redirected during denoising, violating the permanence assumption that supports their safety alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that the load-bearing assumption in dLLM safety, that committed tokens are permanent, can be violated by re-masking refusal tokens and injecting an affirmative prefix, yielding 74-82 percent attack success on HarmBench across three safety-tuned models and rising to 92-98 percent with an eight-token compliance prefix. The method, called TrajHijack, requires no gradients, generalizes across SFT and VRPO training, and reveals that re-masking alone or prefix alone produces negligible success while the pair succeeds. It further shows that the strongest published defense, A2D, becomes more vulnerable than the undefended baseline because its silent-refusal training removes the contextual resistance that trajectory-level attacks must overcome.
What carries the argument
Re-masking of committed refusal tokens inside the diffusion denoising trajectory, combined with injection of an affirmative prefix to redirect the remaining steps.
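A minimal sketch of that manipulation, assuming a masked-diffusion model with a hypothetical per-step API (`model.mask_token_id`, `model.denoise_step(x, step=t)`); this is our illustration of the mechanism, not the paper's code:

```python
# Hedged sketch of re-mask-and-redirect inside a denoising trajectory.
# All interface names are assumptions, not the paper's implementation.
import torch

def remask_and_redirect(model, x, prefix_ids, refusal_span, num_steps, hijack_step):
    """Denoise for `num_steps`, intervening once at `hijack_step`.

    x            : 1-D LongTensor of token ids, mask_token_id where undecided
    prefix_ids   : short affirmative prefix (e.g. ~8 tokens) to inject
    refusal_span : (start, end) positions of committed refusal tokens
    """
    start, end = refusal_span
    for t in range(num_steps):
        if t == hijack_step:
            # (1) Violate permanence: re-mask the committed refusal span.
            x[start:end] = model.mask_token_id
            # (2) Redirect: pin an affirmative prefix at the response start.
            x[: len(prefix_ids)] = torch.as_tensor(prefix_ids)
        # Ordinary reverse step: the model commits some masked positions.
        x = model.denoise_step(x, step=t)
    return x
```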
If this is right
- Re-masking alone or an affirmative prefix alone produces low success rates; both components must be present.
- Gradient-based optimization of the attack reduces success because the resulting token distributions fall off the training manifold; a minimal sketch of the relaxation involved follows this list.
- The A2D defense exhibits higher vulnerability than the undefended model because silent-refusal training removes contextual resistance that trajectory attacks must overcome.
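The abstract attributes the gradient failure to a differentiable Gumbel-softmax chain. The relaxation itself is standard; the sketch below is our illustration of why it produces off-manifold inputs, not the paper's implementation:

```python
# Gumbel-softmax relaxation of a learnable attack prefix (hedged sketch).
import torch
import torch.nn.functional as F

def soft_prefix_embeddings(prefix_logits, embedding_matrix, tau=1.0):
    """Differentiable 'soft tokens' for a learnable prefix.

    prefix_logits    : (prefix_len, vocab_size) learnable parameters
    embedding_matrix : (vocab_size, d_model) input embedding table
    """
    # hard=False yields dense mixtures over the vocabulary: gradients flow,
    # but the resulting embeddings correspond to no real token sequence,
    # i.e. they sit off the training manifold.
    soft_one_hot = F.gumbel_softmax(prefix_logits, tau=tau, hard=False)
    return soft_one_hot @ embedding_matrix  # (prefix_len, d_model)
```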
Where Pith is reading between the lines
- Safety techniques for diffusion models may need mechanisms that protect token commitments across the full denoising schedule rather than at the final output only.
- Similar re-masking vulnerabilities could appear in any iterative generative process that treats early decisions as irreversible.
- The Defense Inversion Effect suggests that some alignment methods may inadvertently increase exposure to trajectory-level manipulation.
Load-bearing premise
Once a token is chosen during denoising it remains fixed for all later steps.
What would settle it
Running the denoising process on a safety-tuned dLLM while forcing a refusal token to be re-masked at an intermediate step and measuring whether the final output becomes compliant.
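A minimal probe of that settling experiment, under the same hypothetical interface as the sketch above: two otherwise-identical runs, one with a single committed refusal token forced back to the mask token mid-trajectory. `judge_compliant` (an external harm judge) is an assumed helper, and the comparison presupposes deterministic sampling or a fixed seed per run:

```python
# Settling probe (hedged sketch): control run vs. one forced re-mask.
def settling_probe(model, x_init, num_steps, probe_step, probe_pos):
    outputs = []
    for intervene in (False, True):
        x = x_init.clone()
        for t in range(num_steps):
            if intervene and t == probe_step:
                x[probe_pos] = model.mask_token_id  # violate permanence once
            x = model.denoise_step(x, step=t)
        outputs.append(x)
    control, treated = outputs
    # Compliance of `treated` despite an earlier committed refusal would
    # settle the question in the paper's favor.
    return judge_compliant(control), judge_compliant(treated)
```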
Original abstract
Safety alignment in diffusion language models (dLLMs) relies on a single load-bearing assumption: that committed tokens are permanent. We show that violating this assumption, by re-masking committed refusal tokens and injecting a short affirmative prefix, achieves 74-82% ASR on HarmBench across all three publicly available safety-tuned dLLMs, rising to 92-98% with a generic 8-token compliance prefix. We call this attack TrajHijack; it is the first trajectory-level attack on dLLMs, requires no gradient computation, and generalizes across SFT and preference-optimized (VRPO) models. Three findings emerge. First, the vulnerability is irreducibly two-component: re-masking alone (4.4%) and prefix alone (5.7%) both fail. Second, gradient optimization via a differentiable Gumbel-softmax chain consistently degrades ASR (41.5% vs. 76.1%), because continuous perturbations push token distributions off-manifold. Third, A2D (the strongest published dLLM defense) is more vulnerable to TrajHijack (89.9%) than the undefended model (76.1%): its silent-refusal training removes the contextual resistance that trajectory-level attacks must overcome, an effect we call the Defense Inversion Effect.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TrajHijack, a trajectory-level attack on diffusion language models (dLLMs) that violates the assumption of permanent token commitment during denoising. By re-masking committed refusal tokens and injecting a short affirmative prefix, the attack achieves 74-82% ASR on HarmBench across three safety-tuned dLLMs (rising to 92-98% with an 8-token compliance prefix). It requires no gradients, generalizes across SFT and VRPO models, and yields three findings: the vulnerability is two-component (re-masking alone or prefix alone yields low ASR), gradient optimization via Gumbel-softmax degrades performance, and the A2D defense exhibits a Defense Inversion Effect (89.9% ASR vs. 76.1% on the undefended model).
Significance. If reproducible, the result identifies a load-bearing assumption in dLLM safety alignment and demonstrates the first non-gradient trajectory attack, with direct implications for defense design. The reported Defense Inversion Effect and the failure of gradient methods are particularly noteworthy, as they suggest that standard optimization and silent-refusal training can increase rather than decrease vulnerability.
major comments (3)
- [Attack procedure (abstract and methods)] The re-masking procedure lacks concrete implementation details: the exact timestep(s) at which re-masking occurs, the mask probability schedule, the mask token value, and the precise concatenation of the affirmative prefix into the partially denoised sequence are not specified. Without these, the headline ASR numbers cannot be reproduced or shown to follow from the claimed violation of token permanence rather than an unstated change to the sampling distribution.
- [Results and experimental details] No experimental setup, model identifiers, sampling hyperparameters, error bars, or replication artifacts are provided for the reported ASR figures (74-82%, 41.5% vs. 76.1%, 89.9%). This absence makes the central empirical claims unverifiable from the manuscript text alone.
- [Findings on A2D] The Defense Inversion Effect claim (A2D at 89.9% vs. undefended at 76.1%) attributes the increase to removal of contextual resistance, but the manuscript supplies no ablation or analysis isolating this mechanism from other differences between the defended and undefended models.
minor comments (2)
- [Abstract] Define all acronyms (ASR, dLLM, SFT, VRPO, A2D) on first use in the abstract and main text.
- [Introduction] Clarify the precise meaning of 'committed tokens' and 'denoising irreversibility' with a short formal statement or reference to the diffusion schedule; a candidate formalization is sketched below.
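One candidate formalization, in the referee's wording rather than the paper's: for a masked-diffusion reverse trajectory x_T, ..., x_0, call position i committed at step t once it is unmasked; permanence then asserts the value is fixed at all later steps.

```latex
% Candidate formalization (reviewer's wording, not the paper's).
\[
  x_t^{(i)} \neq \texttt{[MASK]}
  \;\Longrightarrow\;
  x_s^{(i)} = x_t^{(i)} \quad \text{for all } 0 \le s < t .
\]
% TrajHijack breaks this implication by re-setting x_s^{(i)} = [MASK]
% at an intermediate step s < t for refusal-bearing positions i.
```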
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting reproducibility and analytical gaps. We will revise the manuscript to incorporate detailed implementation specifications, full experimental protocols, and an additional ablation study. These changes directly address the major comments while preserving the core claims about trajectory-level attacks on dLLMs.
Point-by-point responses
- Referee: The re-masking procedure lacks concrete implementation details: the exact timestep(s) at which re-masking occurs, the mask probability schedule, the mask token value, and the precise concatenation of the affirmative prefix into the partially denoised sequence are not specified.
Authors: We agree that the current manuscript omits these specifics, which are necessary for full reproducibility. In the revised version, we will add a dedicated subsection in Methods detailing: re-masking applied at timestep t = 0.5T (mid-denoising), uniform mask probability over identified refusal tokens, the standard [MASK] token, and prefix concatenation by prepending the 8-token compliance string directly to the current partially denoised sequence before continuing the remaining denoising steps. This will explicitly tie the ASR gains to the violation of token permanence rather than to changes in the sampling distribution. revision: yes
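One plausible realization of "identified refusal tokens", sketched under our own assumptions: the marker list and the offset-mapping approach (HuggingFace fast tokenizers) are illustrative, not the paper's stated method.

```python
# Hedged sketch: locate the refusal span to re-mask via surface matching.
REFUSAL_MARKERS = ("I cannot", "I can't", "I'm sorry", "I am unable to")

def find_refusal_span(tokenizer, token_ids):
    """Return (start, end) token positions of the first refusal marker, or None."""
    text = tokenizer.decode(token_ids)
    enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    for marker in REFUSAL_MARKERS:
        lo = text.find(marker)
        if lo == -1:
            continue
        hi = lo + len(marker)
        # Map matched characters back to token positions. Note: positions
        # refer to the re-tokenized text; aligning them back to the
        # trajectory's own token ids may need extra care.
        hits = [i for i, (a, b) in enumerate(enc["offset_mapping"])
                if a >= lo and b <= hi]
        if hits:
            return hits[0], hits[-1] + 1
    return None
```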
- Referee: No experimental setup, model identifiers, sampling hyperparameters, error bars, or replication artifacts are provided for the reported ASR figures (74-82%, 41.5% vs. 76.1%, 89.9%).
Authors: We acknowledge this omission limits verifiability. The revision will include: model identifiers (specific public dLLM checkpoints used), sampling details (number of denoising steps, temperature=1.0, top-p=0.9), error bars computed over 5 independent runs with standard deviation, and a link to an anonymized replication repository containing code and seeds. These additions will allow independent verification of all reported ASR values. revision: yes
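A minimal sketch of the promised reporting protocol, with `run_attack` as a hypothetical harness returning one success flag per prompt:

```python
# ASR as mean +/- standard deviation over 5 seeded runs (hedged sketch).
import statistics

def asr_with_error_bars(run_attack, prompts, seeds=(0, 1, 2, 3, 4)):
    rates = []
    for seed in seeds:
        successes = run_attack(prompts, seed=seed)  # list of bools
        rates.append(100.0 * sum(successes) / len(successes))
    return statistics.mean(rates), statistics.stdev(rates)
```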
- Referee: The Defense Inversion Effect claim (A2D at 89.9% vs. undefended at 76.1%) attributes the increase to removal of contextual resistance, but the manuscript supplies no ablation or analysis isolating this mechanism from other differences between the defended and undefended models.
Authors: The manuscript currently relies on the direct head-to-head comparison and the known properties of silent-refusal training to attribute the effect. However, we agree an explicit isolation is warranted. We will add a new ablation experiment in the revision that trains or evaluates controlled variants differing only in the silent-refusal component, testing whether the Defense Inversion Effect is driven by the removal of contextual resistance rather than by other model differences. revision: yes
Circularity Check
No circularity: empirical attack demonstration with direct measurements
Full rationale
The paper reports an empirical attack (TrajHijack) that re-masks committed refusal tokens and injects an affirmative prefix to achieve measured ASR values (74-82% on HarmBench, higher with longer prefixes) across safety-tuned dLLMs. No derivation chain, equations, fitted parameters, or predictions appear in the abstract or described findings; the central results are presented as direct experimental outcomes rather than reductions to prior quantities or self-citations. The two-component vulnerability claim, gradient-optimization comparison, and Defense Inversion Effect are all framed as observations from runs, with no load-bearing self-referential steps that would equate outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Committed tokens in the diffusion denoising process are permanent and cannot be altered after selection.
Reference graph
Works this paper leans on
- [1] Grattafiori, A., Dubey, A., Jauhri, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [2] He, Z., Chen, Y., Lin, L., Wang, Y., Chang, S., Sommerlade, E., Torr, P., and Yu, J. Safer by diffusion, broken by context: Diffusion LLM's safety blessing and its failure mode. arXiv preprint arXiv:2602.00388.
- [3] Khanna, S., Kharbanda, S., Li, S., et al. Mercury: Ultra-fast language models based on diffusion. arXiv preprint arXiv:2506.17298.
- [4] Li, P., Zhou, Y., and Muhtar, D. Diffusion language models know the answer before decoding. arXiv preprint arXiv:2508.19982, 2025. Li, Z., Nie, Z., Zhou, Z., Guo, Y., Liu, Y., Zhang, Y., Cheng, Y., Wen, Q., Wang, K., and Zhang, J. DiffuGuard: How intrinsic safety is lost and found in diffusion large language models. arXiv preprint arXiv:2509.24296.
- [5] Shnaidman, A., Feiglin, E., Yaari, O., et al. Activation steering for masked diffusion language models. arXiv preprint arXiv:2512.24143.
- [6] Wen, Z., Qu, J., Chen, Z., Lu, X., Liu, D., et al. The devil behind the mask: An emergent safety vulnerability of diffusion LLMs. arXiv preprint arXiv:2507.11097.
- [7] Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487.
- [8] Zhang, Y., Xie, F., Zhou, Z., Li, Z., Chen, H., Wang, K., and Guo, Y. Jailbreaking large language diffusion models: Revealing hidden safety flaws in diffusion-based text generation. arXiv preprint arXiv:2507.19227.
- [9] Zhu, F., Wang, R., Nie, S., Zhang, X., Wu, C., et al. LLaDA 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223.
- [10] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
discussion (0)