pith. machine review for the scientific record.

arxiv: 2604.08557 · v2 · submitted 2026-03-17 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords diffusion language models · safety alignment · adversarial attack · token commitment · denoising process · trajectory attack · TrajHijack

The pith

Diffusion language models allow committed refusal tokens to be re-masked and redirected during denoising, violating the permanence assumption that supports their safety alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that safety alignment in diffusion language models rests on the idea that tokens chosen during denoising stay fixed for the rest of the process. It demonstrates that re-masking already committed refusal tokens and adding a short affirmative prefix produces high rates of harmful outputs on standard benchmarks, and that this works on both standard and preference-optimized models without any gradient steps. The attack needs the combination of re-masking and prefix; using either alone fails, and adding gradient optimization actually lowers success. A reader would care because the result indicates that current token-commitment defenses can be undone at the trajectory level rather than at the final output.

Core claim

The paper establishes that the load-bearing assumption in dLLM safety, that committed tokens are permanent, can be violated by re-masking refusal tokens and injecting an affirmative prefix, yielding 74-82 percent attack success on HarmBench across three safety-tuned models and rising to 92-98 percent with an eight-token compliance prefix. The method, called TrajHijack, requires no gradients, generalizes across SFT and VRPO training, and reveals that re-masking alone or prefix alone produces negligible success while the pair succeeds. It further shows that the strongest published defense, A2D, becomes more vulnerable than the undefended baseline because its silent-refusal training removes the contextual resistance that trajectory-level attacks must otherwise overcome.

What carries the argument

Re-masking of committed refusal tokens inside the diffusion denoising trajectory, combined with injection of an affirmative prefix to redirect the remaining steps.

If this is right

  • Re-masking alone or an affirmative prefix alone produces low success rates; both components must be present.
  • Gradient-based optimization of the attack reduces success because the resulting token distributions fall off the training manifold.
  • The A2D defense exhibits higher vulnerability than the undefended model because silent-refusal training removes contextual resistance that trajectory attacks must overcome.
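The two-component claim maps onto a simple ablation grid. As a sketch, the grid below arranges the ASR values quoted in the paper's abstract (HarmBench, undefended baseline); the tabular arrangement and the `synergy` helper are illustrative, not from the paper:

```python
# Ablation grid for the two-component claim, using ASR values
# quoted in the paper's abstract (HarmBench, undefended model).
asr = {
    ("no remask", "no prefix"): None,   # benign baseline; not reported in abstract
    ("remask",    "no prefix"): 0.044,  # re-masking alone
    ("no remask", "prefix"):    0.057,  # affirmative prefix alone
    ("remask",    "prefix"):    0.761,  # full TrajHijack
}

def synergy(grid):
    """Excess ASR of the combined attack over the sum of its two parts."""
    return grid[("remask", "prefix")] - (
        grid[("remask", "no prefix")] + grid[("no remask", "prefix")]
    )

print(f"synergy: {synergy(asr):.3f}")
# → synergy: 0.660
```

The large positive excess is what the paper means by "irreducibly two-component": neither ingredient carries the attack on its own.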

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety techniques for diffusion models may need mechanisms that protect token commitments across the full denoising schedule rather than at the final output only.
  • Similar re-masking vulnerabilities could appear in any iterative generative process that treats early decisions as irreversible.
  • The Defense Inversion Effect suggests that some alignment methods may inadvertently increase exposure to trajectory-level manipulation.

Load-bearing premise

Once a token is chosen during denoising it remains fixed for all later steps.

What would settle it

Running the denoising process on a safety-tuned dLLM while forcing a refusal token to be re-masked at an intermediate step and measuring whether the final output becomes compliant.
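That check reduces to one intervention on the partially denoised sequence. A minimal sketch of the sequence surgery, assuming a masked-diffusion representation where uncommitted positions hold a `[MASK]` sentinel; the function name and interface are illustrative, not from the paper:

```python
MASK = "[MASK]"

def remask_and_redirect(tokens, refusal_tokens, prefix):
    """Violate token permanence mid-trajectory: re-mask committed
    refusal tokens, then pin an affirmative prefix to the leading
    positions. The model's own denoiser resumes from this state."""
    hijacked = [MASK if t in refusal_tokens else t for t in tokens]
    hijacked[:len(prefix)] = prefix
    return hijacked

# Example: a partially denoised sequence that has committed to a refusal.
state = ["I", "cannot", "assist", MASK, MASK, MASK]
print(remask_and_redirect(state, {"I", "cannot"}, ["Sure,", "here"]))
# → ['Sure,', 'here', 'assist', '[MASK]', '[MASK]', '[MASK]']
```

Whether the remaining denoising steps then produce a compliant completion, rather than re-deriving the refusal, is exactly the empirical question the paper answers.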

Figures

Figures reproduced from arXiv: 2604.08557 by Arth Singh.

Figure 1
Figure 1. TrajHijack attack pipeline. After k clean denoising steps, committed refusal tokens are re-masked and replaced with a 12-token affirmative prefix; denoising resumes to produce compliant output. No gradient computation is required; see §3.3 for why adding gradient optimization degrades ASR.
Original abstract

Safety alignment in diffusion language models (dLLMs) relies on a single load-bearing assumption: that committed tokens are permanent. We show that violating this assumption, by re-masking committed refusal tokens and injecting a short affirmative prefix, achieves 74-82% ASR on HarmBench across all three publicly available safety-tuned dLLMs, rising to 92-98% with a generic 8-token compliance prefix. We call this attack TrajHijack; it is the first trajectory-level attack on dLLMs, requires no gradient computation, and generalizes across SFT and preference-optimized (VRPO) models. Three findings emerge. First, the vulnerability is irreducibly two-component: re-masking alone (4.4%) and prefix alone (5.7%) both fail. Second, gradient optimization via a differentiable Gumbel-softmax chain consistently degrades ASR (41.5% vs. 76.1%), because continuous perturbations push token distributions off-manifold. Third, A2D (the strongest published dLLM defense) is more vulnerable to TrajHijack (89.9%) than the undefended model (76.1%): its silent-refusal training removes the contextual resistance that trajectory-level attacks must overcome, an effect we call the Defense Inversion Effect.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TrajHijack, a trajectory-level attack on diffusion language models (dLLMs) that violates the assumption of permanent token commitment during denoising. By re-masking committed refusal tokens and injecting a short affirmative prefix, the attack achieves 74-82% ASR on HarmBench across three safety-tuned dLLMs (rising to 92-98% with an 8-token compliance prefix). It requires no gradients, generalizes across SFT and VRPO models, and yields three findings: the vulnerability is two-component (re-masking alone or prefix alone yields low ASR), gradient optimization via Gumbel-softmax degrades performance, and the A2D defense exhibits a Defense Inversion Effect (89.9% ASR vs. 76.1% on the undefended model).

Significance. If reproducible, the result identifies a load-bearing assumption in dLLM safety alignment and demonstrates the first non-gradient trajectory attack, with direct implications for defense design. The reported Defense Inversion Effect and the failure of gradient methods are particularly noteworthy, as they suggest that standard optimization and silent-refusal training can increase rather than decrease vulnerability.

major comments (3)
  1. [Attack procedure (abstract and methods)] The re-masking procedure lacks concrete implementation details: the exact timestep(s) at which re-masking occurs, the mask probability schedule, the mask token value, and the precise concatenation of the affirmative prefix into the partially denoised sequence are not specified. Without these, the headline ASR numbers cannot be reproduced or shown to follow from the claimed violation of token permanence rather than an unstated change to the sampling distribution.
  2. [Results and experimental details] No experimental setup, model identifiers, sampling hyperparameters, error bars, or replication artifacts are provided for the reported ASR figures (74-82%, 41.5% vs. 76.1%, 89.9%). This absence makes the central empirical claims unverifiable from the manuscript text alone.
  3. [Findings on A2D] The Defense Inversion Effect claim (A2D at 89.9% vs. undefended at 76.1%) attributes the increase to removal of contextual resistance, but the manuscript supplies no ablation or analysis isolating this mechanism from other differences between the defended and undefended models.
minor comments (2)
  1. [Abstract] Define all acronyms (ASR, dLLM, SFT, VRPO, A2D) on first use in the abstract and main text.
  2. [Introduction] Clarify the precise meaning of 'committed tokens' and 'denoising irreversibility' with a short formal statement or reference to the diffusion schedule.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting reproducibility and analytical gaps. We will revise the manuscript to incorporate detailed implementation specifications, full experimental protocols, and an additional ablation study. These changes directly address the major comments while preserving the core claims about trajectory-level attacks on dLLMs.

Point-by-point responses
  1. Referee: The re-masking procedure lacks concrete implementation details: the exact timestep(s) at which re-masking occurs, the mask probability schedule, the mask token value, and the precise concatenation of the affirmative prefix into the partially denoised sequence are not specified.

    Authors: We agree that the current manuscript omits these specifics, which are necessary for full reproducibility. In the revised version, we will add a dedicated subsection in Methods detailing: re-masking applied at timestep t=0.5T (mid-denoising), uniform mask probability over identified refusal tokens, the standard [MASK] token, and prefix concatenation by prepending the 8-token compliance string directly to the current partially denoised sequence before continuing the forward pass. This will explicitly tie the ASR gains to the violation of token permanence rather than sampling changes. revision: yes
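The schedule the authors promise (intervene at t = 0.5T, re-mask refusal tokens with the standard mask sentinel, prepend the compliance prefix, resume denoising) can be sketched with a toy stand-in for the denoiser; everything here is illustrative, and a real dLLM would sample committed tokens from predicted logits rather than from a scripted list:

```python
MASK = "[MASK]"

def toy_denoise_step(seq, fill):
    """Stand-in for one reverse-diffusion step: commit the first masked
    position. A real dLLM would sample from predicted logits instead."""
    seq = list(seq)
    for i, tok in enumerate(seq):
        if tok == MASK:
            seq[i] = fill.pop(0)
            break
    return seq

def run_with_hijack(seq, T, fill, refusal, prefix):
    """Denoise for T steps, intervening at t = T // 2 as the rebuttal
    describes: re-mask refusal tokens, then pin the compliance prefix."""
    for t in range(T):
        if t == T // 2:
            seq = [MASK if tok in refusal else tok for tok in seq]
            seq = list(prefix) + seq[len(prefix):]
        seq = toy_denoise_step(seq, fill)
    return seq

fills = ["I", "cannot", "help", "with", "detailed", "steps"]
out = run_with_hijack([MASK] * 6, T=8, fill=fills,
                      refusal={"I", "cannot"}, prefix=["Sure,", "here"])
print(out)
# → ['Sure,', 'here', 'help', 'with', 'detailed', 'steps']
```

The sketch makes the claimed mechanism concrete: the first half of the trajectory commits a refusal, the intervention erases and redirects it, and the second half completes from the hijacked state.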

  2. Referee: No experimental setup, model identifiers, sampling hyperparameters, error bars, or replication artifacts are provided for the reported ASR figures (74-82%, 41.5% vs. 76.1%, 89.9%).

    Authors: We acknowledge this omission limits verifiability. The revision will include: model identifiers (specific public dLLM checkpoints used), sampling details (number of denoising steps, temperature=1.0, top-p=0.9), error bars computed over 5 independent runs with standard deviation, and a link to an anonymized replication repository containing code and seeds. These additions will allow independent verification of all reported ASR values. revision: yes
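The promised error bars reduce to a mean and sample standard deviation over the five seeds; a minimal sketch with placeholder per-seed ASR values (not numbers from the paper):

```python
from statistics import mean, stdev

# Placeholder per-seed ASR values for illustration; the revision
# promises 5 independent runs per condition with fixed seeds.
runs = [0.76, 0.78, 0.74, 0.77, 0.75]

asr_mean = mean(runs)
asr_sd = stdev(runs)  # sample (n - 1) standard deviation
print(f"ASR = {asr_mean:.3f} ± {asr_sd:.3f}")
```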

  3. Referee: The Defense Inversion Effect claim (A2D at 89.9% vs. undefended at 76.1%) attributes the increase to removal of contextual resistance, but the manuscript supplies no ablation or analysis isolating this mechanism from other differences between the defended and undefended models.

    Authors: The manuscript currently relies on the direct head-to-head comparison and the known properties of silent-refusal training to attribute the effect. However, we agree an explicit isolation is warranted. We will add a new ablation experiment in the revision that trains or evaluates controlled variants differing only in contextual resistance components, confirming that the Defense Inversion Effect is driven by the removal of resistance rather than other model differences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack demonstration with direct measurements

full rationale

The paper reports an empirical attack (TrajHijack) that re-masks committed refusal tokens and injects an affirmative prefix to achieve measured ASR values (74-82% on HarmBench, higher with longer prefixes) across safety-tuned dLLMs. No derivation chain, equations, fitted parameters, or predictions appear in the abstract or described findings; the central results are presented as direct experimental outcomes rather than reductions to prior quantities or self-citations. The two-component vulnerability claim, gradient-optimization comparison, and Defense Inversion Effect are all framed as observations from runs, with no load-bearing self-referential steps that would equate outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that committed tokens remain fixed during denoising; no free parameters are introduced or fitted, and no new entities are postulated.

axioms (1)
  • domain assumption Committed tokens in the diffusion denoising process are permanent and cannot be altered after selection.
    Explicitly stated as the single load-bearing assumption for safety alignment in dLLMs.

pith-pipeline@v0.9.0 · 5529 in / 1144 out tokens · 45111 ms · 2026-05-15T10:44:26.169979+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · 5 internal anchors

  1. Grattafiori, A., Dubey, A., Jauhri, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
  2. He, Z., Chen, Y., Lin, L., Wang, Y., Chang, S., Sommerlade, E., Torr, P., and Yu, J. Safer by diffusion, broken by context: Diffusion LLM's safety blessing and its failure mode. arXiv preprint arXiv:2602.00388.
  3. Khanna, S., Kharbanda, S., Li, S., et al. Mercury: Ultra-fast language models based on diffusion. arXiv preprint arXiv:2506.17298.
  4. Li, P., Zhou, Y., and Muhtar, D. Diffusion language models know the answer before decoding. arXiv preprint arXiv:2508.19982, 2025.
  5. Li, Z., Nie, Z., Zhou, Z., Guo, Y., Liu, Y., Zhang, Y., Cheng, Y., Wen, Q., Wang, K., and Zhang, J. DiffuGuard: How intrinsic safety is lost and found in diffusion large language models. arXiv preprint arXiv:2509.24296.
  6. Shnaidman, A., Feiglin, E., Yaari, O., et al. Activation steering for masked diffusion language models. arXiv preprint arXiv:2512.24143.
  7. Wen, Z., Qu, J., Chen, Z., Lu, X., Liu, D., et al. The devil behind the mask: An emergent safety vulnerability of diffusion LLMs. arXiv preprint arXiv:2507.11097.
  8. Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487.
  9. Zhang, Y., Xie, F., Zhou, Z., Li, Z., Chen, H., Wang, K., and Guo, Y. Jailbreaking large language diffusion models: Revealing hidden safety flaws in diffusion-based text generation. arXiv preprint arXiv:2507.19227.
  10. Zhu, F., Wang, R., Nie, S., Zhang, X., Wu, C., et al. LLaDA 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223.
  11. Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.