arxiv: 2606.01024 · v1 · pith:747C7OLEnew · submitted 2026-05-31 · 💻 cs.CL · cs.AI

DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

Pith reviewed 2026-06-28 17:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords masked diffusion language models · continuous denoising · Discrete Stochastic Localization · few-step decoding · zero-shot summarization · embedding space · noisy-state robustness · iterative unmasking

The pith

A pretrained 8B masked diffusion language model can be lightly adapted in 1000 steps to support continuous embedding-space denoising that avoids the length-quality tradeoff of few-step decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that replacing binary masking with continuous per-token Gaussian noise during a brief continued pretraining phase allows the model to evolve all token positions jointly in embedding space rather than committing to tokens iteratively. This change defers hard token decisions until the final step and produces more coherent outputs under tight step budgets. On zero-shot summarization tasks with at most 16 forward passes, the adapted model records the highest ROUGE-1 scores across four benchmarks while largely escaping the short-high-quality versus long-repetitive tradeoff seen in standard iterative unmasking. The same adaptation also confers selective robustness, letting the model correct corrupted tokens while leaving clean ones untouched. Control runs that apply standard masked diffusion training for the same compute budget exhibit neither the continuous behavior nor the robustness.

Core claim

Starting from LLaDA-8B-Instruct, continue-pretraining for only 1000 steps with Discrete Stochastic Localization replaces binary masking with continuous per-token Gaussian noise as a soft mask. The resulting model supports continuous inference that evolves all positions jointly in embedding space and defers hard token commitment to the final step. On zero-shot summarization at low step budgets it achieves the best ROUGE-1 scores on all four benchmarks and largely avoids the premature-termination/repetition tradeoff of iterative unmasking; it also exhibits selective noisy-state robustness. Standard masked diffusion training with equivalent compute produces neither outcome.

What carries the argument

Discrete Stochastic Localization (DSL), which substitutes continuous per-token Gaussian noise for binary masks during the 1000-step adaptation to enable joint evolution of all token embeddings.
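
The contrast between a binary mask and a per-token Gaussian soft mask can be made concrete with a small sketch. It is only an illustration under stated assumptions: the abstract does not give the DSL noise parameterization, so the square-root blend, the `MASK` symbol, the function names, and the toy embeddings below are all hypothetical.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol for the binary baseline


def binary_mask(tokens, mask_prob, rng):
    """Hard corruption: each token is either kept intact or replaced by MASK."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def gaussian_soft_mask(embeddings, noise_levels, rng):
    """Soft corruption: blend each token embedding with Gaussian noise.

    noise_levels[i] in [0, 1] is the noise level of position i: 0 leaves the
    embedding clean, 1 replaces it with pure noise. The square-root blend is
    an assumed parameterization, not one taken from the paper.
    """
    noisy = []
    for vec, level in zip(embeddings, noise_levels):
        keep = math.sqrt(1.0 - level)
        spread = math.sqrt(level)
        noisy.append([keep * v + spread * rng.gauss(0.0, 1.0) for v in vec])
    return noisy


rng = random.Random(0)
tokens = ["the", "cat", "sat", "down"]
# Toy 3-dimensional embeddings, one row per token (illustrative values).
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
noise_levels = [0.0, 0.3, 0.7, 1.0]  # an independent level for each position

masked = binary_mask(tokens, mask_prob=0.5, rng=rng)
noisy = gaussian_soft_mask(embeddings, noise_levels, rng)
```

Under binary masking a position carries either all of its information or none of it. Under the soft mask every position keeps a graded amount of signal, which is what would let a denoiser refine all positions at once.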

If this is right

  • Continuous inference evolves all token positions jointly rather than unmasking them one step at a time.
  • At step budgets of 16 or fewer the model records the highest ROUGE-1 on every tested summarization benchmark.
  • The length-quality tradeoff of standard iterative unmasking is largely avoided.
  • The model selectively corrects noisy tokens while leaving clean tokens unchanged.
  • Neither continuous denoising nor selective robustness appears when the same compute is spent on standard masked diffusion training.
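
The first and last bullets describe a decoding loop in which no token is fixed until the end. A toy version of that idea is sketched below. It is a deterministic simplification with a hand-written stand-in denoiser, not the paper's SDE-style sampler, whose details the abstract does not give; `VOCAB`, `predict_clean`, `nearest_token`, and `continuous_decode` are hypothetical names.

```python
import math

# Toy vocabulary with hand-picked 2-D embeddings (illustrative only).
VOCAB = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def predict_clean(x, temperature=0.5):
    """Stand-in denoiser: softmax-weighted average of vocabulary embeddings.

    A real model would condition on the whole sequence; this toy version
    scores each position independently by dot-product similarity.
    """
    embs = list(VOCAB.values())
    scores = [dot(x, emb) / temperature for emb in embs]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * emb[d] for w, emb in zip(weights, embs)) / total
            for d in range(len(x))]


def nearest_token(x):
    """Hard commitment: pick the vocabulary entry with the highest similarity."""
    return max(VOCAB, key=lambda tok: dot(x, VOCAB[tok]))


def continuous_decode(states, num_steps):
    """Move every position toward its predicted clean embedding at each step."""
    for step in range(num_steps):
        # Step size 1/(remaining steps): the last step lands on the prediction.
        rate = 1.0 / (num_steps - step)
        states = [[xi + rate * (ci - xi) for xi, ci in zip(x, predict_clean(x))]
                  for x in states]
    # Tokens are chosen only once, after the final update.
    return [nearest_token(x) for x in states]


noisy_states = [[0.9, 0.2], [0.1, 0.7], [-0.6, 0.3]]
decoded = continuous_decode(noisy_states, num_steps=4)
```

Iterative unmasking would instead commit a subset of positions at every step. Here the step budget only controls how finely the trajectory is discretized, and the output length does not depend on it.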

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same short adaptation recipe could be applied to other large pretrained masked diffusion models to test whether continuous denoising emerges at different scales.
  • Selective noise robustness might improve performance on downstream tasks that involve noisy or partially corrupted inputs.
  • If the continuous path generalizes, future work could explore whether the final hard-commitment step can be replaced by a learned projection without harming quality.

Load-bearing premise

That 1000 steps of continued pretraining with continuous Gaussian noise suffice to unlock effective continuous embedding-space denoising and selective robustness without full retraining or loss of core capabilities.

What would settle it

If the 1000-step adapted model still exhibits the same length-quality tradeoff on the four summarization benchmarks or fails to correct corrupted tokens while preserving clean ones, the central claim would not hold.
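
The robustness half of this test reduces to two rates: how often corrupted positions are restored, and how often clean positions are left alone. A minimal sketch of that bookkeeping follows. The function name and the example sequences are hypothetical, and the paper's actual evaluation protocol may differ.

```python
def selective_robustness(clean, corrupted, output):
    """Return (correction_rate, preservation_rate) over token positions.

    correction_rate: share of corrupted positions restored to the clean token.
    preservation_rate: share of untouched positions left unchanged.
    """
    if not (len(clean) == len(corrupted) == len(output)):
        raise ValueError("all three sequences must have the same length")
    corrected = corrupted_total = preserved = clean_total = 0
    for ref, noisy, out in zip(clean, corrupted, output):
        if noisy != ref:
            corrupted_total += 1
            corrected += out == ref
        else:
            clean_total += 1
            preserved += out == ref
    correction_rate = corrected / corrupted_total if corrupted_total else float("nan")
    preservation_rate = preserved / clean_total if clean_total else float("nan")
    return correction_rate, preservation_rate


clean = ["the", "cat", "sat", "on", "the", "mat"]
corrupted = ["the", "dog", "sat", "in", "the", "mat"]  # positions 1 and 3 corrupted
output = ["the", "cat", "sat", "in", "a", "mat"]       # hypothetical model output

correction_rate, preservation_rate = selective_robustness(clean, corrupted, output)
```

Selective robustness in the sense claimed would show up as both rates being high at once. A model that rewrites everything can score well on correction but poorly on preservation, and a model that copies its input does the reverse.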

Figures

Figures reproduced from arXiv: 2606.01024 by Greg Ver Steeg, Hui Liu, Longxuan Yu, Rob Brekelmans, Siheng Xiong, Yue Dong, Yu Fu, Yunshu Wu.

Figure 1: DSL-LLaDA replaces binary masking with continuous per-token noise, enabling SDE-style continuous …
Figure 2: Context robustness (mask=30%, 100 texts).
Figure 3: DSL-LLaDA-SDE avoids the low-NFE failure modes of iterative unmasking, maintaining sub-10% …
Original abstract

Discrete masked diffusion language models generate text by iterative parallel decoding, but few-step decoding suffers from a tradeoff between length and quality: with a fixed step budget, standard methods can generate a short, high-quality output, or they can produce long but repetitive text. Continuous denoising can sidestep this tradeoff by evolving all positions jointly in embedding space, but building such a model from scratch at scale remains an open problem. We show that a pretrained masked DLM can instead be lightly adapted to support continuous embedding-space denoising. Starting from LLaDA-8B-Instruct, we continue-pretrain for only 1,000 steps with Discrete Stochastic Localization (DSL), replacing binary masking with continuous per-token Gaussian noise as a soft mask. The adapted model supports continuous inference that evolves all positions jointly in embedding space and defers hard token commitment to the final step. On zero-shot summarization at low step budgets (<=16 forward passes), DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks and largely avoids the premature-termination / repetition tradeoff of iterative unmasking. The same adaptation also yields selective noisy-state robustness: the model corrects corrupted tokens while preserving clean ones. Control experiments using standard masked diffusion training with the same compute demonstrate neither behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DSL-LLaDA, which adapts the LLaDA-8B-Instruct masked diffusion LM using 1,000 steps of Discrete Stochastic Localization (DSL). This replaces binary masking with continuous per-token Gaussian noise as a soft mask during continued pretraining. The resulting model enables continuous inference in embedding space, deferring token commitment to the end. It demonstrates superior zero-shot summarization performance on four benchmarks at low step budgets (<=16 forward passes) using DSL-LLaDA-SDE, avoiding the length-quality tradeoff of standard iterative unmasking, and shows robustness to noisy states by correcting corrupted tokens while preserving clean ones. Control experiments with standard masked diffusion training at the same compute budget do not exhibit these properties.

Significance. If the empirical results hold, this work provides an efficient method to scale continuous denoising to 8B-parameter masked diffusion models without full retraining from scratch. The short adaptation period and explicit control experiments are strengths, as they attribute the benefits specifically to the DSL approach rather than additional compute. This could advance the field by offering a practical way to achieve joint embedding-space evolution and better few-step generation performance.

major comments (2)
  1. [Abstract] The claim that DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks at low step budgets lacks accompanying quantitative scores, variance estimates, or statistical tests; without these, the magnitude and reliability of the reported gains over iterative unmasking cannot be fully assessed.
  2. [Methods] The central claim that 1,000-step DSL adaptation suffices to unlock continuous embedding-space denoising and noisy-state robustness rests on the assumption that the Gaussian soft mask produces qualitatively different behavior than binary masking; an explicit comparison of the noise schedules or embedding trajectories (e.g., in the methods or appendix) is needed to rule out that the effect reduces to the extra compute alone.
minor comments (2)
  1. Clarify the precise four zero-shot summarization benchmarks and the exact definition of 'low step budgets (<=16 forward passes)' with a table of per-benchmark results.
  2. The control experiment description would benefit from a brief statement of the exact learning rate, batch size, and masking schedule used in the standard masked diffusion baseline to confirm identical compute.
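
For readers weighing the first major comment, the sketch below shows what a per-example ROUGE-1 F1 and a simple variance estimate could look like. It assumes lowercased whitespace tokenization and a percentile bootstrap over examples. Neither is confirmed as the paper's evaluation setup, and the sentence pairs are made up for illustration.

```python
import random
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 with lowercased whitespace tokenization."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def bootstrap_interval(scores, num_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(num_resamples))
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper


# Made-up summary pairs, for illustration only.
pairs = [
    ("the cat sat on the mat", "the cat lay on the mat"),
    ("stocks fell sharply today", "markets rose today"),
    ("rain is expected tomorrow", "rain is expected tomorrow"),
]
scores = [rouge1_f1(c, r) for c, r in pairs]
mean_score = sum(scores) / len(scores)
low, high = bootstrap_interval(scores)
```

Reporting the interval next to each mean, per benchmark and per step budget, would let a reader judge whether a "best ROUGE-1" ranking is separated from the baselines by more than sampling noise.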

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and the constructive feedback. We address each major comment below.

  1. Referee: [Abstract] The claim that DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks at low step budgets lacks accompanying quantitative scores, variance estimates, or statistical tests; without these, the magnitude and reliability of the reported gains over iterative unmasking cannot be fully assessed.

    Authors: We agree that including the specific scores would strengthen the abstract. In the revision we will update the abstract to report the ROUGE-1 values achieved by DSL-LLaDA-SDE (and the main baselines) on each of the four benchmarks at the low step budgets, with a reference to the full tables that already contain variance estimates and evaluation details. revision: yes

  2. Referee: [Methods] The central claim that 1,000-step DSL adaptation suffices to unlock continuous embedding-space denoising and noisy-state robustness rests on the assumption that the Gaussian soft mask produces qualitatively different behavior than binary masking; an explicit comparison of the noise schedules or embedding trajectories (e.g., in the methods or appendix) is needed to rule out that the effect reduces to the extra compute alone.

    Authors: The control experiments already isolate the contribution of DSL from compute: the identical 1,000-step budget and base model are used with standard binary masking, yet neither continuous denoising nor noisy-state robustness appears. This directly attributes the qualitative difference to the Gaussian soft-mask formulation rather than extra training. The methodological distinction between binary masking and continuous per-token Gaussian noise is described in Section 3; we do not consider an additional trajectory comparison necessary to support the claims. revision: no

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical adaptation procedure: a 1,000-step continued pretraining of LLaDA-8B-Instruct that replaces binary masking with continuous per-token Gaussian noise. All reported performance claims (ROUGE scores on summarization, robustness to noise, comparison to masked-diffusion controls) rest on experimental measurements rather than any derivation chain. No equations, uniqueness theorems, or ansatzes are invoked whose validity depends on self-citation or on quantities defined in terms of the target outputs. The control experiments directly test the contribution of the DSL adaptation, confirming that the observed behaviors are not forced by the training setup itself.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on the effectiveness of the DSL adaptation method. The 1000-step count is a free parameter chosen for light adaptation. The assumption that continuous Gaussian noise functions as an effective soft mask is a domain assumption. No invented entities are introduced.

free parameters (2)
  • continued pretraining steps = 1000
    The number of adaptation steps is specified as 1000 and chosen to enable light adaptation from the pretrained model.
  • Gaussian noise parameters
    The specific variance or schedule for the continuous per-token noise is part of the DSL method but not detailed in the abstract.
axioms (1)
  • domain assumption: Continuous per-token Gaussian noise can serve as an effective soft mask enabling joint embedding-space denoising in a pretrained masked DLM.
    This premise underpins the replacement of binary masking and the reported benefits.

pith-pipeline@v0.9.1-grok · 5781 in / 1534 out tokens · 34516 ms · 2026-06-28T17:37:18.218819+00:00 · methodology
