NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
Pith reviewed 2026-05-10 05:53 UTC · model grok-4.3
The pith
A neural indicator picks correct tokens early to cut discrete diffusion sampling steps by an order of magnitude.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We find that fully leveraging correct predictions at each step can reduce the number of sampling iterations by an order of magnitude without compromising accuracy. Based on this observation, we propose Neural Indicator Sampling (NI Sampling), a general sampling order optimization framework that uses a neural indicator to decide which tokens should be sampled at each step, trained via a novel trajectory-preserving objective.
What carries the argument
Neural indicator trained with a trajectory-preserving objective to select which tokens to unmask early
Load-bearing premise
The neural indicator will identify correctly predicted tokens early enough to save steps, without introducing new errors and without adding so much extra computation that the net speedup disappears.
What would settle it
Running NI Sampling on the same LLaDA or Dream benchmarks and observing either a noticeable accuracy drop or no reduction in required steps relative to full-step sampling.
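The motivating observation behind the core claim can be pictured as an oracle experiment: at each step, unmask every masked position whose current argmax already matches a reference decode, and count how many steps remain. Below is a minimal sketch of that experiment under assumed interfaces; `predict_fn`, `reference`, and `mask_id` are illustrative placeholders, not the paper's code.

```python
# Oracle variant of the observation behind NI Sampling: at every step, commit
# all positions whose current prediction already matches a reference decode.
# `predict_fn(seq) -> list[int]` (per-position argmax of the dLLM), `reference`
# (the output of a full-step decode), and `mask_id` are hypothetical stand-ins.

def oracle_steps(seq, reference, predict_fn, mask_id):
    """Count steps needed when every currently-correct token is unmasked at once."""
    seq = list(seq)
    steps = 0
    while mask_id in seq:
        steps += 1
        preds = predict_fn(seq)            # per-position argmax predictions
        committed = False
        for i, tok in enumerate(seq):
            if tok == mask_id and preds[i] == reference[i]:
                seq[i] = preds[i]          # commit every already-correct token
                committed = True
        if not committed:                  # no correct prediction this step:
            i = seq.index(mask_id)         # fall back to unmasking one position,
            seq[i] = preds[i]              # as a plain heuristic sampler would
    return steps, seq
```

If the step count returned by this oracle is roughly ten times smaller than the full-step schedule, the claimed headroom exists; the neural indicator is the paper's attempt to approximate the oracle without access to the reference.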
Original abstract
Discrete diffusion language models (dLLMs) have recently emerged as a promising alternative to traditional autoregressive approaches, offering the flexibility to generate tokens in arbitrary orders and the potential of parallel decoding. However, existing heuristic sampling strategies remain inefficient: they choose only a small fraction of tokens to sample at each step, leaving substantial room for improvement. In this work, we study the problem of token sampling order optimization and demonstrate its significant potential for acceleration. Specifically, we find that fully leveraging correct predictions at each step can reduce the number of sampling iterations by an order of magnitude without compromising accuracy. Based on this, we propose Neural Indicator Sampling (NI Sampling), a general sampling order optimization framework that utilizes a neural indicator to decide which tokens should be sampled at each step. We further propose a novel trajectory-preserving objective to train the indicator. Experiments on LLaDA and Dream models across multiple benchmarks show that our method achieves up to 14.3× acceleration over full-step sampling with negligible performance drop, and consistently outperforms confidence threshold sampling in the accuracy-step trade-off. Code is available at https://github.com/imagination-research/NI-Sampling.
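Read operationally, the abstract describes a decode loop in which a selection rule decides which masked positions to commit at each step: a confidence threshold in the baseline, a learned indicator in NI Sampling. The sketch below shows such a loop with a pluggable rule; `model` and `indicator` are hypothetical callables standing in for the released interface, not the paper's actual API.

```python
# Generic masked-diffusion decode loop with a pluggable token-selection rule.
# `model(seq)` is assumed to return (per-position argmax tokens, confidences);
# `indicator(seq)` is assumed to return per-position commit probabilities.

def decode(seq, model, select, mask_id, max_steps=1024):
    """Decode loop: each step, commit the positions chosen by `select`."""
    seq = list(seq)
    for step in range(1, max_steps + 1):
        if mask_id not in seq:
            return seq, step - 1
        preds, conf = model(seq)                 # argmax tokens and confidences
        masked = [i for i, t in enumerate(seq) if t == mask_id]
        chosen = select(seq, masked, conf) or [max(masked, key=lambda i: conf[i])]
        for i in chosen:
            seq[i] = preds[i]                    # commit the selected positions
    return seq, max_steps

def threshold_rule(tau=0.9):
    """Baseline: commit every masked position whose confidence exceeds tau."""
    return lambda seq, masked, conf: [i for i in masked if conf[i] >= tau]

def indicator_rule(indicator, tau=0.5):
    """NI-style rule: commit the positions a learned indicator flags as safe."""
    return lambda seq, masked, conf: [i for i in masked if indicator(seq)[i] >= tau]
```

A confidence-threshold run would call `decode(seq, model, threshold_rule(0.9), mask_id)`; the NI variant swaps in `indicator_rule(indicator)` while leaving the loop unchanged, which is what makes the framework "general" in the abstract's sense.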
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Neural Indicator Sampling (NI Sampling) for discrete diffusion language models (dLLMs). It claims that a neural network trained via a trajectory-preserving objective can optimize token sampling order by identifying correct tokens early, reducing sampling iterations by up to an order of magnitude. Experiments on LLaDA and Dream models report up to 14.3× acceleration over full-step sampling with negligible accuracy drop and better accuracy-step trade-offs than confidence-threshold baselines. Code is released.
Significance. If the net wall-clock and FLOPs speedups hold after accounting for indicator overhead, the work could meaningfully advance practical deployment of dLLMs by making parallel decoding far more efficient. The empirical focus on trajectory preservation and the public code release are positive for reproducibility.
major comments (2)
- [abstract and §4] The central speedup claim is reported in iteration count (up to 14.3× fewer steps), but the paper provides no per-step FLOPs breakdown or wall-clock measurements that subtract the cost of the additional forward pass through the neural indicator. If the indicator is even 10-20% of the size of the base dLLM, the net acceleration may be substantially lower than claimed; this is load-bearing for the practical contribution.
- [§3.2] Trajectory-preserving objective: the training loss is defined on full trajectories, but it is not shown that the learned indicator avoids introducing new errors that propagate in the reduced-step regime. The load-bearing premise noted above flags the same risk; an ablation on error rates when early sampling decisions are forced would be needed to support the 'negligible performance drop' claim.
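The ablation requested in the second comment can be stated concretely: force the commitments an indicator-driven sampler makes early, and measure how often they disagree with the corresponding tokens of a reference full-step decode. A minimal sketch follows; `full_step_output` and `early_commits` are illustrative stand-ins, not interfaces from the paper.

```python
# Sketch of the requested ablation: compare tokens committed early by the
# indicator-driven sampler against the same positions in a reference full-step
# decode, and report how often forcing the early decision introduces a mismatch.

def forced_commit_error_rate(full_step_output, early_commits):
    """Fraction of early commitments that disagree with the full-step reference.

    early_commits: dict mapping position -> token committed before the
    full-step schedule would have sampled it.
    """
    if not early_commits:
        return 0.0
    wrong = sum(1 for pos, tok in early_commits.items()
                if tok != full_step_output[pos])
    return wrong / len(early_commits)

# Toy example: two of three early commitments match the reference decode.
print(forced_commit_error_rate([5, 9, 2, 7], {0: 5, 2: 2, 3: 4}))  # ~0.33
```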
minor comments (1)
- [§3.1] Notation for the indicator network (e.g., how its output mask is combined with the diffusion model's logits) could be clarified with a small diagram or explicit equation in §3.1.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [abstract and §4] The central speedup claim is reported in iteration count (up to 14.3× fewer steps), but the paper provides no per-step FLOPs breakdown or wall-clock measurements that subtract the cost of the additional forward pass through the neural indicator. If the indicator is even 10-20% of the size of the base dLLM, the net acceleration may be substantially lower than claimed; this is load-bearing for the practical contribution.
Authors: We agree that iteration count alone does not fully capture practical acceleration and that a FLOPs/wall-clock analysis accounting for indicator overhead is necessary to substantiate the contribution. The manuscript emphasizes iteration reduction because it is the direct result of the token-order optimization, but we acknowledge the referee's point that net efficiency must be demonstrated. In the revised manuscript we will add a per-step FLOPs breakdown (base model vs. indicator) together with wall-clock timings on standard hardware, allowing readers to compute the true net speedup. Revision: yes. (A rough illustration of this accounting follows these responses.)
- Referee: [§3.2] Trajectory-preserving objective: the training loss is defined on full trajectories, but it is not shown that the learned indicator avoids introducing new errors that propagate in the reduced-step regime. The load-bearing premise noted above flags the same risk; an ablation on error rates when early sampling decisions are forced would be needed to support the 'negligible performance drop' claim.
Authors: The trajectory-preserving objective is explicitly constructed to align the indicator's decisions with the tokens that would have been correctly sampled under the full trajectory, thereby limiting deviation from the original path. The reported experiments already show that final accuracy remains essentially unchanged despite the large reduction in steps, which indirectly supports limited error propagation. Nevertheless, we accept that a direct ablation measuring error rates under forced early decisions would provide stronger evidence. We will add this ablation to the revised §3.2 and §4. Revision: yes.
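The accounting discussed in the first exchange reduces to simple arithmetic: the iteration-count speedup must be discounted by the relative per-step cost of the indicator. A minimal sketch with illustrative numbers only; the 10-20% overhead figures come from the referee's hypothetical, not from measurements in the paper, and the model assumes the indicator runs once per step alongside the base model.

```python
def net_speedup(full_steps, reduced_steps, indicator_overhead):
    """Iteration speedup discounted by per-step indicator cost.

    indicator_overhead: indicator forward cost as a fraction of one base-model
    forward pass (e.g. 0.1 means the indicator adds 10% per step).
    """
    step_speedup = full_steps / reduced_steps
    return step_speedup / (1.0 + indicator_overhead)

# Illustrative numbers only: a 14.3x step reduction with a 10% or 20% indicator
# overhead would still net roughly 13x or 11.9x end-to-end.
print(net_speedup(1024, 1024 / 14.3, 0.1))   # ~13.0
print(net_speedup(1024, 1024 / 14.3, 0.2))   # ~11.9
```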
Circularity Check
Empirical training and measurement; no circular derivation
Full rationale
The paper's core contribution is an empirical framework: a neural indicator is trained on full trajectories using a proposed objective, then used at inference to select tokens for early sampling. Acceleration results (up to 14.3×) are obtained from direct benchmarks on LLaDA and Dream models rather than from any closed-form derivation or parameter fit that is later renamed a prediction. No equation reduces the indicator output or the speedup claim to a quantity defined in terms of itself, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The method is validated directly against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural indicator network hyperparameters
axioms (1)
- domain assumption: A neural network can be trained to predict which tokens will be correctly generated at the current diffusion step.
invented entities (1)
- Neural Indicator (no independent evidence)
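The single domain assumption can be made concrete: from a reference full-step decode one can derive per-position binary targets ("is the current prediction already correct?") and fit the indicator with an ordinary binary cross-entropy loss. The sketch below shows one plausible target construction and loss; it is not necessarily the paper's trajectory-preserving objective, and `predict_fn`, `reference`, and `mask_id` are assumed stand-ins.

```python
import math

# One plausible way to build indicator training targets from a reference decode:
# label a masked position 1 if the base model's current argmax already equals the
# token the reference full-step trajectory eventually places there, else 0.

def indicator_targets(seq, reference, predict_fn, mask_id):
    preds = predict_fn(seq)  # per-position argmax of the base dLLM
    return [
        1 if (tok == mask_id and preds[i] == reference[i]) else 0
        for i, tok in enumerate(seq)
    ]

def bce_loss(indicator_probs, targets):
    """Binary cross-entropy between indicator outputs and correctness targets."""
    eps = 1e-9
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(indicator_probs, targets)) / len(targets)
```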
Reference graph
Works this paper leans on
- [1] Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems.
- [2] Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv:2310.16834.
- [3] Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems.
- [4] Large language diffusion models. arXiv:2502.09992.
- [5] Dream 7B: Diffusion large language models. arXiv:2508.15487.
- [6] LLaDA 1.5: Variance-reduced preference optimization for large language diffusion models. arXiv:2505.19223.
- [7] Train for the worst, plan for the best: Understanding token ordering in masked diffusions. arXiv:2502.06768.
- [8] Mercury: Ultra-fast language models based on diffusion. arXiv:2506.17298, 2025.
- [9] Seed Diffusion: A large-scale diffusion language model with high-speed inference. arXiv:2508.02193.
- [10] Score-based generative modeling through stochastic differential equations. arXiv:2011.13456.
- [11] Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems.
- [12] Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning, 2015.
- [13] Distilled Decoding 1: One-step sampling of image auto-regressive models with flow matching. arXiv:2412.17153.
- [14] Distilled Decoding 2: One-step sampling of image auto-regressive models with conditional score distillation. arXiv:2510.21003.
- [15] Fast-dLLM: Training-free acceleration of diffusion LLM by enabling KV cache and parallel decoding. arXiv:2505.22618.
- [16] A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems.
- [17] Concrete score matching: Generalized score matching for discrete data. Advances in Neural Information Processing Systems.
- [18] MaskGIT: Masked generative image transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [19] Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv:2503.09573.
- [20] DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems.
- [21] DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, 2025.
- [22] UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. Advances in Neural Information Processing Systems.
- [23] Fast sampling of diffusion models with exponential integrator. arXiv:2204.13902.
- [24] dLLM-Cache: Accelerating diffusion large language models with adaptive caching. arXiv:2506.06295.
- [25] A unified sampling framework for solver searching of diffusion probabilistic models. The Twelfth International Conference on Learning Representations.
- [26] Fast ODE-based sampling for diffusion models in around 5 steps. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [27] Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. Findings of the Association for Computational Linguistics: EMNLP 2023.
- [28] Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv:2401.10774.
- [29] EAGLE: Speculative sampling requires rethinking feature uncertainty. arXiv:2401.15077.
- [30] Accelerating auto-regressive text-to-image generation with training-free speculative Jacobi decoding. arXiv:2410.01699.
- [31] EAGLE-2: Faster inference of language models with dynamic draft trees. arXiv:2406.16858.
- [32] EAGLE-3: Scaling up inference acceleration of large language models via training-time test. arXiv:2503.01840.
- [33] CITER: Collaborative inference for efficient large language model decoding with token-level routing. arXiv:2502.01976.
- [34] R2R: Efficiently navigating divergent reasoning paths with small-large model token routing. arXiv:2505.21600.
- [35] Scaling up masked diffusion models on text. arXiv:2410.18514.
- [36] Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems.
- [37] DiffusionBERT: Improving generative masked language models with diffusion models. arXiv:2211.15029.
- [38] DiffusER: Discrete diffusion via edit-based reconstruction. arXiv:2210.16886.
- [39] Score-based continuous-time discrete diffusion models. arXiv:2211.16750.
- [40] Autoregressive image generation without vector quantization. arXiv:2406.11838.
- [41] Improved masked image generation with Token-Critic. European Conference on Computer Vision, 2022.
- [42] Discrete predictor-corrector diffusion models for image synthesis. The Eleventh International Conference on Learning Representations.
- [43] Texygen: A benchmarking platform for text generation models. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval.
- [44] MAUVE: Measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems.
- [45] Simplified and generalized masked diffusion for discrete data. Advances in Neural Information Processing Systems.