pith. sign in

arxiv: 2605.31215 · v1 · pith:Z6LCDAHZnew · submitted 2026-05-29 · 💻 cs.LG · cs.CV

Fixed-Point Masked Generative Modeling

Pith reviewed 2026-06-28 23:16 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords masked generative modelsfixed-point iterationadaptive computationcross-step consistencyefficient traininggenerative modelingdenoising
0
0 comments X

The pith

Fixed-point solvers over shared attention layers let masked generative models adapt depth and cut parameters while raising low-budget quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces fixed-depth denoisers in masked generative models with a fixed-point solver that iterates over shared attention layers. This change supports adaptive computation depth using fewer total parameters. A cross-step consistency loss aligns representations across denoising steps, and three-state reuse warm-starts each solve by distinguishing unchanged, still-masked, and newly revealed tokens. The resulting CoFRe framework trains faster, uses less memory, and produces better samples than prior masked models when the sampling budget is limited. Pre-trained masked models can also be turned into the new form with brief fine-tuning rather than full retraining.

Core claim

Fixed-Point Masked Generative Models replace part of the denoiser with a fixed-point solver over shared attention layers, augmented by a cross-step consistency loss and three-state reuse, to achieve adaptive depth, fewer parameters, lower training cost, and stronger performance under restricted sampling budgets across text and image tasks.

What carries the argument

Fixed-point solver over shared attention layers that iterates to convergence instead of using a fixed number of steps, combined with cross-step consistency loss and three-state reuse for stability in masked generation.

If this is right

  • Training time and VRAM drop substantially while generative quality rises at fixed low budgets.
  • Pre-trained masked models convert to the fixed-point form via short fine-tuning without full retraining.
  • The same pattern improves both text perplexity and image FID scores when compute is constrained.
  • Parameter count falls by roughly 39 percent while the model still outperforms the baseline at the same forward-pass budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reuse pattern may extend to other iterative refinement processes that currently fix the number of steps per sample.
  • Adaptive depth could reduce energy use on edge devices where total forward passes are the main cost driver.
  • The consistency loss might stabilize training when sequence lengths grow beyond current test regimes.

Load-bearing premise

The fixed-point solver over the shared layers converges reliably and the added losses and reuse do not create instability or bias that would erase the reported quality gains.

What would settle it

A controlled run in which raising the allowed fixed-point iterations fails to improve or actively harms sample quality at every tested low budget would falsify the benefit of the adaptive solver.

Figures

Figures reproduced from arXiv: 2605.31215 by Alba Carballo-Castro, Andrea Miele, Justin Deschenaux, Pascal Frossard, Yiming Qin.

Figure 1
Figure 1. Figure 1: FP-MDLM and CoFRe improve the quality–cost trade-off on OWT. (Left) Generative perplexity across forward-pass budgets, with entropy in parentheses; CoFRe gives the best quality at all shown budgets. (Right) Relative to MDLM, FP-MDLM and CoFRe use fewer parameters, less training time, and less VRAM. 1 Introduction Masked generative models (MGMs) generate sequences by iteratively denoising masked tokens, en￾… view at source ↗
Figure 2
Figure 2. Figure 2: Training and sampling for fixed-point masked generative models. (Left) During training, FP-MGMs keep the masked modeling objective while replacing the middle transformer stack with an iterated shared fixed-point block. For cross-step consistency, correlated masks from the same clean sequence define a noisier student state and cleaner teacher state (tc < ts); the model is trained with the base cross-entropy… view at source ↗
Figure 3
Figure 3. Figure 3: Token transition type determines how reusable fixed-point states are. Newly revealed tokens move much more than stable to￾kens, motivating strong reuse for visible tokens, partial reuse for masked tokens, and weak reuse for newly revealed tokens. However, full reuse applies the same initialization rule to all positions, implicitly assuming that the previous fixed-point solution remains equally well aligned… view at source ↗
Figure 4
Figure 4. Figure 4: Short from-scratch adaptation im￾proves FP-MDLM on OWT. Adapted FP￾MDLM improves generation quality at every budget tested. The adaptation also improves the behaviour of reuse. For the baseline FP-MDLM, reuse is incon￾sistent and can hurt generation quality at larger budgets. After adaptation, however, reuse be￾comes beneficial in the medium- and high-budget regimes: both full reuse and three-state reuse i… view at source ↗
Figure 5
Figure 5. Figure 5: Effect of different warm-start of the fixed-point on FP-MDLM base (Left) and FP￾MDLM+LCONS (Right). We isolate two design choices that are not covered by the main end-to-end results: two main ingredi￾ents of CoFRe: three-state reuse and consistency loss, and the pretrained layer initialization used dur￾ing checkpoint adaptation. Each ablation changes only the component under study while keeping the trainin… view at source ↗
Figure 6
Figure 6. Figure 6: Representation similarity in a pretrained MDLM. We compute Linear CKA between residual-stream activations of all transformer layers at timesteps t ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, then average the similarities across timesteps. The heatmap shows a clear two-stage structure: layers 1–5 form an early block, layers 6–12 form a highly self-similar late block, and cross-block similarity is low. The consecutive-laye… view at source ↗
Figure 7
Figure 7. Figure 7: Adapting an MDLM checkpoint into FP-MDLM. (Left) FP-MDLM is initialized by mapping layers from a pretrained MDLM checkpoint to the preprocessing, fixed-point, and postprocessing blocks (see Appendix C.3.1). (Right) we then run a short adaptation stage with a teacher–student KL loss on logits, using correlated masks at two nearby noise levels, where the teacher input is less noisy than the student input. C.… view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of the training loss between no initialization and initialization with pretrained [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Generative perplexity as a function of sampling budget on OpenWebText. We compare CoFRe against PGM 6/6, PGM 6/6 with SDTT, and IDLM-MDLM [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Generated samples using CoFRe with a budget of 460. [PITH_FULL_IMAGE:figures/full_fig_p034_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Lagged logit analysis. (Left) output-head-projected hidden-state changes decrease as the number of sampling steps increases, for both the baseline and the LCONS model. (Right) relative reduction in lagged logit KL from LCONS compared to the baseline, measured between a student denoising step s and a cleaner future step s + ℓ. The consistency-trained model reduces lagged logit KL across lags and sampling-s… view at source ↗
Figure 12
Figure 12. Figure 12: Checkpoint selection for the LCONS post-training stage. (Left) Generative perplexity across budgets improves rapidly during early consistency training, but later checkpoints over-sharpen the model, as reflected by collapsing entropy values shown in parentheses. Sampling is done without warm-start (e.g. no reuse) (Right) Validation perplexity is not monotonic: it first rises above the 15% threshold, then c… view at source ↗
Figure 13
Figure 13. Figure 13: Tradeoff between denoising steps and fixed-point iterations. We sweep the number of denoising steps and FP iterations for FP-MDLM with consistency regularization and three-state reuse. The left heatmap reports generative perplexity, where lower is better, and the right heatmap reports sample entropy. Each annotated cell corresponds to one evaluated allocation; blank cells were not evaluated. Allocating co… view at source ↗
Figure 14
Figure 14. Figure 14: Generative perplexity as a function of budget for different denoising strategies. Entropy values are shown in parentheses. We observe that decreasing schedules perform best overall. F.11 Latency and generation quality for language modeling We measure generation-only sampling latency, defined as the wall-clock time from fully masked token IDs to final generated token IDs. The timed region includes all deno… view at source ↗
Figure 15
Figure 15. Figure 15: Generation quality as a function of wall-clock sampling latency on OWT. We report generation-only latency, measured from fully masked token IDs to final generated token IDs, excluding decoding and external Gen. PPL evaluation. Points are annotated by their transformer-block budget. CoFRe is modestly slower than MDLM+SDTT at matched budget, but reaches substantially lower Gen. PPL at lower wall-clock laten… view at source ↗
Figure 16
Figure 16. Figure 16: Fixed-point residual analysis. (Left) Mean relative residual decreases with FP iterations, showing that the repeated block approaches a fixed point. Full reuse starts much closer to the solution than no reuse. (Right) Across denoising steps, reuse strongly reduces the initial residual and yields lower final residuals under the same iteration budget. Residuals validate the solver and warm-start mechanism, … view at source ↗
read the original abstract

Masked Generative Models (MGMs) enable parallel decoding and achieve strong performance across modalities, but require full-sequence bidirectional transformers at every step, making training costly and degrading quality under low sampling budgets. Existing work improves efficiency via better samplers or cheaper fixed-depth denoisers, but they still allocate a fixed amount of denoiser computation to each refinement step. We introduce Fixed-Point Masked Generative Models (FP-MGMs), which replace part of the denoiser with a fixed-point solver over shared attention layers to enable adaptive depth with fewer parameters. To make it more effective for masked generation, we first introduce a cross-step consistency loss, which aligns hidden representations at neighboring denoising steps and, second, three-state reuse (3SR) which warm-starts the solver using the previous solution by treating differently unchanged, still-masked, and newly revealed tokens respectively. Together, these components define our complete training-to-inference framework for fixed-point masked generation, \emph{CoFRe}. We also show that pre-trained MGMs can be converted into FP-MGMs with short fine-tuning, avoiding full retraining. Across modalities, CoFRe improves the quality and cost trade-off. On OpenWebText, CoFRe reduces parameters by 38.8\%, training time by 11.5\%, and VRAM by 16.9\%, while improving generative perplexity from 830.8 to 101.8 at a budget of $96$ transformer-block forward passes, compared to MDLM. In ImageNette, CoFRe reduces training time by 48.6\% and VRAM by 50.7\%, while improving FID in all sample budgets tested. Overall, CoFRe offers a practical framework for cheaper training and stronger low-budget masked generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Fixed-Point Masked Generative Models (FP-MGMs) that replace part of the denoiser with a fixed-point solver over shared attention layers for adaptive depth and fewer parameters in masked generative modeling. It adds a cross-step consistency loss to align hidden states across denoising steps and three-state reuse (3SR) to warm-start the solver by distinguishing unchanged, still-masked, and newly revealed tokens. These form the CoFRe framework, which also supports short fine-tuning of pre-trained MGMs. The paper claims substantial efficiency gains (38.8% fewer parameters, 11.5% less training time, 16.9% less VRAM on OpenWebText; 48.6% less time and 50.7% less VRAM on ImageNette) and quality improvements (perplexity 830.8 to 101.8 at 96 block passes; better FID across budgets) versus MDLM.

Significance. If the results hold after verification, the work offers a practical route to lower training and inference costs in masked generative models while improving low-budget quality across text and images. The fine-tuning conversion path and explicit handling of adaptive depth via fixed-point iteration are potentially useful contributions if shown to be stable.

major comments (3)
  1. Abstract: the headline claims (perplexity drop from 830.8 to 101.8 at 96 passes; FID gains across budgets) rest on the fixed-point solver converging reliably and the consistency loss plus 3SR introducing no systematic bias, yet the manuscript supplies no residual-norm diagnostics, per-step iteration counts, or ablation that isolates the solver from the auxiliary objective.
  2. Abstract: the reported parameter (38.8%), time (11.5%), and VRAM (16.9%) reductions are presented without error bars, multiple random seeds, or controls confirming that the gains derive from the fixed-point mechanism rather than the new training losses alone.
  3. Abstract: no analysis is given of whether 3SR biases the learned distribution toward previously revealed tokens at low sampling budgets, which would undermine the quality-cost trade-off claim if present.
minor comments (2)
  1. Abstract: the acronym CoFRe is introduced without expansion.
  2. Abstract: comparisons are limited to MDLM; additional baselines would clarify the scope of the improvement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of empirical validation that we will strengthen in the revision. We address each major comment below.

read point-by-point responses
  1. Referee: Abstract: the headline claims (perplexity drop from 830.8 to 101.8 at 96 passes; FID gains across budgets) rest on the fixed-point solver converging reliably and the consistency loss plus 3SR introducing no systematic bias, yet the manuscript supplies no residual-norm diagnostics, per-step iteration counts, or ablation that isolates the solver from the auxiliary objective.

    Authors: We agree that explicit convergence diagnostics and an isolating ablation would strengthen the claims. In the revised manuscript we will add residual-norm plots across denoising steps, average per-step iteration counts, and an ablation that trains with the consistency loss and 3SR but replaces the fixed-point solver with a standard fixed-depth denoiser. These additions will directly verify reliable convergence and separate the solver's contribution from the auxiliary objectives. revision: yes

  2. Referee: Abstract: the reported parameter (38.8%), time (11.5%), and VRAM (16.9%) reductions are presented without error bars, multiple random seeds, or controls confirming that the gains derive from the fixed-point mechanism rather than the new training losses alone.

    Authors: The reported efficiency numbers compare the final CoFRe configuration against the MDLM baseline. To address the concern, the revision will include results over three random seeds with standard-error bars. We will also add a control experiment that applies the consistency loss and 3SR to a standard MGM without the fixed-point solver, allowing direct attribution of the parameter, time, and VRAM savings to the solver itself. revision: yes

  3. Referee: Abstract: no analysis is given of whether 3SR biases the learned distribution toward previously revealed tokens at low sampling budgets, which would undermine the quality-cost trade-off claim if present.

    Authors: The design of 3SR explicitly distinguishes unchanged, still-masked, and newly revealed tokens to preserve the original sampling distribution while accelerating convergence; the observed improvements in low-budget perplexity and FID are consistent with this intent. Nevertheless, we will add a targeted analysis in the revision that compares the empirical distribution of token reveal orders and per-token marginal probabilities between CoFRe and the baseline at low budgets (e.g., 32 and 64 passes) to rule out systematic bias. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains are independent comparisons, not reductions to fitted inputs or self-citations

full rationale

The paper introduces FP-MGMs and the CoFRe framework via a fixed-point solver over shared attention layers, a cross-step consistency loss, and three-state reuse (3SR). Reported improvements (parameter reduction, training time, VRAM, perplexity from 830.8 to 101.8, FID) are framed as direct empirical comparisons to the MDLM baseline at fixed compute budgets. No equations, derivations, or first-principles claims are present that reduce a result to its own inputs by construction, nor is any load-bearing premise justified solely by overlapping self-citation. The central claims rest on experimental outcomes rather than algebraic equivalence or fitted-parameter renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claims rest on the unstated assumption that the fixed-point solver integrates stably with masked token dynamics.

pith-pipeline@v0.9.1-grok · 5870 in / 1220 out tokens · 29713 ms · 2026-06-28T23:16:44.483874+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

82 extracted references · 63 canonical work pages · 21 internal anchors

  1. [1]

    Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA.Inter- national Conference on Learning Representations (ICLR), 2025

    Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA.Inter- national Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/ abs/2410.20672. 9

  2. [2]

    Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation.Advances in Neural Information Processing Systems (NeurIPS), 2025

    Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, and Se-Young Yun. Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation.Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://arxiv.org/abs/ 2507.10524. 9

  3. [3]

    Zico Kolter, and Vladlen Koltun

    Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep Equilibrium Models.Advances in Neural Information Processing Systems (NeurIPS), 2019. URL https://arxiv.org/abs/ 1909.01377v2. 2, 4, 9

  4. [4]

    Fixed Point Diffusion Models.Conference on Computer Vision and Pattern Recognition (CVPR), 2024

    Xingjian Bai and Luke Melas-Kyriazi. Fixed Point Diffusion Models.Conference on Computer Vision and Pattern Recognition (CVPR), 2024. URL http://arxiv.org/abs/2401.08741. 2, 4, 5, 9, 19, 24

  5. [5]

    Halton Scheduler For Masked Generative Image Transformer.International Conference on Learning Representations (ICLR), 2025

    Victor Besnier, Mickael Chen, David Hurych, Eduardo Valle, and Matthieu Cord. Halton Scheduler For Masked Generative Image Transformer.International Conference on Learning Representations (ICLR), 2025. URL http://arxiv.org/abs/2503.17076. 2, 4, 7, 9, 25, 26

  6. [6]

    Self-Speculative Masked Diffusions.International Conference on Learning Representations (ICLR), 2026

    Andrew Campbell, Valentin De Bortoli, Jiaxin Shi, and Arnaud Doucet. Self-Speculative Masked Diffusions.International Conference on Learning Representations (ICLR), 2026. URL http://arxiv.org/abs/2510.03929. 9

  7. [7]

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked Generative Image Transformer.Conference on Computer Vision and Pattern Recognition (CVPR), 2022. URLhttp://arxiv.org/abs/2202.04200. 2, 4

  8. [8]

    One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

    Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling.arXiv preprint arXiv:1312.3005, 2014. URL http://arxiv.org/abs/1312.3005. 25

  9. [9]

    Masked Diffusion Models as Energy Minimization.Advances in Neural Information Processing Systems (NeurIPS), 2025

    Sitong Chen, Shen Nie, Jiacheng Sun, Zijin Feng, Zhenguo Li, Ji-Rong Wen, and Chongxuan Li. Masked Diffusion Models as Energy Minimization.Advances in Neural Information Processing Systems (NeurIPS), 2025. URLhttp://arxiv.org/abs/2509.13866. 2 10

  10. [10]

    SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond.arXiv preprint arXiv:2406.17672, 2024

    Marco Comunità, Zhi Zhong, Akira Takahashi, Shiqi Yang, Mengjie Zhao, Koichi Saito, Yukara Ikemiya, Takashi Shibuya, Shusuke Takahashi, and Yuki Mitsufuji. SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond.arXiv preprint arXiv:2406.17672, 2024. URLhttp://arxiv.org/abs/2406.17672. 2

  11. [11]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision Transformers Need Registers.International Conference on Learning Representations (ICLR), 2024. URL http://arxiv.org/abs/2309.16588. 26

  12. [12]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Univer- sal Transformers.International Conference on Learning Representations (ICLR), 2019. URL http://arxiv.org/abs/1807.03819. 2, 9

  13. [13]

    Imagenet: A large-scale hierarchical image database.Conference on Computer Vision and Pattern Recognition (CVPR),

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database.Conference on Computer Vision and Pattern Recognition (CVPR),

  14. [14]

    URLhttps://ieeexplore.ieee.org/document/5206848. 3, 7

  15. [15]

    Promises, outlooks and challenges of Diffusion Language Modeling.arXiv preprint arXiv:2406.11473, 2024

    Justin Deschenaux and Caglar Gulcehre. Promises, outlooks and challenges of Diffusion Language Modeling.arXiv preprint arXiv:2406.11473, 2024. URL https://arxiv.org/ abs/2406.11473. 2, 25

  16. [16]

    Beyond autoregression: Fast LLMs via Self-Distillation Through Time.International Conference on Learning Representations (ICLR), 2025

    Justin Deschenaux and Caglar Gulcehre. Beyond autoregression: Fast LLMs via Self-Distillation Through Time.International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2410.21035. 3, 9, 26, 32, 38

  17. [17]

    Partition Generative Modeling: Masked Modeling Without Masks.International Conference on Learning Representations (ICLR), 2026

    Justin Deschenaux, Lan Tran, and Caglar Gulcehre. Partition Generative Modeling: Masked Modeling Without Masks.International Conference on Learning Representations (ICLR), 2026. URLhttp://arxiv.org/abs/2505.18883. 2, 9, 27, 31, 32

  18. [18]

    Continuous diffusion for categorical data

    Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H. Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, Curtis Hawthorne, Rémi Leblond, Will Grathwohl, and Jonas Adler. Continuous diffusion for cate- gorical data.arXiv preprint arXiv:2211.15089, 2022. URL https://arxiv.org/abs/2211. 15089. 27

  19. [20]

    URLhttp://arxiv.org/abs/2602.06849. 9

  20. [21]

    JFB: Jacobian-Free Backpropagation for Implicit Networks.Association for the Advancement of Artificial Intelligence (AAAI), 2022

    Samy Wu Fung, Howard Heaton, Qiuwei Li, Daniel McKenzie, Stanley Osher, and Wotao Yin. JFB: Jacobian-Free Backpropagation for Implicit Networks.Association for the Advancement of Artificial Intelligence (AAAI), 2022. URLhttps://arxiv.org/abs/2103.12803v4. 19

  21. [22]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  22. [23]

    Zico Kolter

    Zhengyang Geng, Ashwini Pokle, and J. Zico Kolter. One-Step Diffusion Distillation via Deep Equilibrium Models.Advances in Neural Information Processing Systems (NeurIPS), 2023. URLhttps://arxiv.org/abs/2401.08639. 9

  23. [24]

    Lee, and Dimitris Papailiopoulos

    Angeliki Giannou, Shashank Rajput, Jy yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers.International Conference on Machine Learning (ICML), 2023. URLhttps://arxiv.org/abs/2301.13196. 2

  24. [25]

    OpenWebText corpus

    Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github. io/OpenWebTextCorpus, 2019. 3, 7

  25. [26]

    Scaling Diffusion Language Models via Adaptation from Autoregressive Models

    Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, Hao Peng, and Lingpeng Kong. Scaling Diffusion Language Models via Adaptation from Autoregressive Models.International Conference on Learning Representations (ICLR), 2025. URLhttp://arxiv.org/abs/2410.17891. 9 11

  26. [27]

    J. H. Halton. Algorithm 247: Radical-inverse quasi-random point sequence.Communications of the ACM, 7(12):701–702, 1964. ISSN 0001-0782. doi: 10.1145/355588.365104. URL https://doi.org/10.1145/355588.365104. 4, 26

  27. [28]

    Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion

    Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, and Yuki Mitsufuji. Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion. Transactions on Machine Learning Research, 2026. URL http://arxiv.org/abs/2510. 04525. 2

  28. [29]

    Reasoning with Latent Tokens in Diffusion Language Models.arXiv preprint arXiv:2602.03769, 2026

    Andre He, Sean Welleck, and Daniel Fried. Reasoning with Latent Tokens in Diffusion Language Models.arXiv preprint arXiv:2602.03769, 2026. URL http://arxiv.org/abs/ 2602.03769. 2

  29. [30]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in Neural Information Processing Systems (NeurIPS), 2017. URL https://arxiv.org/abs/ 1706.08500. 7

  30. [31]

    The curious case of neural text degeneration.International Conference on Learning Representations (ICLR), 2020

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/forum?id=rygGQyrFvH. 31

  31. [32]

    arXiv preprint arXiv:2510.05725 , url =

    Chunsan Hong, Seonho An, Min-Soo Kim, and Jong Chul Ye. Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies.International Conference on Learning Representations (ICLR), 2026. URLhttp://arxiv.org/abs/2510.05725. 9

  32. [33]

    Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD.arXiv preprint arXiv:2603.20155, 2026

    Emiel Hoogeboom, David Ruhe, Jonathan Heek, Thomas Mensink, and Tim Salimans. Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD.arXiv preprint arXiv:2603.20155, 2026. URLhttps://arxiv.org/abs/2603.20155. 9

  33. [34]

    Imagenette: A smaller subset of 10 easily classified classes from ImageNet

    Jeremy Howard. Imagenette: A smaller subset of 10 easily classified classes from ImageNet. https://github.com/fastai/imagenette, 2019. 3, 7

  34. [35]

    Learning Unmasking Policies for Diffusion Language Models

    Metod Jazbec, Theo X. Olausson, Louis Béthune, Pierre Ablin, Michael Kirchhof, João Monteiro, Victor Turrisi, Jason Ramapuram, and Marco Cuturi. Learning Unmasking Policies for Diffusion Language Models.arXiv preprint arXiv:2512.09106, 2026. URL http://arxiv.org/abs/2512.09106. 9

  35. [36]

    LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation.International Conference on Learning Representations (ICLR), 2026

    Ahmadreza Jeddi, Marco Ciccone, and Babak Taati. LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation.International Conference on Learning Representations (ICLR), 2026. URLhttps://arxiv.org/abs/2602.11451. 9

  36. [37]

    Less is More: Recursive Reasoning with Tiny Networks

    Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks.arXiv preprint arXiv:2510.04871, 2025. URLhttps://arxiv.org/abs/2510.04871. 2

  37. [38]

    arXiv preprint arXiv:2502.06768 , archiveprefix =

    Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, and Sitan Chen. Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions.International Conference on Machine Learning (ICML), 2025. URLhttp://arxiv.org/abs/2502.06768. 9

  38. [39]

    Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training

    Jaeyeon Kim, Jonathan Geuter, David Alvarez-Melis, Sham Kakade, and Sitan Chen. Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training.arXiv preprint arXiv:2602.10314, 2026. URLhttp://arxiv.org/abs/2602.10314. 9

  39. [40]

    CDLM: Consistency Diffusion Language Models for Faster Sampling.Conference on Machine Learning and Systems (MLSys), 2026

    Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, and Amir Gholami. CDLM: Consistency Diffusion Language Models for Faster Sampling.Conference on Machine Learning and Systems (MLSys), 2026. URL https: //arxiv.org/abs/2511.19269. 9

  40. [41]

    Similarity of neural network representations revisited.International Conference on Machine Learning (ICML),

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited.International Conference on Machine Learning (ICML),

  41. [42]

    URLhttps://arxiv.org/abs/1905.00414. 22 12

  42. [43]

    IDLM: Inverse-distilled Diffusion Language Models

    David Li, Nikita Gushchin, Dmitry Abulkhanov, Eric Moulines, Ivan Oseledets, Maxim Panov, and Alexander Korotin. IDLM: Inverse-distilled Diffusion Language Models.arXiv preprint arXiv:2602.19066, 2026. URLhttps://arxiv.org/abs/2602.19066. 9, 32

  43. [44]

    Imagefolder: Autoregressive image generation with folded tokens.arXiv preprint arXiv:2410.01756, 2024

    Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregressive image generation with folded tokens.arXiv preprint arXiv:2410.01756, 2024. URLhttps://arxiv.org/abs/2410.01756. 7, 25

  44. [45]

    XQ-GAN: An open-source image tokenization framework for autoregressive generation

    Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Jindong Wang, Zhe Lin, and Bhiksha Raj. XQ-GAN: An open-source image tokenization framework for autoregressive generation. arXiv preprint arXiv:2412.01762, 2024. URL https://arxiv.org/abs/2412.01762. 7, 25

  45. [46]

    Divergence frontiers for generative models: Sample complexity, quantization effects, and frontier integrals.Advances in Neural Information Processing Systems (NeurIPS), 2021

    Lang Liu, Krishna Pillutla, Sean Welleck, Sewoong Oh, Yejin Choi, and Zaid Harchaoui. Divergence frontiers for generative models: Sample complexity, quantization effects, and frontier integrals.Advances in Neural Information Processing Systems (NeurIPS), 2021. URL https://openreview.net/forum?id=Z_J5bCb4Rra. 30

  46. [47]

    Think While You Generate: Discrete Diffusion with Planned Denoising.International Conference on Learning Representations (ICLR), 2025

    Sulin Liu, Juno Nam, Andrew Campbell, Hannes Stärk, Yilun Xu, Tommi Jaakkola, and Rafael Gómez-Bombarelli. Think While You Generate: Discrete Diffusion with Planned Denoising.International Conference on Learning Representations (ICLR), 2025. URL https: //arxiv.org/abs/2410.06264. 9

  47. [48]

    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution.International Conference on Machine Learning (ICML), 2024. URLhttp://arxiv.org/abs/2310.16834. 26

  48. [49]

    Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

    Omer Luxembourg, Haim Permuter, and Eliya Nachmani. Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models.arXiv preprint arXiv:2506.19037, 2025. URL http://arxiv.org/abs/2506.19037. 9

  49. [50]

    4M: Massively Multimodal Masked Modeling.Advances in Neural Information Processing Systems (NeurIPS), 2023

    David Mizrahi, Roman Bachmann, O ˘guzhan Fatih Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, and Amir Zamir. 4M: Massively Multimodal Masked Modeling.Advances in Neural Information Processing Systems (NeurIPS), 2023. URL http://arxiv.org/abs/ 2312.06647. 2

  50. [51]

    Scaling up Masked Diffusion Models on Text.International Conference on Learning Representations (ICLR), 2025

    Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li. Scaling up Masked Diffusion Models on Text.International Conference on Learning Representations (ICLR), 2025. URLhttp://arxiv.org/abs/2410.18514. 25

  51. [52]

    Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

    Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data.International Conference on Learning Representations (ICLR), 2025. URL https: //arxiv.org/abs/2406.03736. 2, 3

  52. [53]

    Jump Your Steps

    Yong-Hyun Park, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, and Yuki Mitsufuji. “Jump Your Steps”: Optimizing Sampling Schedule of Discrete Diffusion Models.International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/ 2410.07761. 9

  53. [54]

    Path Planning for Masked Diffusion Model Sampling.arXiv preprint arXiv:2502.03540, 2025

    Fred Zhangzhi Peng, Zachary Bezemek, Sawan Patel, Jarrid Rector-Brooks, Sherwood Yao, Avishek Joey Bose, Alexander Tong, and Pranam Chatterjee. Path Planning for Masked Diffusion Model Sampling.arXiv preprint arXiv:2502.03540, 2025. URL http://arxiv. org/abs/2502.03540. 2, 9

  54. [55]

    MAUVE: Measuring the gap between neural text and human text using divergence frontiers.Advances in Neural Information Processing Systems (NeurIPS),

    Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. MAUVE: Measuring the gap between neural text and human text using divergence frontiers.Advances in Neural Information Processing Systems (NeurIPS),

  55. [56]

    URLhttps://openreview.net/forum?id=Tqx7nJp7PR. 30, 31

  56. [57]

    Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y . Fu. Parcae: Scaling laws for stable looped language models.arXiv preprint arXiv:2604.12946, 2026. URL https: //arxiv.org/abs/2604.12946. 2, 9, 25 13

  57. [58]

    Generative Frontiers: Why Evaluation Matters for Diffusion Language Models

    Patrick Pynadath, Jiaxin Shi, and Ruqi Zhang. Generative frontiers: Why evaluation matters for diffusion language models.arXiv preprint arXiv:2604.02718, 2026. URL https://arxiv. org/abs/2604.02718. 44

  58. [59]

    Language models are unsupervised multitask learners, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019. URL https://openai.com/ blog/better-language-models/. 26

  59. [60]

    Rotskoff, Molei Tao, and Lexing Ying

    Yinuo Ren, Haoxuan Chen, Yuchen Zhu, Wei Guo, Yongxin Chen, Grant M. Rotskoff, Molei Tao, and Lexing Ying. Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms.Advances in Neural Information Processing Systems (NeurIPS), 2025. URLhttp://arxiv.org/abs/2502.00234. 9

  60. [61]

    Chiu, Alexander Rush, and V olodymyr Kuleshov

    Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and V olodymyr Kuleshov. Simple and Effective Masked Diffusion Language Models.Advances in Neural Information Processing Systems (NeurIPS),

  61. [62]

    arXiv preprint arXiv:2406.07524 , year =

    URLhttp://arxiv.org/abs/2406.07524. 2, 3, 7, 8, 25, 26, 44

  62. [63]

    The Diffusion Duality.International Conference on Machine Learning (ICML), 2025

    Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, and V olodymyr Kuleshov. The Diffusion Duality.International Conference on Machine Learning (ICML), 2025. URLhttp://arxiv.org/abs/2506.10892. 9

  63. [64]

    Improved Techniques for Training GANs

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in Neural Information Processing Systems (NeurIPS), 2016. URLhttps://arxiv.org/abs/1606.03498. 7

  64. [66]

    URLhttp://arxiv.org/abs/2604.02340. 2

  65. [67]

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K. Titsias. Simplified and Generalized Masked Diffusion for Discrete Data.Advances in Neural Information Processing Systems (NeurIPS), 2024. URLhttp://arxiv.org/abs/2406.04329. 2, 3

  66. [68]

    Phenaki: Variable Length Video Generation From Open Domain Textual Description

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable Length Video Generation From Open Domain Textual Description.arXiv preprint arXiv:2210.02399, 2022. URLhttp://arxiv.org/abs/2210.02399. 2

  67. [69]

    Remasking dis- crete diffusion models with inference-time scaling.Advances in Neural Information Processing Systems (NeurIPS), 2025

    Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Remasking dis- crete diffusion models with inference-time scaling.Advances in Neural Information Processing Systems (NeurIPS), 2025. URLhttps://arxiv.org/abs/2503.00307. 2, 30, 31

  68. [70]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7B: Diffusion Large Language Models.arXiv preprint arXiv:2508.15487, 2025. URLhttp://arxiv.org/abs/2508.15487. 9

  69. [71]

    Effective and Efficient Masked Image Generation Models.International Conference on Machine Learning (ICML), 2025

    Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, and Chongxuan Li. Effective and Efficient Masked Image Generation Models.International Conference on Machine Learning (ICML), 2025. URLhttp://arxiv.org/abs/2503.07197. 2

  70. [73]

    URLhttp://arxiv.org/abs/2602.11698. 9

  71. [74]

    Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, and Lu Jiang

    Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, and Lu Jiang. MAGVIT: Masked Generative Video Transformer.Conference on Computer Vision and Pattern Recognition (CVPR), 2023. URL https://openaccess.thecvf.com/content/CVPR2023/papers/ Yu_MAGVIT_Masked_Generative_Video_Tr...

  72. [75]

    Expert-choice routing enables adaptive computation in diffusion language models.arxiv preprint arXiv:2604.01622, 2026

    Shuibai Zhang, Caspian Zhuang, Chihan Cui, Zhihan Yang, Fred Zhangzhi Peng, Yanxin Zhang, Haoyue Bai, Zack Jia, Yang Zhou, Guanhua Chen, and Ming Liu. Expert-choice routing enables adaptive computation in diffusion language models.arxiv preprint arXiv:2604.01622, 2026. URLhttps://arxiv.org/abs/2604.01622. 2

  73. [76]

    Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling.International Conference on Learning Representations (ICLR), 2025

    Kaiwen Zheng, Yongxin Chen, Hanzi Mao, Ming-Yu Liu, Jun Zhu, and Qinsheng Zhang. Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling.International Conference on Learning Representations (ICLR), 2025. URLhttp://arxiv.org/abs/2409.02908. 26, 27, 31

  74. [77]

    Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator.Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

    Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, and Vicky Kalogeiton. Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator.Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. URL http://arxiv.org/abs/ 2503.15457. 9 15 Contents 1 Introduction 1 2 Background 3 2.1 Discrete generative models and masked g...

  75. [78]

    For language modeling, we tune on OWT for 100k steps with sequence length 128 in order to obtain fast and cheap comparisons

    Choice of tuning setup.We first select the target setting for hyperparameter tuning. For language modeling, we tune on OWT for 100k steps with sequence length 128 in order to obtain fast and cheap comparisons

  76. [79]

    We evaluate a small grid of candidate architectures and retain the best-performing one

    Architecture tuning.We tune the number of preprocessing and postprocessing layers in the fixed-point backbone, while keeping the fixed-point solver hyperparameters at the default values of Bai and Melas-Kyriazi [4]. We evaluate a small grid of candidate architectures and retain the best-performing one

  77. [80]

    This isolates the effect of the implicit solver from that of the backbone architecture

    Solver tuning.With the architecture fixed, we tune the fixed-point solver budget, including the number of no-gradient and with-gradient iterations. This isolates the effect of the implicit solver from that of the backbone architecture

  78. [81]

    Learning-rate tuning.With both the architecture and solver settings fixed, we tune the learning rate over a logarithmic grid and select the value that gives the best validation performance

  79. [82]

    Boundary check.Whenever the best hyperparameter lies at the edge of the tested range, we extend the search range and repeat the evaluation until the selected value is not on the boundary

  80. [83]

    24 D.1.2 Learning-rate and solver tuning We tune the main optimization and solver hyperparameters through small-scale experiments before running the full training jobs

    Final selection.Finally, we choose the configuration that performs best under this tuning protocol and use it for the full training runs. 24 D.1.2 Learning-rate and solver tuning We tune the main optimization and solver hyperparameters through small-scale experiments before running the full training jobs. We first test the base learning rate used for MDLM...

Showing first 80 references.