pith. machine review for the scientific record.

arXiv: 2604.14001 · v2 · submitted 2026-04-15 · cs.CL · cs.AI · cs.LG · cs.NE

Diffusion Language Models for Speech Recognition

Pith reviewed 2026-05-10 13:08 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG · cs.NE
keywords diffusion language models · speech recognition · ASR rescoring · joint decoding · CTC · masked diffusion · uniform-state diffusion · automatic speech recognition

The pith

Diffusion language models improve speech recognition accuracy through rescoring and joint decoding with acoustic models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates how masked diffusion language models and uniform-state diffusion models can be applied to rescore automatic speech recognition hypotheses. A new joint-decoding procedure merges framewise probabilities from a CTC acoustic model with labelwise probabilities from the diffusion model at every step to create candidate sequences. The method draws on the bidirectional attention and parallel generation features of diffusion models to add stronger language modeling to the acoustic signal. Experiments indicate that both diffusion variants produce measurable gains in recognized text accuracy. The work supplies code and recipes to support direct use in ASR pipelines.

Core claim

Masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) can be used to rescore ASR hypotheses. In addition, a joint-decoding method integrates framewise CTC probability distributions with labelwise USDM distributions at each decoding step, generating new candidates that combine acoustic information from CTC with language knowledge from the diffusion model. The paper reports that both uses significantly improve the accuracy of the recognized text.

What carries the argument

The joint-decoding method that integrates framewise probability distributions from CTC with labelwise probability distributions from USDM at each decoding step.
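One way to picture this integration is a log-linear combination of two distributions over the same vocabulary, renormalized with a softmax. The sketch below is a minimal illustration under stated assumptions, not the authors' procedure: it assumes the framewise CTC output has already been reduced to one distribution per label position (how the paper does this is not described on this page), and the function name `fuse_log_linear`, the weights, and the toy probabilities are hypothetical.

```python
import math

def fuse_log_linear(ctc_probs, lm_probs, lam_ctc=1.0, lam_lm=1.0, eps=1e-12):
    """Fuse two distributions for one label position.

    Returns softmax(lam_ctc * log p_ctc + lam_lm * log p_lm).
    Assumption: both inputs are indexed by the same vocabulary.
    """
    scores = [
        lam_ctc * math.log(p + eps) + lam_lm * math.log(q + eps)
        for p, q in zip(ctc_probs, lm_probs)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy three-token vocabulary: the acoustic model prefers token 0,
# the language model prefers token 1.
ctc = [0.6, 0.3, 0.1]
lm = [0.2, 0.7, 0.1]
fused = fuse_log_linear(ctc, lm)
print(fused)
```

With equal weights the fused distribution is proportional to the elementwise product of the two inputs, so a token must be plausible under both models to rank highly. Setting `lam_lm=0.0` recovers the acoustic distribution alone.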

If this is right

  • Rescoring ASR hypotheses with either MDLM or USDM produces higher recognition accuracy than standard approaches.
  • The joint method fuses acoustic frame-level and language label-level information at each step to generate stronger candidates.
  • Both masked and uniform-state diffusion variants are effective for this rescoring task.
  • Parallel generation from diffusion models supports efficient improvement of ASR hypotheses.
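The Methodology excerpt in the reference graph further down this page scores each n-best hypothesis as a weighted CTC log-probability plus a weighted diffusion LM score minus a weighted prior term. A minimal sketch of that ranking step follows. The dictionary layout, the function name `rescore_nbest`, the weight values, and the example scores are all hypothetical and chosen only for illustration.

```python
def rescore_nbest(hyps, lam_ctc=1.0, lam_lm=0.5, lam_prior=0.2):
    """Sort n-best hypotheses by a combined score, best first.

    Each hypothesis is a dict with log-domain scores:
      'ctc'   - CTC log-probability of the hypothesis
      'lm'    - diffusion LM score of the hypothesis
      'prior' - prior log-probability used for correction
    """
    def combined(h):
        return lam_ctc * h["ctc"] + lam_lm * h["lm"] - lam_prior * h["prior"]

    return sorted(hyps, key=combined, reverse=True)

# Hypothetical 2-best list: the acoustic model slightly prefers the first
# hypothesis, the language model strongly prefers the second.
nbest = [
    {"text": "i scream", "ctc": -4.0, "lm": -9.0, "prior": -3.0},
    {"text": "ice cream", "ctc": -4.5, "lm": -5.0, "prior": -3.0},
]
best = rescore_nbest(nbest)[0]["text"]
print(best)
```

In practice the three weights would be tuned on a development set; the values here are placeholders.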

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same integration of framewise and labelwise distributions could be tested on other sequence labeling tasks that mix acoustic or visual signals with text.
  • Diffusion LMs might reduce dependence on large left-to-right autoregressive models in production ASR systems.
  • Gains observed here suggest examining whether other diffusion variants or training objectives yield further improvements on noisy or accented speech.
  • The released code makes it straightforward to measure the approach on additional languages and datasets.

Load-bearing premise

Framewise probability distributions from CTC and labelwise distributions from USDM can be directly integrated at each decoding step to produce improved candidates without introducing inconsistencies or requiring extra tuning.

What would settle it

An experiment in which joint-decoding candidates yield no word-error-rate improvement over CTC alone or over separate USDM rescoring, or in which claimed gains appear only after substantial additional hyperparameter search.
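For readers who want to run such a comparison, word error rate is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a generic implementation of that standard definition, not the paper's evaluation script; the function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat", "the bat sat"))
```

Comparing this number for CTC alone, separate rescoring, and joint decoding on the same test set, with the same tuning budget for each, is the kind of check the paragraph above describes.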

Figures

Figures reproduced from arXiv: 2604.14001 by Albert Zeyer, Davyd Naveriani, Hermann Ney, Ralf Schlüter.

Figure 2
Figure 2. MDLM achieves lower PPL at 5 and 10 epochs (e.g. 37.0 vs. 39.4 on dev at 10 epochs), but USDM surpasses it at 25 epochs (34.0 vs. 32.3 on dev). This can be explained by the fact that USDM corrupts tokens with uniform noise rather than explicit mask tokens, making the task inherently harder since the model must evaluate every position. Both models show improvement with longer training.
Figure 3
Figure 3. WER [%] on dev-other comparing MDLM rescoring (sequence-level, global-mask and sample-mask score normalization), MDLM (5 ep) coupled scoring, and USDM (5 ep) rescoring, across different numbers of Monte Carlo samples (K). Stars mark the best WER for each method.
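The Figure 2 text above notes that USDM corrupts tokens with uniform noise rather than explicit mask tokens. A minimal sketch of the two forward corruption steps follows. It assumes a single keep-probability `alpha_t` per step, applied independently to each token; the function names, the `[MASK]` string, and the toy vocabulary are hypothetical and not taken from the paper's code.

```python
import random

def mask_corrupt(tokens, alpha_t, mask_token="[MASK]", rng=random):
    """Masked diffusion step: keep each token with probability alpha_t,
    otherwise replace it with the mask token."""
    return [tok if rng.random() < alpha_t else mask_token for tok in tokens]

def uniform_corrupt(tokens, alpha_t, vocab, rng=random):
    """Uniform-state step: keep each token with probability alpha_t,
    otherwise replace it with a token drawn uniformly from the vocabulary."""
    return [tok if rng.random() < alpha_t else rng.choice(vocab) for tok in tokens]

sentence = ["the", "cat", "sat", "down"]
vocab = ["the", "cat", "sat", "down", "dog", "ran"]
rng = random.Random(0)
print(mask_corrupt(sentence, alpha_t=0.5, rng=rng))
print(uniform_corrupt(sentence, alpha_t=0.5, vocab=vocab, rng=rng))
```

In the masked case the corrupted positions are visible to the model; in the uniform case any position may have been replaced, which matches the remark that the model must evaluate every position.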
read the original abstract

Diffusion language models have recently emerged as a leading alternative to standard language models, due to their ability for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. We publish all our code and recipes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper explores the application of diffusion language models to speech recognition, focusing on masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) for rescoring ASR hypotheses. It provides guidance on their incorporation and introduces a joint-decoding procedure that integrates framewise probability distributions from CTC with labelwise distributions from USDM at each step to produce new candidate transcriptions. The central empirical claim is that USDM and MDLM yield significant accuracy improvements over baselines, with all code and recipes released for reproducibility.

Significance. If the reported accuracy gains are robust and the joint-decoding procedure is shown to be well-calibrated, the work could meaningfully extend diffusion models beyond text generation into hybrid acoustic-language modeling for ASR. The explicit release of code and recipes is a clear strength, directly supporting reproducibility and enabling the community to test the integration approach on additional datasets.

major comments (1)
  1. [Joint-decoding method] Joint-decoding method (as described following the introduction of USDM): the procedure combines CTC framewise acoustic probabilities with USDM labelwise language probabilities at each decoding step, but the manuscript provides no explicit alignment, marginalization over frames, or joint normalization to reconcile the time-aligned CTC outputs with the sequence-oriented USDM outputs. This risks probability mass mismatches or invalid paths, directly affecting whether the generated candidates are improved in a principled way. An ablation or derivation showing calibration and that gains survive removal of any implicit tuning is needed to support the accuracy claim.
minor comments (1)
  1. [Abstract] The abstract states that USDM and MDLM 'significantly improve the accuracy of recognized text' without any numerical results, baselines, error bars, or dataset identifiers; while the full experimental section presumably contains these, a brief quantitative summary in the abstract would strengthen the presentation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address the major concern regarding the joint-decoding method in detail below and have made revisions to improve the clarity and rigor of our presentation.

read point-by-point responses
  1. Referee: [Joint-decoding method] Joint-decoding method (as described following the introduction of USDM): the procedure combines CTC framewise acoustic probabilities with USDM labelwise language probabilities at each decoding step, but the manuscript provides no explicit alignment, marginalization over frames, or joint normalization to reconcile the time-aligned CTC outputs with the sequence-oriented USDM outputs. This risks probability mass mismatches or invalid paths, directly affecting whether the generated candidates are improved in a principled way. An ablation or derivation showing calibration and that gains survive removal of any implicit tuning is needed to support the accuracy claim.

    Authors: We thank the referee for this valuable feedback. The joint-decoding procedure is intended to leverage the strengths of both models by combining their probability distributions at each step of the decoding process. To address the lack of explicit details, we have revised the manuscript to include a formal derivation of the joint probability computation. Specifically, we describe how CTC framewise probabilities are marginalized over the frames aligned to each label using dynamic programming similar to CTC decoding itself, and how joint normalization is achieved by computing the combined log-probability and applying softmax. This ensures no invalid paths are considered as only valid alignments are used. Additionally, we have performed an ablation where the joint decoding is run without any additional tuning parameters beyond those in the original models, and the reported accuracy improvements hold, confirming the robustness of the approach. These updates are incorporated in the revised Section 3 and new experimental results in Section 4. revision: yes
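The marginalization over frame alignments mentioned in this response can be illustrated with the standard CTC forward recursion, which sums the probability of every frame-level path that collapses to a given label sequence. This is a sketch of the textbook algorithm, not the authors' revised derivation; the function names and the toy inputs are hypothetical, and the blank symbol is assumed to have index 0.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    top = max(xs)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(x - top) for x in xs))

def ctc_log_prob(log_probs, labels, blank=0):
    """log P(labels | x), summed over all valid CTC alignments.

    log_probs: one list of vocabulary log-probabilities per frame.
    labels: target label sequence without blanks.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank].
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            terms = [alpha[s]]                 # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])     # advance by one symbol
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new
    if size == 1:
        return alpha[0]
    return logsumexp(alpha[-1], alpha[-2])

# Two frames, vocabulary {0: blank, 1: "a"}, uniform frame posteriors.
frames = [[math.log(0.5), math.log(0.5)]] * 2
print(math.exp(ctc_log_prob(frames, [1])))
```

Sequences that cannot be aligned to the available frames receive log-probability negative infinity, which is one way invalid paths are excluded.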

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

full rationale

The paper describes an empirical application of masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) for ASR rescoring, plus a joint CTC-USDM decoding procedure that integrates framewise CTC distributions with labelwise USDM distributions. No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear; the central claims rest on experimental accuracy improvements rather than quantities defined in terms of themselves. The integration method is presented as an algorithmic design choice without equations that reduce to prior fitted values or self-citations. The work is self-contained against external benchmarks via published code and recipes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work applies existing diffusion language modeling techniques to ASR without introducing new mathematical axioms, free parameters, or invented physical entities; all modeling assumptions are inherited from prior diffusion LM and CTC literature.

pith-pipeline@v0.9.0 · 5443 in / 1138 out tokens · 37407 ms · 2026-05-10T13:08:29.340325+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    However, applying traditional autoregressive LMs in joint decoding inherently limits the speed due to their strictly left-to-right decoding structure

    Introduction. Autoregressive language models (LMs) are commonly used to improve automatic speech recognition (ASR) systems due to their strong linguistic capabilities and the ability to incorporate external textual knowledge [1–3]. However, applying traditional autoregressive LMs in joint decoding inherently limits the speed due to their strictly left-...

  2. [2]

    Diffusion Language Models for Speech Recognition

    Diffusion Language Models. Masked diffusion language model. MDLM corrupts text by randomly masking tokens and learns to reconstruct the sequence during the reverse generative pass. During the forward process, tokens are independently masked based on a monotonically decreasing noise schedule $\alpha_t \in [0, 1]$. Essentially, $\alpha_t$ represents the probability of a token re...

  3. [3]

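    Joint-decoding distribution (left-hand side and index subscripts not recoverable from the extraction): $\mathrm{softmax}\big(\lambda_{\text{CTC}} \log P_{\text{CTC}}(v \mid x) + \lambda_{\text{DiffLM}} \log P(v \mid z)\big)$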

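    Methodology. 3.1. Rescoring. We rescore n-best CTC hypotheses $\tilde{a}_1^{\tilde{S}} = (\tilde{a}_1, \dots, \tilde{a}_{\tilde{S}})$ by combining the CTC log-probability with a diffusion language model (DiffLM) score and a prior correction term: $S(\tilde{a}_1^{\tilde{S}}) = \lambda_{\text{CTC}} \log P_{\text{CTC}}(\tilde{a}_1^{\tilde{S}} \mid x_1^T) + \lambda_{\text{DiffLM}} S_{\text{DiffLM}}(\tilde{a}_1^{\tilde{S}}) - \lambda_{\text{prior}} \log P_{\text{prior}}(\tilde{a}_1^{\tilde{S}})$ (3), where $x_1^T$ denotes the sequence of $T$ acoustic feature...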

  4. [4]

    Experimental Setup. We trained MDLM and USDM on a combined corpus of normalized LibriSpeech LM data and train-other transcriptions [29]

    Experiments. 4.1. Experimental Setup. We trained MDLM and USDM on a combined corpus of normalized LibriSpeech LM data and train-other transcriptions [29]. For our experiments, we leveraged the training frameworks from [6, 7]. Models were trained for 5, 10 and 25 epochs using AdamW (0.1 weight decay) [30], a piecewise linear LR scheduler, and a 20,000 ...

  5. [5]

    Text was tokenized via SentencePiece into 10,240 subwords [33]

    and a 1024-dimensional hidden state. Text was tokenized via SentencePiece into 10,240 subwords [33]. 4.2. Results. Language model training. Table 1 shows the perplexity upper bounds for USDM and MDLM trained with the same configuration. MDLM achieves lower PPL at 5 and 10 epochs (see Figure 2, e.g. 37.0 vs. 39.4 on dev at 10 epochs), but USDM surpasses ...

  6. [6]

    Conclusions. In this work, we systematically explored the integration of discrete diffusion language models into ASR systems. While traditional autoregressive models are constrained by a strictly sequential, left-to-right decoding structure, diffusion LMs leverage bidirectional context and parallel generation, offering a more flexible and theoretic...

  7. [7]

    Clusters4Future

    Acknowledgements. This work was partially supported by NeuroSys, which as part of the initiative “Clusters4Future” is funded by the Federal Ministry of Education and Research BMBF (funding IDs 03ZU2106DA and 03ZU2106DD), and by the project RESCALE within the program AI Lighthouse Projects for the Environment, Climate, Nature and Resources funded by the Fed...

  8. [8]

    Generative AI Use Disclosure. We use LLMs to improve the formulations and grammar of the paper

  9. [9]

    Jelinek, Statistical Methods for Speech Recognition

    F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998

  10. [10]

    Language Modeling with Deep Transformers,

    K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language Modeling with Deep Transformers,” in Interspeech, Graz, Austria, Sep. 2019, pp. 3905–3909, ISCA Best Student Paper Award. [Online]. Available: http://arxiv.org/pdf/1905.04226.pdf

  11. [11]

    End-to-End Speech Recognition: A Survey,

    R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, and S. Watanabe, “End-to-End Speech Recognition: A Survey,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 325–351, 2023

  12. [12]

    Non-Autoregressive Text Generation with Pre-trained Language Models,

    Y. Su, D. Cai, Y. Wang, D. Vandyke, S. Baker, P. Li, and N. Collier, “Non-Autoregressive Text Generation with Pre-trained Language Models,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 234–243

  13. [13]

    Large Language Diffusion Models,

    S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J.-R. Wen, and C. Li, “Large Language Diffusion Models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  14. [14]

    Simple and Effective Masked Diffusion Language Models,

    S. S. Sahoo, M. Arriola, A. Gokaslan, E. M. Marroquin, A. M. Rush, Y. Schiff, J. T. Chiu, and V. Kuleshov, “Simple and Effective Masked Diffusion Language Models,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  15. [15]

    The Diffusion Duality,

    S. S. Sahoo, J. Deschenaux, A. Gokaslan, G. Wang, J. T. Chiu, and V. Kuleshov, “The Diffusion Duality,” in Forty-second International Conference on Machine Learning, 2025

  16. [16]

    A Survey on Diffusion Language Models,

    T. Li, M. Chen, B. Guo, and Z. Shen, “A Survey on Diffusion Language Models,” arXiv preprint arXiv:2508.10875, 2025

  17. [17]

    Diffusion Language Models are Super Data Learners,

    J. Ni, Q. Liu, L. Dou, C. Du, Z. Wang, H. Yan, T. Pang, and M. Q. Shieh, “Diffusion Language Models are Super Data Learners,” arXiv preprint arXiv:2511.03276, 2025

  18. [18]

    Mercury: Ultra-Fast Language Models Based on Diffusion

    S. Khanna, S. Kharbanda, S. Li, H. Varma, E. Wang, S. Birnbaum, Z. Luo, Y. Miraoui, A. Palrecha, S. Ermon, A. Grover, and V. Kuleshov, “Mercury: Ultra-Fast Language Models Based on Diffusion,” arXiv preprint arXiv:2506.17298, 2025

  19. [19]

    Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing,

    M. Wang, Z. Liu, Z. Jin, G. Sun, C. Zhang, and P. C. Woodland, “Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing,” arXiv preprint arXiv:2509.16622, 2025

  20. [20]

    Whisfusion: Parallel ASR Decoding via a Diffusion Transformer,

    T. Kwon, J. Ahn, T. Yun, H. Jwa, Y. Choi, S. Park, N.-J. Kim, J. Kim, H. G. Ryu, and H.-J. Lee, “Whisfusion: Parallel ASR Decoding via a Diffusion Transformer,” arXiv preprint arXiv:2508.07048, 2025

  21. [21]

    dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition,

    W. Tian, B. Mu, G. Ma, X. Geng, Z. Zhao, and L. Xie, “dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition,” arXiv preprint arXiv:2601.17902, 2026

  22. [22]

    A Comparative Study on Non-autoregressive Modelings for Speech-to-Text Generation,

    Y. Higuchi, N. Chen, Y. Fujita, H. Inaguma, T. Komatsu, J. Lee, J. Nozaki, T. Wang, and S. Watanabe, “A Comparative Study on Non-autoregressive Modelings for Speech-to-Text Generation,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 47–54

  23. [23]

    Improving Non-Autoregressive End-to-End Speech Recognition with Pre-trained Acoustic and Language Models,

    K. Deng, Z. Yang, S. Watanabe, Y. Higuchi, G. Cheng, and P. Zhang, “Improving Non-Autoregressive End-to-End Speech Recognition with Pre-trained Acoustic and Language Models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8522–8526

  24. [24]

    Drax: Speech Recognition with Discrete Flow Matching,

    A. Navon, A. Shamsian, N. Glazer, Y. Segal-Feldman, G. Hetz, J. Keshet, and E. Fetaya, “Drax: Speech Recognition with Discrete Flow Matching,” arXiv preprint arXiv:2510.04162, 2025

  25. [25]

    On Using Monolingual Corpora in Neural Machine Translation,

    C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On Using Monolingual Corpora in Neural Machine Translation,” Computer Speech & Language, vol. 45, pp. 137–148, 2015

  26. [26]

    A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition,

    S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, “A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition,” in IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 369–375

  27. [27]

    Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,

    A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Twenty-third International Conference on Machine Learning, 2006, pp. 369–376

  28. [28]

    Structured Denoising Diffusion Models in Discrete State-Spaces,

    J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Structured Denoising Diffusion Models in Discrete State-Spaces,” in The Thirty-fifth Annual Conference on Neural Information Processing Systems, 2021

  29. [29]

    The Diffusion Duality, Chapter II: Ψ-Samplers and Efficient Curriculum,

    J. Deschenaux, C. Gulcehre, and S. S. Sahoo, “The Diffusion Duality, Chapter II: Ψ-Samplers and Efficient Curriculum,” in The Fourteenth International Conference on Learning Representations, 2026

  30. [30]

    Scaling Behavior of Discrete Diffusion Language Models,

    D. von Rütte, A. Orvieto, J. Fluri, O. Pooladzandi, B. Schölkopf, and T. Hofmann, “Scaling Behavior of Discrete Diffusion Language Models,” in The Fourteenth International Conference on Learning Representations, 2026

  31. [31]

    Scaling Beyond Masked Diffusion Language Models,

    S. S. Sahoo, J.-M. Lemercier, Z. Yang, J. Deschenaux, J. Liu, J. Thickstun, and A. Jukic, “Scaling Beyond Masked Diffusion Language Models,” arXiv preprint arXiv:2602.15014, 2026

  32. [32]

    Librispeech Transducer Model with Internal Language Model Prior Correction,

    A. Zeyer, A. Merboldt, W. Michel, R. Schlüter, and H. Ney, “Librispeech Transducer Model with Internal Language Model Prior Correction,” in Interspeech, 2021, pp. 2052–2056

  33. [33]

    Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition,

    Z. Meng, S. Parthasarathy, E. Sun, Y. Gaur, N. Kanda, L. Lu, X. Chen, R. Zhao, J. Li, and Y. Gong, “Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition,” in IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 243–250

  34. [34]

    On Density Estimation with Diffusion Models,

    D. P. Kingma, T. Salimans, B. Poole, and J. Ho, “On Density Estimation with Diffusion Models,” in The Thirty-fifth Annual Conference on Neural Information Processing Systems, 2021

  35. [35]

    DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation,

    S. Gong, R. Zhang, H. Zheng, J. Gu, N. Jaitly, L. Kong, and Y. Zhang, “DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation,” in The Fourteenth International Conference on Learning Representations, 2026

  36. [36]

    A Continuous Time Framework for Discrete Denoising Models,

    A. Campbell, J. Benton, V. D. Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet, “A Continuous Time Framework for Discrete Denoising Models,” in The Thirty-sixth Annual Conference on Neural Information Processing Systems, 2022

  37. [37]

    Librispeech: An ASR Corpus Based on Public Domain Audio Books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

  38. [38]

    Decoupled Weight Decay Regularization,

    I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” in The Seventh International Conference on Learning Representations, 2019

  39. [39]

    Scalable Diffusion Models with Transformers,

    W. Peebles and S. Xie, “Scalable Diffusion Models with Transformers,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4172–4182

  40. [40]

    Dropout: A Simple Way to Prevent Neural Networks from Overfitting,

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014

  41. [41]

    SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing,

    T. Kudo and J. Richardson, “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71

  42. [42]

    Reproducing and dissecting denoising language models for speech recognition,

    D. Koch, A. Zeyer, N. Rossenbach, R. Schlüter, and H. Ney, “Reproducing and Dissecting Denoising Language Models for Speech Recognition,” arXiv preprint arXiv:2512.13576, 2025

  43. [43]

    Denoising LM: Pushing the limits of error correction models for speech recognition,

    Z. Gu, T. Likhomanenko, H. Bai, E. McDermott, R. Collobert, and N. Jaitly, “Revisiting ASR Error Correction with Specialized Models,” arXiv preprint arXiv:2405.15216, 2026

  44. [44]

    Diffusion Beats Autoregressive in Data-Constrained Settings,

    M. Prabhudesai, M. Wu, A. Zadeh, K. Fragkiadaki, and D. Pathak, “Diffusion Beats Autoregressive in Data-Constrained Settings,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025