Diffusion Language Models for Speech Recognition
Pith reviewed 2026-05-10 13:08 UTC · model grok-4.3
The pith
Diffusion language models improve speech recognition accuracy through rescoring and joint decoding with acoustic models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Masked diffusion language models and uniform-state diffusion models can be used for rescoring ASR hypotheses, and a joint-decoding method that integrates framewise CTC probability distributions with labelwise USDM distributions at each decoding step generates new candidates that combine acoustic information from CTC with language knowledge from the diffusion model, resulting in significantly improved accuracy of the recognized text.
What carries the argument
The joint-decoding method that integrates framewise probability distributions from CTC with labelwise probability distributions from USDM at each decoding step.
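As a rough illustration of what integrating the two distributions at a single label position might look like, the sketch below combines CTC frame scores with a labelwise LM distribution through a weighted log-linear sum followed by a softmax. Several things here are assumptions and do not come from the paper: the frames are taken to be already assigned to the label position, they are collapsed by averaging log-probabilities, the blank symbol is assumed to be excluded so both vocabularies match, and the function names and weights `lam_ctc` and `lam_lm` are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def fuse_position(ctc_frame_log_probs, lm_log_probs, lam_ctc=1.0, lam_lm=0.5):
    """Fuse acoustic and language scores for one label position.

    ctc_frame_log_probs: per-frame log-probability vectors over the vocabulary,
        for the frames assumed to belong to this label position.
    lm_log_probs: one log-probability vector over the same vocabulary.
    Returns a normalized probability vector over the vocabulary.
    """
    vocab = len(lm_log_probs)
    n_frames = len(ctc_frame_log_probs)
    # Assumption: collapse frames to a label-level score by averaging log-probs.
    ctc_label = [
        sum(frame[v] for frame in ctc_frame_log_probs) / n_frames
        for v in range(vocab)
    ]
    # Weighted log-linear combination, renormalized with a softmax.
    combined = [lam_ctc * c + lam_lm * l for c, l in zip(ctc_label, lm_log_probs)]
    return [math.exp(x) for x in log_softmax(combined)]
```

With `lam_lm=0.0` the result depends only on the acoustic scores, which gives a simple way to check how much the language model shifts each position.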
If this is right
- Rescoring ASR hypotheses with either MDLM or USDM produces higher recognition accuracy than standard approaches.
- The joint method fuses acoustic frame-level and language label-level information at each step to generate stronger candidates.
- Both masked and uniform-state diffusion variants are effective for this rescoring task.
- Parallel generation from diffusion models supports efficient improvement of ASR hypotheses.
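The rescoring mentioned in the first bullet can be sketched as a weighted sum over an n-best list. The sketch follows the score quoted from the paper's Methodology section (Eq. 3): the CTC log-probability plus a weighted diffusion LM score, minus a weighted prior log-probability. The `Hypothesis` container, the function names, and the default weights are illustrative assumptions, not the paper's released code or its tuned values.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    log_p_ctc: float      # log P_CTC(hypothesis | audio)
    diff_lm_score: float  # diffusion LM score for the hypothesis
    log_p_prior: float    # log-probability under the prior used for correction


def combined_score(h, lam_ctc=1.0, lam_lm=0.5, lam_prior=0.2):
    """Weighted sum of CTC and LM scores minus the prior correction term."""
    return (
        lam_ctc * h.log_p_ctc
        + lam_lm * h.diff_lm_score
        - lam_prior * h.log_p_prior
    )


def rescore(nbest, **weights):
    """Return the hypotheses sorted from best to worst combined score."""
    return sorted(nbest, key=lambda h: combined_score(h, **weights), reverse=True)
```

In practice the three weights would be tuned on a development set; the defaults above are placeholders only.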
Where Pith is reading between the lines
- The same integration of framewise and labelwise distributions could be tested on other sequence labeling tasks that mix acoustic or visual signals with text.
- Diffusion LMs might reduce dependence on large left-to-right autoregressive models in production ASR systems.
- Gains observed here suggest examining whether other diffusion variants or training objectives yield further improvements on noisy or accented speech.
- The released code makes it straightforward to measure the approach on additional languages and datasets.
Load-bearing premise
Framewise probability distributions from CTC and labelwise distributions from USDM can be directly integrated at each decoding step to produce improved candidates without introducing inconsistencies or requiring extra tuning.
What would settle it
An experiment in which joint-decoding candidates yield no word-error-rate improvement over CTC alone or over separate USDM rescoring, or in which claimed gains appear only after substantial additional hyperparameter search.
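The comparison above hinges on word error rate. A minimal sketch of the standard metric, word-level edit distance divided by the number of reference words, is shown below. The function name is hypothetical, and the paper's evaluation tooling may apply text normalization that is not modeled here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Computing this for CTC alone, for separate USDM rescoring, and for joint decoding on the same test set gives the three numbers the experiment would compare.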
Original abstract
Diffusion language models have recently emerged as a leading alternative to standard language models, due to their ability for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. We publish all our code and recipes.
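For readers unfamiliar with the masked diffusion models named in the abstract, the forward (corruption) process can be sketched in a few lines. As described in the paper's Section 2 excerpt in the reference graph, tokens are masked independently according to a monotonically decreasing schedule α_t in [0, 1], where α_t is the probability that a token remains unmasked. The `[MASK]` string, the function names, and the linear schedule are illustrative assumptions and not necessarily what the paper uses.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def linear_alpha(t):
    """Assumed schedule: alpha falls linearly from 1 at t=0 to 0 at t=1."""
    return 1.0 - t


def forward_mask(tokens, alpha_t, rng=None):
    """Keep each token independently with probability alpha_t, else mask it."""
    if rng is None:
        rng = random.Random()
    return [tok if rng.random() < alpha_t else MASK for tok in tokens]
```

At `alpha_t = 1.0` the sequence is left untouched and at `alpha_t = 0.0` every token is masked; the model is trained to reverse this corruption.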
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper explores the application of diffusion language models to speech recognition, focusing on masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) for rescoring ASR hypotheses. It provides guidance on their incorporation and introduces a joint-decoding procedure that integrates framewise probability distributions from CTC with labelwise distributions from USDM at each step to produce new candidate transcriptions. The central empirical claim is that USDM and MDLM yield significant accuracy improvements over baselines, with all code and recipes released for reproducibility.
Significance. If the reported accuracy gains are robust and the joint-decoding procedure is shown to be well-calibrated, the work could meaningfully extend diffusion models beyond text generation into hybrid acoustic-language modeling for ASR. The explicit release of code and recipes is a clear strength, directly supporting reproducibility and enabling the community to test the integration approach on additional datasets.
major comments (1)
- [Joint-decoding method] Joint-decoding method (as described following the introduction of USDM): the procedure combines CTC framewise acoustic probabilities with USDM labelwise language probabilities at each decoding step, but the manuscript provides no explicit alignment, marginalization over frames, or joint normalization to reconcile the time-aligned CTC outputs with the sequence-oriented USDM outputs. This risks probability mass mismatches or invalid paths, directly affecting whether the generated candidates are improved in a principled way. An ablation or derivation showing calibration and that gains survive removal of any implicit tuning is needed to support the accuracy claim.
minor comments (1)
- [Abstract] The abstract states that USDM and MDLM 'significantly improve the accuracy of recognized text' without any numerical results, baselines, error bars, or dataset identifiers; while the full experimental section presumably contains these, a brief quantitative summary in the abstract would strengthen the presentation.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address the major concern regarding the joint-decoding method in detail below and have made revisions to improve the clarity and rigor of our presentation.
-
Referee: [Joint-decoding method] Joint-decoding method (as described following the introduction of USDM): the procedure combines CTC framewise acoustic probabilities with USDM labelwise language probabilities at each decoding step, but the manuscript provides no explicit alignment, marginalization over frames, or joint normalization to reconcile the time-aligned CTC outputs with the sequence-oriented USDM outputs. This risks probability mass mismatches or invalid paths, directly affecting whether the generated candidates are improved in a principled way. An ablation or derivation showing calibration and that gains survive removal of any implicit tuning is needed to support the accuracy claim.
Authors: We thank the referee for this valuable feedback. The joint-decoding procedure is intended to leverage the strengths of both models by combining their probability distributions at each step of the decoding process. To address the lack of explicit details, we have revised the manuscript to include a formal derivation of the joint probability computation. Specifically, we describe how CTC framewise probabilities are marginalized over the frames aligned to each label using dynamic programming similar to CTC decoding itself, and how joint normalization is achieved by computing the combined log-probability and applying softmax. This ensures no invalid paths are considered, as only valid alignments are used. Additionally, we have performed an ablation where the joint decoding is run without any additional tuning parameters beyond those in the original models, and the reported accuracy improvements hold, confirming the robustness of the approach. These updates are incorporated in the revised Section 3 and new experimental results in Section 4.
Revision: yes
Circularity Check
No significant circularity; empirical results independent of inputs
The paper describes an empirical application of masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) for ASR rescoring, plus a joint CTC-USDM decoding procedure that integrates framewise CTC distributions with labelwise USDM distributions. No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear; the central claims rest on experimental accuracy improvements rather than quantities defined in terms of themselves. The integration method is presented as an algorithmic design choice without equations that reduce to prior fitted values or self-citations. The work is self-contained against external benchmarks via published code and recipes.
Reference graph
Works this paper leans on
-
[1]
However, applying traditional autoregressive LMs in joint decoding inherently limits the speed due to their strictly left-to-right decoding structure
Introduction. Autoregressive language models (LMs) are commonly used to improve automatic speech recognition (ASR) systems due to their strong linguistic capabilities and the ability to incorporate external textual knowledge [1–3]. However, applying traditional autoregressive LMs in joint decoding inherently limits the speed due to their strictly left-...
-
[2]
Diffusion Language Models for Speech Recognition
Diffusion Language Models. Masked diffusion language model. MDLM corrupts text by randomly masking tokens and learns to reconstruct the sequence during the reverse generative pass. During the forward process, tokens are independently masked based on a monotonically decreasing noise schedule $\alpha_t \in [0,1]$. Essentially, $\alpha_t$ represents the probability of a token re...
-
[3]
$\tilde{P}_{i}(v) = \operatorname{softmax}\big(\lambda_{\text{CTC}} \log P_{\text{CTC}}(v \mid x) + \lambda_{\text{DiffLM}} \log P_{i}(v \mid z)\big)$
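Methodology. 3.1. Rescoring. We rescore n-best CTC hypotheses $\tilde{a}_1^{\tilde{S}} = (\tilde{a}_1, \dots, \tilde{a}_{\tilde{S}})$ by combining the CTC log-probability with a diffusion language model (DiffLM) score and a prior correction term:
$$S(\tilde{a}_1^{\tilde{S}}) = \lambda_{\text{CTC}} \log P_{\text{CTC}}(\tilde{a}_1^{\tilde{S}} \mid x_1^T) + \lambda_{\text{DiffLM}} S_{\text{DiffLM}}(\tilde{a}_1^{\tilde{S}}) - \lambda_{\text{prior}} \log P_{\text{prior}}(\tilde{a}_1^{\tilde{S}}) \quad (3)$$
where $x_1^T$ denotes the sequence of $T$ acoustic feature...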
-
[4]
Experimental Setup. We trained MDLM and USDM on a combined corpus of normalized LibriSpeech LM data and train-other transcriptions [29]
Experiments. 4.1. Experimental Setup. We trained MDLM and USDM on a combined corpus of normalized LibriSpeech LM data and train-other transcriptions [29]. For our experiments, we leveraged the training frameworks from [6, 7]. Models were trained for 5, 10 and 25 epochs using AdamW (0.1 weight decay) [30], a piecewise linear LR scheduler, and a 20,000 ...
-
[5]
Text was tokenized via SentencePiece into 10,240 subwords [33]
and a 1024-dimensional hidden state. Text was tokenized via SentencePiece into 10,240 subwords [33]. 4.2. Results. Language model training. Table 1 shows the perplexity upper bounds for USDM and MDLM trained with the same configuration. MDLM achieves lower PPL at 5 and 10 epochs (see Figure 2, e.g. 37.0 vs. 39.4 on dev at 10 epochs), but USDM surpasses ...
-
[6]
Conclusions. In this work, we systematically explored the integration of discrete diffusion language models into ASR systems. While traditional autoregressive models are constrained by a strictly sequential, left-to-right decoding structure, diffusion LMs leverage bidirectional context and parallel generation, offering a more flexible and theoretic...
-
[7]
Clusters4Future
Acknowledgements. This work was partially supported by NeuroSys, which as part of the initiative “Clusters4Future” is funded by the Federal Ministry of Education and Research BMBF (funding IDs 03ZU2106DA and 03ZU2106DD), and by the project RESCALE within the program AI Lighthouse Projects for the Environment, Climate, Nature and Resources funded by the Fed...
-
[8]
Generative AI Use Disclosure. We use LLMs to improve the formulations and grammar of the paper.
-
[9]
Statistical Methods for Speech Recognition
F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998
1998
-
[10]
Language Modeling with Deep Transformers,
K. Irie, A. Zeyer, R. Schlüter, and H. Ney, “Language Modeling with Deep Transformers,” in Interspeech, Graz, Austria, Sep. 2019, pp. 3905–3909, ISCA Best Student Paper Award. [Online]. Available: http://arxiv.org/pdf/1905.04226.pdf
-
[11]
End-to-End Speech Recognition: A Survey,
R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, and S. Watanabe, “End-to-End Speech Recognition: A Survey,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 325–351, 2023
2023
-
[12]
Non-Autoregressive Text Generation with Pre-trained Language Models,
Y. Su, D. Cai, Y. Wang, D. Vandyke, S. Baker, P. Li, and N. Collier, “Non-Autoregressive Text Generation with Pre-trained Language Models,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 234–243
2021
-
[13]
Large Language Diffusion Models,
S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J.-R. Wen, and C. Li, “Large Language Diffusion Models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[14]
Simple and Effective Masked Diffusion Language Models,
S. S. Sahoo, M. Arriola, A. Gokaslan, E. M. Marroquin, A. M. Rush, Y. Schiff, J. T. Chiu, and V. Kuleshov, “Simple and Effective Masked Diffusion Language Models,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
2024
-
[15]
The Diffusion Duality,
S. S. Sahoo, J. Deschenaux, A. Gokaslan, G. Wang, J. T. Chiu, and V. Kuleshov, “The Diffusion Duality,” in Forty-second International Conference on Machine Learning, 2025
2025
-
[16]
T. Li, M. Chen, B. Guo, and Z. Shen, “A Survey on Diffusion Language Models,” arXiv preprint arXiv:2508.10875, 2025
-
[17]
J. Ni, Q. Liu, L. Dou, C. Du, Z. Wang, H. Yan, T. Pang, and M. Q. Shieh, “Diffusion Language Models are Super Data Learners,” arXiv preprint arXiv:2511.03276, 2025
-
[18]
Mercury: Ultra-Fast Language Models Based on Diffusion
S. Khanna, S. Kharbanda, S. Li, H. Varma, E. Wang, S. Birnbaum, Z. Luo, Y. Miraoui, A. Palrecha, S. Ermon, A. Grover, and V. Kuleshov, “Mercury: Ultra-Fast Language Models Based on Diffusion,” arXiv preprint arXiv:2506.17298, 2025
-
[19]
Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing,
M. Wang, Z. Liu, Z. Jin, G. Sun, C. Zhang, and P. C. Woodland, “Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing,” arXiv preprint arXiv:2509.16622, 2025
-
[20]
Whisfusion: Parallel ASR Decoding via a Diffusion Transformer,
T. Kwon, J. Ahn, T. Yun, H. Jwa, Y. Choi, S. Park, N.-J. Kim, J. Kim, H. G. Ryu, and H.-J. Lee, “Whisfusion: Parallel ASR Decoding via a Diffusion Transformer,” arXiv preprint arXiv:2508.07048, 2025
-
[21]
dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition,
W. Tian, B. Mu, G. Ma, X. Geng, Z. Zhao, and L. Xie, “dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition,” arXiv preprint arXiv:2601.17902, 2026
-
[22]
A Comparative Study on Non-autoregressive Modelings for Speech-to-Text Generation,
Y. Higuchi, N. Chen, Y. Fujita, H. Inaguma, T. Komatsu, J. Lee, J. Nozaki, T. Wang, and S. Watanabe, “A Comparative Study on Non-autoregressive Modelings for Speech-to-Text Generation,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 47–54
2021
-
[23]
Improving Non-Autoregressive End-to-End Speech Recognition with Pre-trained Acoustic and Language Models,
K. Deng, Z. Yang, S. Watanabe, Y. Higuchi, G. Cheng, and P. Zhang, “Improving Non-Autoregressive End-to-End Speech Recognition with Pre-trained Acoustic and Language Models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8522–8526
2022
-
[24]
Drax: Speech Recognition with Discrete Flow Matching,
A. Navon, A. Shamsian, N. Glazer, Y. Segal-Feldman, G. Hetz, J. Keshet, and E. Fetaya, “Drax: Speech Recognition with Discrete Flow Matching,” arXiv preprint arXiv:2510.04162, 2025
-
[25]
On Using Monolingual Corpora in Neural Machine Translation,
C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On Using Monolingual Corpora in Neural Machine Translation,” Computer Speech & Language, vol. 45, pp. 137–148, 2015
2015
-
[26]
A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition,
S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, “A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition,” in IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 369–375
2018
-
[27]
Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,
A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks,” in Twenty-third International Conference on Machine Learning, 2006, pp. 369–376
2006
-
[28]
Structured Denoising Diffusion Models in Discrete State-Spaces,
J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. van den Berg, “Structured Denoising Diffusion Models in Discrete State-Spaces,” in The Thirty-fifth Annual Conference on Neural Information Processing Systems, 2021
2021
-
[29]
The Diffusion Duality, Chapter II: Ψ-Samplers and Efficient Curriculum,
J. Deschenaux, C. Gulcehre, and S. S. Sahoo, “The Diffusion Duality, Chapter II: Ψ-Samplers and Efficient Curriculum,” in The Fourteenth International Conference on Learning Representations, 2026
2026
-
[30]
Scaling Behavior of Discrete Diffusion Language Models,
D. von Rütte, A. Orvieto, J. Fluri, O. Pooladzandi, B. Schölkopf, and T. Hofmann, “Scaling Behavior of Discrete Diffusion Language Models,” in The Fourteenth International Conference on Learning Representations, 2026
2026
-
[31]
Scaling Beyond Masked Diffusion Language Models,
S. S. Sahoo, J.-M. Lemercier, Z. Yang, J. Deschenaux, J. Liu, J. Thickstun, and A. Jukic, “Scaling Beyond Masked Diffusion Language Models,” arXiv preprint arXiv:2602.15014, 2026
-
[32]
Librispeech Transducer Model with Internal Language Model Prior Correction,
A. Zeyer, A. Merboldt, W. Michel, R. Schlüter, and H. Ney, “Librispeech Transducer Model with Internal Language Model Prior Correction,” in Interspeech, 2021, pp. 2052–2056
2021
-
[33]
Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition,
Z. Meng, S. Parthasarathy, E. Sun, Y. Gaur, N. Kanda, L. Lu, X. Chen, R. Zhao, J. Li, and Y. Gong, “Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition,” in IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 243–250
2021
-
[34]
On Density Estimation with Diffusion Models,
D. P. Kingma, T. Salimans, B. Poole, and J. Ho, “On Density Estimation with Diffusion Models,” in The Thirty-fifth Annual Conference on Neural Information Processing Systems, 2021
2021
-
[35]
DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation,
S. Gong, R. Zhang, H. Zheng, J. Gu, N. Jaitly, L. Kong, and Y. Zhang, “DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation,” in The Fourteenth International Conference on Learning Representations, 2026
2026
-
[36]
A Continuous Time Framework for Discrete Denoising Models,
A. Campbell, J. Benton, V. D. Bortoli, T. Rainforth, G. Deligiannidis, and A. Doucet, “A Continuous Time Framework for Discrete Denoising Models,” in The Thirty-sixth Annual Conference on Neural Information Processing Systems, 2022
2022
-
[37]
Librispeech: An ASR Corpus Based on Public Domain Audio Books,
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR Corpus Based on Public Domain Audio Books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210
2015
-
[38]
Decoupled Weight Decay Regularization,
I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” in The Seventh International Conference on Learning Representations, 2019
2019
-
[39]
Scalable Diffusion Models with Transformers,
W. Peebles and S. Xie, “Scalable Diffusion Models with Transformers,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4172–4182
2023
-
[40]
Dropout: A Simple Way to Prevent Neural Networks from Overfitting,
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014
2014
-
[41]
SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing,
T. Kudo and J. Richardson, “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71
2018
-
[42]
Reproducing and dissecting denoising language models for speech recognition,
D. Koch, A. Zeyer, N. Rossenbach, R. Schlüter, and H. Ney, “Reproducing and Dissecting Denoising Language Models for Speech Recognition,” arXiv preprint arXiv:2512.13576, 2025
-
[43]
Denoising LM: Pushing the limits of error correction models for speech recognition,
Z. Gu, T. Likhomanenko, H. Bai, E. McDermott, R. Collobert, and N. Jaitly, “Revisiting ASR Error Correction with Specialized Models,” arXiv preprint arXiv:2405.15216, 2026
-
[44]
Diffusion Beats Autoregressive in Data-Constrained Settings,
M. Prabhudesai, M. Wu, A. Zadeh, K. Fragkiadaki, and D. Pathak, “Diffusion Beats Autoregressive in Data-Constrained Settings,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025