CTC-Seeded Token Edit Refinement for Non-Autoregressive Speech Recognition

Wanting Huang; Weiran Wang

arxiv: 2606.28732 · v1 · pith:NRDKAQYInew · submitted 2026-06-27 · 📡 eess.AS


Pith reviewed 2026-06-30 08:49 UTC · model grok-4.3

classification 📡 eess.AS

keywords non-autoregressive ASR · CTC hypothesis · edit refinement · diffusion loss · speech recognition · parallel decoding · word error rate


The pith

Non-autoregressive ASR is achieved by parallel edit refinement of a greedy CTC hypothesis in two steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes to seed non-autoregressive automatic speech recognition with a collapsed connectionist temporal classification hypothesis instead of random or masked token sequences. An acoustic-conditioned Edit Flow decoder then predicts insertion, deletion and substitution operations directly on that hypothesis in parallel. The decoder is trained jointly with the CTC model using a continuous-time discrete diffusion loss. Just two edit steps produce substantial word error rate reductions, with further gains from classifier-free guidance and CTC-based confidence constraints on the edits.
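As an illustration of the seed described above, here is a minimal sketch of greedy CTC decoding followed by the standard collapse rule (merge consecutive repeats, then drop blanks). The function name `greedy_ctc_collapse`, the blank index 0, and the toy frame scores are assumptions made for this example; they are not taken from the paper.

```python
from itertools import groupby

BLANK = 0  # assumed blank index; the paper's vocabulary layout is not given here


def greedy_ctc_collapse(frame_scores, blank=BLANK):
    """Return the collapsed greedy CTC hypothesis as a list of token ids.

    frame_scores: list of per-frame score lists, one score per vocabulary id.
    """
    # Step 1: pick the highest-scoring label in every frame (greedy path).
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_scores]
    # Step 2: merge consecutive repeats, then remove blanks.
    merged = [label for label, _ in groupby(best_path)]
    return [label for label in merged if label != blank]


# Toy posteriors over a 3-symbol vocabulary (id 0 is the blank).
frames = [
    [0.10, 0.80, 0.10],  # label 1
    [0.20, 0.70, 0.10],  # label 1 again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank, separates the two occurrences of label 1
    [0.10, 0.60, 0.30],  # label 1
    [0.10, 0.20, 0.70],  # label 2
]
hypothesis = greedy_ctc_collapse(frames)
print(hypothesis)  # [1, 1, 2]
```

The collapsed sequence has variable length, which is why a refinement stage that starts from it needs insertions and deletions in addition to substitutions.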

Core claim

ASR decoding is formulated as variable-length edit refinement of a greedy CTC hypothesis. An acoustic-conditioned Edit Flow decoder operates directly on the collapsed CTC hypothesis, predicting insertion, deletion, and substitution operations in parallel. The Edit Flow decoder is jointly trained with a CTC model using a continuous-time discrete diffusion loss. During inference, two edit steps yield substantial WER reductions, classifier-free guidance enhances quality by focusing on audio features, and edit proposals are constrained using CTC confidence.
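The abstract does not say how CTC confidence constrains the edit proposals. One plausible reading, sketched below purely as an assumption, is to score each collapsed token by the mean posterior of the frames that produced it and to allow edits only where that score falls below a threshold. The helpers `token_confidences` and `editable_mask` and the threshold value 0.9 are hypothetical.

```python
from itertools import groupby


def token_confidences(frame_probs, blank=0):
    """Pair each collapsed greedy CTC token with a confidence score.

    Confidence here is the mean of the winning posterior over the run of
    frames that emitted the token (an assumed definition, not the paper's).
    """
    # (label, probability) of the best label in each frame.
    best = [max(enumerate(frame), key=lambda kv: kv[1]) for frame in frame_probs]
    scored = []
    for label, run in groupby(best, key=lambda lp: lp[0]):
        if label == blank:
            continue  # blanks do not survive the collapse
        probs = [p for _, p in run]
        scored.append((label, sum(probs) / len(probs)))
    return scored


def editable_mask(scored_tokens, threshold=0.9):
    """True where the decoder would be allowed to propose an edit."""
    return [conf < threshold for _, conf in scored_tokens]


frame_probs = [
    [0.02, 0.97, 0.01],  # label 1, confident
    [0.03, 0.95, 0.02],  # label 1, same run
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.30, 0.50],  # label 2, uncertain
]
scored = token_confidences(frame_probs)
mask = editable_mask(scored)
print([label for label, _ in scored], mask)  # [1, 2] [False, True]
```

Under this reading, confident CTC tokens are protected from spurious edits while uncertain positions remain open to correction.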

What carries the argument

The Edit Flow decoder, which takes a collapsed CTC hypothesis and predicts parallel insert/delete/substitute operations conditioned on acoustics via continuous-time discrete diffusion.
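To make the parallel edit step concrete, the sketch below applies one round of insert, delete, and substitute decisions to a hypothesis in a single pass. The edit representation (one operation per token plus a dictionary of insertion slots) and the function `apply_edits` are assumptions for illustration; the paper's actual parameterization is not described in the abstract.

```python
def apply_edits(tokens, token_ops, insertions):
    """Apply one parallel round of edits to a token sequence.

    tokens:     current hypothesis, a list of tokens.
    token_ops:  one op per token: ("keep",), ("del",) or ("sub", new_token).
    insertions: dict mapping slot index to a token; slot i means "insert
                before tokens[i]", slot len(tokens) means "append at the end".
    """
    if len(token_ops) != len(tokens):
        raise ValueError("need exactly one operation per token")
    out = []
    for i, (tok, op) in enumerate(zip(tokens, token_ops)):
        if i in insertions:
            out.append(insertions[i])
        if op[0] == "keep":
            out.append(tok)
        elif op[0] == "sub":
            out.append(op[1])
        elif op[0] != "del":
            raise ValueError(f"unknown operation: {op[0]}")
    if len(tokens) in insertions:
        out.append(insertions[len(tokens)])
    return out


# Toy hypothesis with one wrong word, one repeated word and one missing word.
hyp = ["the", "cad", "sat", "sat", "the", "mat"]
ops = [("keep",), ("sub", "cat"), ("keep",), ("del",), ("keep",), ("keep",)]
refined = apply_edits(hyp, ops, insertions={4: "on"})
print(" ".join(refined))  # the cat sat on the mat
```

Because every decision refers to positions in the input sequence, all of them can be predicted at once, and the output length changes without a separate length predictor.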

If this is right

Two edit steps yield substantial Word Error Rate reductions.
Classifier-free guidance further enhances recognition quality by focusing the model on audio features.
Constraining edit proposals using CTC confidence improves accuracy.
Decoder pretraining and pretrained encoder integration yield significant additional performance gains.
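The classifier-free guidance item above can be sketched with the generic guidance rule, which extrapolates from an unconditional prediction toward the conditional one. The assumptions here are that the unconditional branch is the decoder run without audio and that guidance is applied to logits; the paper may apply it to edit rates or use a different scale convention.

```python
import math


def cfg_combine(cond_logits, uncond_logits, scale):
    """Generic classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 1.0 returns the audio-conditioned logits unchanged;
    scale > 1.0 pushes the prediction further toward what the audio supports.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


cond = [2.0, 0.0]    # hypothetical logits with audio conditioning
uncond = [1.0, 0.0]  # hypothetical logits with the audio dropped
guided = cfg_combine(cond, uncond, scale=2.0)
print(guided)  # [3.0, 0.0]
print(softmax(guided)[0] > softmax(cond)[0])  # True
```

With a scale above 1, options favored by the audio gain probability mass relative to what the text-only prior would choose.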

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The seeding strategy could extend to other sequence-to-sequence tasks where an initial fast hypothesis exists.
Fewer refinement iterations may support lower-latency inference on edge devices for real-time transcription.
Joint CTC-diffusion training might reduce reliance on external length predictors in related non-autoregressive models.

Load-bearing premise

The Edit Flow decoder can reliably predict accurate parallel edit operations directly from the collapsed CTC hypothesis without many iterations or external length predictors.

What would settle it

An experiment in which two edit steps fail to produce lower word error rates than a single step or than the raw CTC baseline would falsify the claim that two parallel edits suffice.
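The comparison described above reduces to computing word error rate for the raw CTC output and for each refinement step. A minimal sketch follows, using word-level Levenshtein distance; the toy transcripts are invented for illustration and are not results from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(refs, hyps):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


# Toy data only: one reference and three hypothetical system outputs.
refs = ["the cat sat on the mat"]
systems = {
    "ctc":   ["the cad sat sat the mat"],
    "step1": ["the cat sat the mat"],
    "step2": ["the cat sat on the mat"],
}
scores = {name: wer(refs, hyps) for name, hyps in systems.items()}
print(scores)
```

The two-step claim would be in doubt if, on a real test set, the `step2` score were not lower than both the `step1` and `ctc` scores.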

Figures

Figures reproduced from arXiv: 2606.28732 by Wanting Huang, Weiran Wang.

Figure 2: Examples of two-step Edit Flow refinement. For all examples, the step 2 decoding results match the ground-truth transcripts.

Original abstract

Non-autoregressive automatic speech recognition (ASR) enables parallel decoding, but many refinement-based methods begin from random, fully masked, or fixed-length token sequences, requiring multiple iterations to reconstruct the complete transcript. We instead formulate ASR decoding as a variable-length edit refinement of a greedy connectionist temporal classification (CTC) hypothesis. An acoustic-conditioned Edit Flow decoder operates directly on the collapsed CTC hypothesis, predicting insertion, deletion, and substitution operations in parallel. The Edit Flow decoder is jointly trained with a CTC model using a continuous-time discrete diffusion loss. During inference, we find that just two edit steps yield substantial Word Error Rate (WER) reductions, and classifier-free guidance (CFG) further enhances recognition quality by focusing the model on audio features. We also constrain edit proposals using CTC confidence to improve accuracy. Finally, ablation studies validate our design choices, while decoder pretraining and pretrained encoder integration yield significant additional performance gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a clean reformulation of non-autoregressive ASR as two-step edit refinement seeded from CTC, using a diffusion-trained decoder, but the abstract supplies no numbers so the size of the gain is still unclear.


The main move is to treat decoding as parallel insert/delete/substitute edits on a collapsed greedy CTC hypothesis rather than starting from random or masked tokens. An acoustic-conditioned Edit Flow decoder does the edits under a continuous-time discrete diffusion loss, trained jointly with the CTC model. They report that two steps plus CFG and CTC-confidence masking are enough, with extra gains from pretraining the decoder and swapping in a pretrained encoder.

The variable-length handling through edits on the CTC output is a straightforward way to avoid an external length predictor, and the diffusion objective fits the parallel prediction task. The stress-test note is right that the description is internally consistent with no obvious circularity or missing length mechanism.

The limitation is that the abstract gives no WER figures, baselines, datasets, or stability checks, so it is impossible to tell whether the claimed reductions are large enough to matter or hold up under standard conditions. If the full paper shows clear comparisons and ablations on common benchmarks, that would make the efficiency claim more convincing.

This is for people already working on non-autoregressive ASR or diffusion-based sequence models. A reader looking for a modest practical speed-up in that niche could get something out of it.

I would send it for peer review; the idea is coherent enough that the experiments deserve a proper look.

Referee Report

0 major / 3 minor

Summary. The manuscript formulates non-autoregressive ASR decoding as variable-length edit refinement starting from a collapsed greedy CTC hypothesis. An acoustic-conditioned Edit Flow decoder predicts parallel insertion, deletion, and substitution operations under a continuous-time discrete diffusion objective and is jointly trained with the CTC model. Inference is limited to two edit steps augmented by classifier-free guidance and CTC-confidence masking; the authors report WER reductions, supported by ablation studies, decoder pretraining, and pretrained-encoder integration.

Significance. If the empirical gains hold under the reported conditions, the CTC-seeded diffusion edit formulation could reduce iteration count relative to mask-based NAR baselines while preserving parallel decoding. The joint training procedure and explicit use of CTC confidence for masking constitute concrete, testable design choices that merit attention if accompanied by reproducible code or parameter settings.

minor comments (3)

[Abstract] The abstract states that two edit steps yield WER reductions and that CFG plus CTC constraints help, yet supplies no numerical values, baseline names, or dataset identifiers; the results section should present these quantities with error bars or statistical tests to allow direct verification of the central claim.
[§3] Clarify whether the Edit Flow decoder requires an external length predictor or whether length is implicitly handled by the diffusion process on the collapsed CTC sequence; a short paragraph or diagram in §3 would remove ambiguity.
[Ablation studies] Table or figure captions should explicitly list the exact WER values, number of parameters, and training steps for each ablation so that the contribution of pretraining versus the edit refinement itself can be isolated.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately captures the core contributions of CTC-seeded edit refinement with a diffusion-based Edit Flow decoder.

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper introduces a new formulation of non-autoregressive ASR as variable-length edit refinement starting from a collapsed CTC hypothesis, using an acoustic-conditioned Edit Flow decoder trained jointly via continuous-time discrete diffusion loss. Inference uses two steps with CFG and CTC-confidence masking. No equations, self-citations, or fitted parameters are presented that reduce the claimed WER improvements or parallel edit predictions to inputs by construction. The approach is self-contained with empirical support from ablations and pretraining gains, consistent with the reader's assessment of score 2.0 for minor or absent circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters, axioms, or invented entities; the 'Edit Flow decoder' is introduced as a new component but its internal parameterization is not described.

pith-pipeline@v0.9.1-grok · 5685 in / 1085 out tokens · 39185 ms · 2026-06-30T08:49:52.053376+00:00 · methodology


