Whisfusion: Parallel ASR Decoding with Masked Diffusion

Heeju Jwa; Hyuk-jae Lee; Hyungon Ryu; Jongchan Kim; Junhyuk Ahn; Nam-Joon Kim; Siwon Park; Taegeun Yun; Taeyoun Kwon; Yoonchae Choi

arxiv: 2508.07048 · v2 · pith:CVBEYW2Unew · submitted 2025-08-09 · 💻 cs.SD · cs.AI · cs.LG · eess.AS

classification 💻 cs.SD · cs.AI · cs.LG · eess.AS

keywords: masked, diffusion, accuracy, whisfusion, models, while, bottleneck, competitive

Autoregressive (AR) encoder-decoder models dominate high-quality multilingual ASR, but their left-to-right decoders make inference latency scale with transcript length. CTC-style non-autoregressive (NAR) systems are a natural alternative that avoids this bottleneck, but their conditional independence assumption sacrifices transcript-level generative modeling. Masked diffusion language models (e.g., LLaDA, MDLM) offer a competitive NAR text-generation approach. We ask whether such models can bring NAR ASR into the accuracy regime of strong AR ASR systems while removing the left-to-right bottleneck. We propose Whisfusion, which trains a dedicated masked diffusion decoder from scratch on top of frozen Whisper-large-v3 audio embeddings, denoising masked transcripts in just a few steps. We train on ~68k hours of 11-language speech with high-mask specialization to align training with the fully masked starting point of inference, and decode via Parallel Diffusion Decoding. Whisfusion surpasses Whisper-large-v3 on group-average accuracy across English, European, and CJK benchmarks while running 4-5x faster, and it also surpasses Whisper-turbo in both accuracy and throughput. It reaches accuracy competitive with Canary and Qwen3-ASR while running 3-7x faster. These results establish masked diffusion as a Pareto-competitive non-autoregressive paradigm for high-throughput multilingual transcription. Code and model weights are available at https://github.com/taeyoun811/Whisfusion.
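
To make the "fully masked start, few denoising steps" idea concrete, here is a minimal sketch of generic confidence-based parallel unmasking. It is not the paper's Parallel Diffusion Decoding, whose details are not given in the abstract. The names `MASK`, `make_toy_denoiser`, and `parallel_unmask` are hypothetical, and the toy denoiser stands in for an audio-conditioned decoder by returning a known target with random confidences.

```python
import random

MASK = -1  # hypothetical mask token id, chosen only for this sketch


def make_toy_denoiser(target, seed=0):
    """Build a stand-in for an audio-conditioned masked diffusion decoder.

    A real decoder would predict tokens from audio embeddings; this toy
    simply returns the known target with random confidences.
    """
    rng = random.Random(seed)
    calls = {"n": 0}  # counts how many decoder passes were made

    def denoiser(tokens):
        calls["n"] += 1
        # One (predicted_token, confidence) pair per position, all at once.
        return [(target[i], rng.random()) for i in range(len(tokens))]

    return denoiser, calls


def parallel_unmask(length, steps, denoiser):
    """Start fully masked and reveal the most confident tokens each step."""
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = denoiser(tokens)  # a single parallel pass over all positions
        # Spread the remaining masked positions evenly over remaining steps.
        k = -(-len(masked) // (steps - step))  # ceiling division
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return tokens


target = [7, 3, 9, 1, 4, 4, 8, 2, 6, 5, 0, 3]
denoiser, calls = make_toy_denoiser(target)
decoded = parallel_unmask(len(target), steps=4, denoiser=denoiser)
print(decoded == target, calls["n"])
```

In this toy run the decoder is called once per step, so a 12-token transcript needs 4 passes rather than the 12 sequential passes a left-to-right decoder would make. That gap between step count and transcript length is the source of the latency advantage the abstract describes.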

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Diffusion Language Models for Speech Recognition
cs.CL 2026-04 unverdicted novelty 7.0

Diffusion language models and a CTC-USDM joint decoder improve ASR accuracy over standard approaches.