Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

Hyeongseop Rha; Jeong Hun Yeo; Minsu Kim; Yong Man Ro

arxiv: 2605.29613 · v1 · pith:IKSESUBInew · submitted 2026-05-28 · 📡 eess.AS · cs.SD

keywords: decoding, accuracy, confidence, strategies, while, autoregressive, fixed-number, high

While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
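
The abstract names three unmasking schemes but does not give their exact rules, so the following is only a minimal sketch of how such schemes are commonly written, not the authors' implementation. All function names, the linear threshold schedule in `dynamic_threshold`, the fallback that always commits at least one token, and the use of mean negative log-likelihood over committed tokens are assumptions made for illustration.

```python
import math


def select_fixed_number(confidences, masked, k):
    """Fixed-number scheme: commit the k most confident masked positions."""
    ranked = sorted(masked, key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:k])


def select_by_threshold(confidences, masked, threshold):
    """Threshold scheme: commit every masked position at or above threshold.

    Assumption: if nothing clears the threshold, commit the single most
    confident position so that every round makes progress.
    """
    chosen = [i for i in sorted(masked) if confidences[i] >= threshold]
    if not chosen and masked:
        chosen = [max(masked, key=lambda i: confidences[i])]
    return chosen


def dynamic_threshold(round_idx, total_rounds, start=0.9, end=0.5):
    """Hypothetical dynamic schedule: linear interpolation from start to end."""
    if total_rounds <= 1:
        return end
    frac = min(round_idx, total_rounds - 1) / (total_rounds - 1)
    return start + (end - start) * frac


def mean_nll(confidences, positions):
    """Mean negative log-likelihood of the committed tokens (uncertainty proxy)."""
    return sum(-math.log(confidences[i]) for i in positions) / len(positions)


if __name__ == "__main__":
    # Toy per-position confidences for one decoding round (made-up values).
    conf = [0.99, 0.40, 0.95, 0.97, 0.55]
    masked = {0, 1, 2, 3, 4}
    print(select_fixed_number(conf, masked, 2))
    print(select_by_threshold(conf, masked, 0.9))
    print(select_by_threshold(conf, masked, dynamic_threshold(1, 3)))
```

With toy values like these, a threshold rule commits all high-confidence positions in a single round and leaves only the low-confidence ones for later rounds, whereas a fixed-number rule commits the same count each round regardless of how many tokens are already reliable. This mirrors the intuition stated in the abstract, but it says nothing about the reported results.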
