arXiv: 2604.19079 · v1 · submitted 2026-04-21 · 📡 eess.AS · cs.AI · cs.CL · cs.HC

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

Andrei Andrusenko, Boris Ginsburg, Lilit Grigoryan, Nune Tadevosyan, Vitaly Lavrukhin, Vladimir Bataev

Pith reviewed 2026-05-10 01:43 UTC · model grok-4.3

keywords: unified ASR · RNNT transducer · streaming speech recognition · offline ASR · consistency regularization · mode-consistency · chunked attention · low-latency ASR

The pith

A single RNNT model can close the accuracy gap between offline and low-latency streaming ASR by training with mode-consistency regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that one transducer-based ASR model can be trained to deliver strong results in both offline batch processing and real-time streaming by combining chunk-limited attention with a regularization term that forces agreement between the two operating modes. This unification matters because maintaining separate models for each setting raises engineering overhead and limits reuse of large trained checkpoints. The authors show through experiments that the added regularization improves streaming word error rates at low latency while leaving offline accuracy intact and continuing to benefit from bigger model sizes and more training data.

Core claim

The authors introduce a Unified ASR framework for RNNT that supports both offline and streaming decoding in one model through chunk-limited attention with right context and dynamic chunked convolutions. They then add mode-consistency regularization (MCR-RNNT), implemented efficiently in Triton, which penalizes disagreement between the offline and streaming forward passes during training. The reported experiments indicate that this reduces the performance gap without harming either mode.
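To make "chunk-limited attention with right context" concrete, here is a minimal NumPy sketch of the kind of attention mask such an encoder might use. The function names, the `left_chunks` option, and the exact masking rule are illustrative assumptions, not the paper's implementation; the page does not spell out the authors' masking scheme.

```python
import numpy as np


def chunk_attention_mask(num_frames, chunk_size, right_context, left_chunks=None):
    """Boolean mask [T, T]; mask[i, j] is True if query frame i may attend to key frame j.

    Hypothetical rule (an assumption, not the paper's code): a frame sees its own
    chunk, `right_context` frames past the end of that chunk, and either all past
    frames or only `left_chunks` previous chunks.
    """
    idx = np.arange(num_frames)
    chunk_id = idx // chunk_size
    # Last visible key: end of the frame's own chunk plus the right-context frames.
    last_visible = np.minimum(
        (chunk_id + 1) * chunk_size - 1 + right_context, num_frames - 1
    )
    if left_chunks is None:
        first_visible = np.zeros(num_frames, dtype=int)  # unlimited left context
    else:
        first_visible = np.maximum((chunk_id - left_chunks) * chunk_size, 0)
    keys = idx[None, :]
    return (keys >= first_visible[:, None]) & (keys <= last_visible[:, None])


def offline_mask(num_frames):
    """Offline mode: every frame attends to every other frame."""
    return np.ones((num_frames, num_frames), dtype=bool)


if __name__ == "__main__":
    # 6 encoder frames, chunks of 2 frames, 1 frame of right context.
    print(chunk_attention_mask(6, chunk_size=2, right_context=1).astype(int))
```

Under this rule, a chunk that spans the whole utterance reproduces the offline mask, which is one way to see how a single set of weights can serve both modes. In a multi-layer encoder, per-layer right context can accumulate across layers, so real systems have to account for that when computing latency.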

What carries the argument

Mode-consistency regularization for RNNT (MCR-RNNT), which adds a loss term that encourages identical token predictions and alignments when the same input is processed under offline versus streaming (chunked) configurations.
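As a rough illustration of what a mode-consistency term could look like, the sketch below computes a KL divergence between the offline and streaming output distributions at every node of the transducer lattice and adds it to the two RNNT losses. The extracted Table 2 caption mentions a KLD teacher, a KLD weight λ, and an offline weight α, but the way they are combined here, the default values, and all function names are assumptions for illustration only.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def lattice_kld(teacher_logits, student_logits):
    """Mean KL(teacher || student) over all (t, u) nodes of a [T, U, V] joint lattice."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    kl_per_node = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_node.mean())


def unified_loss(rnnt_offline, rnnt_streaming, kld, alpha=0.5, lam=0.1):
    """Hypothetical combination: alpha weights the offline RNNT loss, lam the KLD term."""
    return alpha * rnnt_offline + (1.0 - alpha) * rnnt_streaming + lam * kld


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    offline_logits = rng.normal(size=(4, 3, 8))    # T=4 frames, U=3 label steps, V=8 tokens
    streaming_logits = rng.normal(size=(4, 3, 8))
    kld = lattice_kld(offline_logits, streaming_logits)  # offline pass acts as teacher
    print(unified_loss(rnnt_offline=2.0, rnnt_streaming=4.0, kld=kld))
```

In a real training loop the teacher distribution would typically be detached from the gradient, and the lattice computation would run in a fused GPU kernel rather than in NumPy; neither detail is modeled here.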

If this is right

Streaming accuracy at low latency improves while offline accuracy stays the same.
The unified model continues to benefit from scaling to larger sizes and larger training sets.
A single trained checkpoint can be deployed for both batch transcription and real-time applications.
The open-sourced English model provides a concrete starting point for further work on unified transducers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same consistency idea could be tested on other transducer variants or non-transducer ASR architectures to see if the gap closes without architecture-specific changes.
Production systems might reduce the number of maintained model versions by switching from separate offline and streaming checkpoints to one regularized model.
Similar regularization between training and inference modes could be explored for other latency-sensitive sequence tasks such as machine translation or speech synthesis.

Load-bearing premise

Enforcing agreement between the offline and streaming forward passes will not prevent the model from learning representations that are optimal for each setting individually.

What would settle it

A controlled ablation in which adding MCR-RNNT increases the offline-streaming word-error-rate gap or raises error rates in both modes on the same training data and model size.

Figures

Figures reproduced from arXiv: 2604.19079 by Andrei Andrusenko, Boris Ginsburg, Lilit Grigoryan, Nune Tadevosyan, Vitaly Lavrukhin, Vladimir Bataev.

**Figure 1.** Unified Transducer training in dual mode with mode-consistency regularization (MCR-RNNT) loss.

**Figure 2.** LibriSpeech test-other WER (%) for different chunk and right-context balances during inference under fixed total latency budgets from 0.32 s to 1.12 s (chunk + right context).

Original abstract

Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A practical recipe for one RNNT that handles both offline and streaming ASR, with open-sourced code, but the size of the gains is still unclear without numbers.

The paper gives a workable way to train a single RNN-T model that supports both offline and low-latency streaming decoding. The authors combine chunk-limited attention with right context, dynamic chunked convolutions, and a mode-consistency regularization term called MCR-RNNT to push the two modes closer together during training. The English checkpoint and Triton implementation are released, which is the part that actually lowers the barrier for others to use it.

Referee Report

2 major / 3 minor

Summary. The paper proposes a unified RNN-T ASR framework supporting both offline and low-latency streaming decoding in a single model via chunk-limited attention with right context and dynamic chunked convolutions. It introduces an efficient Triton-based mode-consistency regularization (MCR-RNNT) to encourage agreement between training modes and thereby reduce the offline-streaming performance gap. The central empirical claim is that the approach improves streaming accuracy at low latency while preserving offline performance and scales to larger models and datasets. The framework and an English checkpoint are open-sourced.

Significance. If the empirical results hold under rigorous validation, the work would be significant for practical ASR systems by enabling cost-effective unified models that avoid separate offline and streaming deployments. The open-sourcing of code and a checkpoint is a clear strength for reproducibility.

major comments (2)

[Abstract / Experiments] The abstract asserts experimental improvements in streaming accuracy at low latency while preserving offline performance, yet supplies no quantitative metrics, baselines, dataset sizes, or error analysis; without these the data-to-claim link cannot be verified (see also Experiments section).
[Method (MCR-RNNT)] The mode-consistency regularization is presented as encouraging agreement across modes without reducing capacity for optimal representations in either setting. If the consistency term is applied to joint-network or prediction-network outputs (as implied by the RNNT formulation), it implicitly penalizes mode-specific deviations; this can only preserve offline performance if the offline optimum already lies close to the streaming optimum, which is not guaranteed a priori and is not directly tested by a single joint training run or by ablations comparing to separately optimized models.

minor comments (3)

[Method] Clarify the precise mathematical form of the consistency loss (e.g., which outputs are compared and the weighting schedule) with an equation reference.
[Experiments] Add explicit statements of chunk size, right-context length, and latency targets in the experimental setup for reproducibility.
[Experiments] Ensure all tables report both absolute WER/CER and relative improvements with confidence intervals or multiple runs.
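For readers who want to reproduce the kind of reporting the minor comments ask for, the following sketch computes corpus-level WER from reference/hypothesis pairs and the relative improvement between two systems. It is a generic textbook computation with hypothetical function names, not the paper's evaluation code, and it omits text normalization and confidence intervals.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_length) at the word level."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """WER (%) pooled over all (reference, hypothesis) pairs."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction (%) of the new system over the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    pairs = [("the cat sat", "the hat sat down"), ("hello world", "hello world")]
    print(f"WER: {corpus_wer(pairs):.2f}%")
    print(f"Relative improvement: {relative_improvement(10.0, 8.0):.1f}%")
```

Reporting both numbers side by side matters because a large relative gain on an already low WER can correspond to a very small absolute change. Variation across training seeds could be summarized with a mean and standard deviation on top of these functions.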

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below, proposing targeted revisions to strengthen the manuscript while preserving its core contributions.

Referee: [Abstract / Experiments] The abstract asserts experimental improvements in streaming accuracy at low latency while preserving offline performance, yet supplies no quantitative metrics, baselines, dataset sizes, or error analysis; without these the data-to-claim link cannot be verified (see also Experiments section).

Authors: We agree that including key quantitative results in the abstract would improve verifiability. In the revised version, we will add specific WER reductions for low-latency streaming (e.g., relative improvements over baselines), confirmation of offline WER parity, training dataset scale (hours of audio), and reference to the main experimental tables. The full baselines, dataset details, and error analysis remain in Section 4; this change makes the abstract self-contained without altering its length substantially. revision: yes
Referee: [Method (MCR-RNNT)] The mode-consistency regularization is presented as encouraging agreement across modes without reducing capacity for optimal representations in either setting. If the consistency term is applied to joint-network or prediction-network outputs (as implied by the RNNT formulation), it implicitly penalizes mode-specific deviations; this can only preserve offline performance if the offline optimum already lies close to the streaming optimum, which is not guaranteed a priori and is not directly tested by a single joint training run or by ablations comparing to separately optimized models.

Authors: This is a valid concern about the implicit assumption in joint training. Our experiments (Table 3 and ablations in Section 4.3) show that the unified model achieves offline WER statistically indistinguishable from or better than the offline-only baseline while improving streaming, suggesting the optima are sufficiently close under our chunking and regularization. The MCR-RNNT loss is applied with a small weighting factor (0.1) and only on selected outputs to avoid over-constraining capacity. However, we acknowledge the absence of an explicit side-by-side comparison against independently optimized offline and streaming models. We will add this ablation experiment in the revision to directly address the point. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results rest on independent training experiments

The paper presents a training framework (chunk-limited attention, dynamic convolutions, and MCR-RNNT regularization) whose central claims are validated by direct experiments on streaming vs. offline WER across model scales and datasets. No derivation chain, equation, or uniqueness theorem reduces the reported gains to quantities defined by the method itself; the consistency term is an added loss applied during joint training, and performance differences are measured against baselines rather than forced by construction. No self-citations are load-bearing for the core results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard neural-network training assumptions and domain knowledge about transducer models; no new physical entities are postulated and free parameters are typical hyperparameters.

free parameters (1)

chunk size and right-context length
These control the latency-accuracy trade-off and are expected to be tuned on validation data.
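The trade-off can be made tangible by enumerating how a fixed latency budget splits between chunk size and right context. The sketch below assumes an 80 ms encoder frame, derived from the x8 subsampling mentioned in the extracted experimental setup combined with an assumed 10 ms feature hop; the hop size and the function name are illustrative assumptions.

```python
FRAME_MS = 80  # assumed: 10 ms feature hop (assumption) x 8 subsampling


def budget_splits(budget_ms, frame_ms=FRAME_MS):
    """List (chunk_frames, right_context_frames) pairs that exactly fill the budget."""
    if budget_ms % frame_ms != 0:
        raise ValueError("budget must be a whole number of encoder frames")
    total_frames = budget_ms // frame_ms
    # A chunk needs at least one frame; the rest of the budget goes to right context.
    return [(chunk, total_frames - chunk) for chunk in range(1, total_frames + 1)]


if __name__ == "__main__":
    for budget_ms in (320, 560, 1120):
        print(budget_ms, "ms:", budget_splits(budget_ms))
```

Under these assumptions the budgets named in the Figure 2 caption (0.32 s, 0.56 s, 1.12 s) correspond to 4, 7, and 14 encoder frames, so each budget admits only a small number of chunk/right-context configurations to evaluate.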

axioms (1)

domain assumption: The RNN-T loss remains a suitable objective for both offline and streaming modes when context is limited.
Standard assumption in the ASR literature invoked by the unified training setup.

Reference graph

Works this paper leans on

34 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Maintaining separate models for these regimes increases the cost of model development, training, validation, and deployment

Introduction. Deploying automatic speech recognition (ASR) systems commonly requires both high-accuracy offline transcription and low-latency streaming performance. Maintaining separate models for these regimes increases the cost of model development, training, validation, and deployment. All of these motivate efforts to train a single unified model ...
[2]

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

Method. We train a single RNNT model with shared parameters to support both offline and streaming decoding. The model follows the standard Transducer design with encoder, predictor, and joint. Our encoder uses Conformer-style blocks with multi-head attention (MHA) and convolution modules. To enable streaming, we restrict MHA and convolution context dur...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

ASR modeling and evaluation: As the main ASR architecture, we used RNNT model based on FastConformer encoder [22] with 123M parameters

Experimental setup. 3.1. ASR modeling and evaluation. As the main ASR architecture, we used RNNT model based on FastConformer encoder [22] with 123M parameters. The input features are 128-dim FBanks with x8 initial subsampling. The prediction network (decoder) is a single-layer LSTM with 640 units, which increased the total model size to 128M parameters. Al...
[4]

Table 2: Average WER (%) on Open ASR Leaderboard for different training configurations of KLD teacher, KLD weight λ, and offline α for the same unified RNNT L-size model

Results. Table 1 presents the main evaluation results for the considered models in the offline and streaming decoding scenarios. Table 2: Average WER (%) on Open ASR Leaderboard for different training configurations of KLD teacher, KLD weight λ, and offline α for the same unified RNNT L-size model. Configuration Variable Offline 1.12s 0.56s 0.32s KLD Teacher...
[5]

Conclusion. We propose a new Unified ASR framework that achieves robust Transducer performance in both offline and streaming decoding scenarios. In addition to using chunk-limited attention and dynamic chunked convolutions, we introduce a novel mode-consistency regularization loss (MCR-RNNT), which further reduces the gap between offline and streaming e...
[6]

Transformer transducer: One model unifying streaming and non-streaming speech recognition,

A. Tripathi, J. Kim, Q. Zhang, H. Lu, and H. Sak, “Transformer transducer: One model unifying streaming and non-streaming speech recognition,” ArXiv, vol. abs/2010.03192, 2020

work page arXiv 2010
[7]

Dual-mode asr: Unify and improve streaming asr with full-context modeling,

J. Yu, W. Han, A. Gulati, C.-C. Chiu, B. Li, T. N. Sainath, Y. Wu, and R. Pang, “Dual-mode asr: Unify and improve streaming asr with full-context modeling,” ICLR, 2021

2021
[8]

Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit,

Z. Yao, D. Wu, X. Wang, B. Zhang, F. Yu, C. Yang, Z. Peng, X. Chen, L. Xie, and X. Lei, “Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit,” in Interspeech, 2021

2021
[9]

Learning a dual-mode speech recognition model via self-pruning,

C. Liu, Y. Shangguan, H. Yang, Y. Shi, R. Krishnamoorthi, and O. Kalinli, “Learning a dual-mode speech recognition model via self-pruning,” SLT, pp. 273–279, 2022

2022
[10]

Sequence transduction with recurrent neural networks,

A. Graves, “Sequence transduction with recurrent neural networks,” in ICML, 2012

2012
[11]

Conformer: Convolution-augmented transformer for speech recognition,

A. Gulati, J. Qin, C.-C. Chiu et al., “Conformer: Convolution-augmented transformer for speech recognition,” Proc. Interspeech 2020, pp. 5036–5040, 2020

2020
[12]

Developing real-time streaming transformer transducer for speech recognition on large-scale dataset,

X. Chen, Y. Wu, Z. Wang, S. Liu, and J. Li, “Developing real-time streaming transformer transducer for speech recognition on large-scale dataset,” ICASSP, pp. 5904–5908, 2020

2020
[13]

Streaming automatic speech recognition with the transformer model,

N. Moritz, T. Hori, and J. L. Roux, “Streaming automatic speech recognition with the transformer model,” ICASSP, pp. 6074–6078, 2020

2020
[14]

Stateful conformer with cache-based inference for streaming automatic speech recognition,

V. Noroozi, S. Majumdar, A. Kumar, J. Balam, and B. Ginsburg, “Stateful conformer with cache-based inference for streaming automatic speech recognition,” ICASSP, 2023

2023
[15]

Unifying streaming and non-streaming zipformer-based asr,

B. Sharma, K. P. Durai, S. Venkatesan, J. Prakash, S. Kumar, M. Chetlur, and A. Stolcke, “Unifying streaming and non-streaming zipformer-based asr,” ACL, 2025

2025
[16]

Improving streaming speech recognition with time-shifted contextual attention and dynamic right context masking,

K. Le and D. T. Chau, “Improving streaming speech recognition with time-shifted contextual attention and dynamic right context masking,” Interspeech, 2024

2024
[17]

Dynamic chunk convolution for unified streaming and non-streaming conformer asr,

X. Li, G. Huybrechts, S. Ronanki, J. J. Farris, and S. Bodapati, “Dynamic chunk convolution for unified streaming and non-streaming conformer asr,” ICASSP, 2023

2023
[18]

All-in-one asr: Unifying encoder-decoder models of ctc, attention, and transducer in dual-mode asr,

T. Moriya, M. Mimura, T. Tanaka, H. Sato, R. Masumura, and A. Ogawa, “All-in-one asr: Unifying encoder-decoder models of ctc, attention, and transducer in dual-mode asr,” 2025

2025
[19]

Open asr leaderboard: Towards reproducible and transparent multilingual and long-form speech recognition evaluation,

V. Srivastav, S. Zheng, E. Bezzam, E. L. Bihan, N. Koluguri, P. Żelasko, S. Majumdar, A. Moumen, and S. Gandhi, “Open asr leaderboard: Towards reproducible and transparent multilingual and long-form speech recognition evaluation,” 2025

2025
[20]

Parakeet tdt 0.6b v2 (en),

NVIDIA, “Parakeet tdt 0.6b v2 (en),” 2025. [Online]. Available: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2

2025
[21]

Nemotron-speech-streaming-en-0.6b,

——, “Nemotron-speech-streaming-en-0.6b,” January 2026. [Online]. Available: https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b/tree/nemotron-speech-streaming-jan2026

2026
[22]

Cr-ctc: Consistency regularization on ctc for improved speech recognition,

Z. Yao, W. Kang, X. Yang, F. Kuang, L. Guo, H. Zhu, Z. Jin, Z. Li, L. Lin, and D. Povey, “Cr-ctc: Consistency regularization on ctc for improved speech recognition,” ICLR, 2025

2025
[23]

Transducer consistency regularization for speech to text applications,

C. Tseng, Y. Tang, and V. R. Apsingekar, “Transducer consistency regularization for speech to text applications,” SLT, 2024

2024
[24]

PyTorch: An imperative style, high-performance deep learning library,

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “PyTorch: An imperative style, high-performance deep learning library,” in NeurIPS, vol. 32, 2019

2019
[25]

Triton: an intermediate language and compiler for tiled neural network computations,

P. Tillet, H.-T. Kung, and D. Cox, “Triton: an intermediate language and compiler for tiled neural network computations,” in Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019, pp. 10–19

2019
[26]

Nemo: a toolkit for building ai applications using neural modules,

O. Kuchaiev, J. Li, H. Nguyen et al., “Nemo: a toolkit for building ai applications using neural modules,” in NeurIPS Workshop on Systems for ML, 2019

2019
[27]

Fast Conformer with linearly scalable attention for efficient speech recognition,

D. Rekesh, N. R. Koluguri, S. Kriman, et al., “Fast Conformer with linearly scalable attention for efficient speech recognition,” in Automatic Speech Recognition and Understanding Workshop (ASRU), 2023

2023
[28]

Emmett: Efficient multimodal machine translation training,

P. Żelasko, Z. Chen, M. Wang, D. Galvez, O. Hrinchuk, S. Ding, K. Hu, J. Balam, V. Lavrukhin, and B. Ginsburg, “Emmett: Efficient multimodal machine translation training,” ICASSP, 2024

2024
[29]

Granary: Speech recognition and translation dataset in 25 european languages. arXiv preprint arXiv:2505.13404, 2025

N. R. Koluguri, M. Sekoyan, G. Zelenfroynd, S. Meister, S. Ding, S. Kostandian, H. Huang, N. Karpov, J. Balam, V. Lavrukhin, Y. Peng, S. Papi, M. Gaido, A. Brutti, and B. Ginsburg, “Granary: Speech recognition and translation dataset in 25 european languages,” Interspeech, vol. abs/2505.13404, 2025

work page arXiv 2025
[30]

Neural machine translation of rare words with subword units,

R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016

2016
[31]

Label-looping: Highly efficient decoding for transducers,

V. Bataev, H. Xu, D. Galvez, V. Lavrukhin, and B. Ginsburg, “Label-looping: Highly efficient decoding for transducers,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 7–13

2024
[32]

Speed of light exact greedy decoding for rnn-t speech recognition models on gpu,

D. Galvez, V. Bataev, H. Xu, and T. Kaldewey, “Speed of light exact greedy decoding for rnn-t speech recognition models on gpu,” in Interspeech 2024, 2024, pp. 277–281

2024
[33]

Transformers are ssms: Generalized models and efficient algorithms through structured state space duality,

T. Dao and A. Gu, “Transformers are ssms: Generalized models and efficient algorithms through structured state space duality,” ICML, 2024

2024
[34]

Canary-qwen-2.5b,

NVIDIA, “Canary-qwen-2.5b,” 2025. [Online]. Available: https://huggingface.co/nvidia/canary-qwen-2.5b

2025