TRADE: Transducer-Augmented Decoder for Speech LLM

Shanil Puri; Shinji Watanabe; Subhabrata Mukherjee; Yun Tang

arxiv: 2606.08486 · v1 · pith:3VPCCPZCnew · submitted 2026-06-07 · 💻 cs.CL

Pith reviewed 2026-06-27 18:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords speech LLM · transducer · streaming ASR · offline ASR · long-form speech · end-of-utterance detection · chunk-synchronized training

The pith

A transducer branch added to a speech LLM lets one checkpoint support both offline and streaming recognition across latency points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TRADE to address the lack of streaming capability in speech LLMs, whose label-synchronous generation lacks acoustic-frame alignment. It augments the LLM with a transducer branch that reuses the audio encoder and LLM hidden states as the prediction network. Three choices enable the result: tightly coupled dual vocabularies for score fusion, chunk-synchronized training to remove train-inference mismatch, and Localized Decoder Audio Attention to bound memory for long utterances. This yields 6.71 percent average WER offline and 8.40 percent streaming at 960 ms chunks from the identical checkpoint, plus long-form results without external segmentation and better end-of-utterance detection when combined with acoustic VAD.

Core claim

TRADE augments a multimodal LLM with a transducer branch that shares the audio encoder and uses the LLM's hidden states directly as the prediction network. With tightly coupled dual vocabularies, chunk-synchronized streaming training with gradient stopping, and Localized Decoder Audio Attention, a single checkpoint supports offline and streaming decoding across a continuous range of latency operating points, achieving 6.71 percent average WER on the Open ASR Leaderboard and 8.40 percent streaming with 960 ms chunks from the same checkpoint, along with 3.64 percent WER on TED-LIUM and 10.88 percent on Earnings-22 without external segmentation.
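The coupling described above can be sketched as a standard additive transducer joint in which the LLM hidden states take the place of a separate prediction network. This is a toy illustration, not the paper's implementation: the layer sizes, the `adaptor` projection (standing in for the Decoder-to-Joint Adaptor), and the `tanh` combination are assumptions, since the page does not specify those details.

```python
import math
import random


def linear(x, weights, bias):
    """Compute W x + b for one vector x; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def init_params(d_enc, d_llm, d_joint, vocab_size, seed=0):
    """Random toy weights; every dimension here is an illustrative assumption."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "enc": (matrix(d_joint, d_enc), [0.0] * d_joint),
        # Hypothetical stand-in for the Decoder-to-Joint Adaptor.
        "adaptor": (matrix(d_joint, d_llm), [0.0] * d_joint),
        "out": (matrix(vocab_size, d_joint), [0.0] * vocab_size),
    }


def joint_lattice(enc_frames, llm_states, params):
    """Return lattice[t][u]: log-probs over the transducer vocabulary.

    enc_frames: T vectors from the shared audio encoder.
    llm_states: U LLM hidden-state vectors used as the prediction network.
    """
    enc_proj = [linear(f, *params["enc"]) for f in enc_frames]
    dec_proj = [linear(h, *params["adaptor"]) for h in llm_states]
    lattice = []
    for e in enc_proj:
        row = []
        for d in dec_proj:
            hidden = [math.tanh(a + b) for a, b in zip(e, d)]
            row.append(log_softmax(linear(hidden, *params["out"])))
        lattice.append(row)
    return lattice
```

Each lattice cell pairs one acoustic frame with one label position, which is what gives the transducer path its frame-synchronous alignment.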

What carries the argument

The transducer branch that uses the LLM hidden states directly as its prediction network, together with chunk-synchronized training and Localized Decoder Audio Attention (LDAA) for causal memory control.
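One way to picture the memory bound that LDAA is said to provide is a causal window over audio frames backed by a fixed-capacity cache. The sketch below assumes a window measured in encoder frames and a simple drop-oldest policy; the function and class names are hypothetical, and the paper's actual windowing rule may differ.

```python
from collections import deque


def ldaa_visible_frames(latest_frame, window):
    """Audio frame indices a decoder step may attend to.

    Causal: nothing after `latest_frame`. Localized: at most `window` frames.
    """
    start = max(0, latest_frame - window + 1)
    return list(range(start, latest_frame + 1))


class SlidingKVCache:
    """Fixed-capacity key/value store; the oldest frame is evicted first."""

    def __init__(self, window):
        self.entries = deque(maxlen=window)

    def append(self, frame_index, key, value):
        self.entries.append((frame_index, key, value))

    def __len__(self):
        return len(self.entries)

    def oldest_frame(self):
        return self.entries[0][0]


# Usage: the cache size stays at `window` however long the utterance is.
cache = SlidingKVCache(window=16)
for t in range(1000):
    cache.append(t, key=None, value=None)  # placeholders for real tensors
```

Because the cache holds at most `window` entries, its memory footprint depends on the window size rather than on utterance length, which is the property the abstract attributes to LDAA.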

If this is right

A single trained model can be deployed for any chosen latency operating point without retraining.
Long utterances can be processed end-to-end without relying on external segmentation.
Sentence-end punctuation timestamps from the transducer improve end-of-utterance detection when fused with acoustic VAD.
The same architecture supports a continuous spectrum of chunk sizes while preserving linguistic reasoning from the LLM.
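The end-of-utterance point above can be illustrated with a simple fusion rule: keep a VAD silence onset only when a sentence-end punctuation timestamp falls close to it. This is a hypothetical rule for illustration; the page does not say how the paper combines the two signals, and the `tolerance_ms` value is an assumption.

```python
def fuse_eou(silence_onsets_ms, punct_times_ms, tolerance_ms=300):
    """Return VAD silence onsets confirmed by a nearby sentence-end punctuation.

    silence_onsets_ms: times at which acoustic VAD detects the start of silence.
    punct_times_ms: transducer timestamps of sentence-end punctuation tokens.
    """
    confirmed = []
    for onset in silence_onsets_ms:
        if any(abs(onset - p) <= tolerance_ms for p in punct_times_ms):
            confirmed.append(onset)
    return confirmed


# Usage: the pause at 2500 ms has no punctuation nearby, so it is treated
# as a mid-sentence hesitation rather than an end of utterance.
decisions = fuse_eou([1000, 2500, 4000], [950, 4100])
```

A rule of this shape trades some recall for precision: pauses without linguistic evidence of a sentence boundary are ignored.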

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Production ASR systems could reduce the number of distinct models they maintain by adopting this shared-branch design.
The approach may extend to other multimodal sequence tasks that need both batch and real-time modes from one set of weights.
Further scaling the dual-vocabulary coupling could test whether fusion remains zero-cost at larger LLM vocabularies.

Load-bearing premise

The transducer branch can share the LLM hidden states and dual vocabularies without introducing train-inference mismatch or accuracy loss that would require separate models or post-hoc fixes.

What would settle it

An experiment showing that the same checkpoint cannot reach both the reported 6.71 percent offline WER and 8.40 percent streaming WER without separate training runs or post-training adjustments would falsify the claim.
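Running such a check requires scoring the same checkpoint's offline and streaming hypotheses with an identical WER routine. A minimal word-level WER sketch follows; it assumes whitespace tokenization and no text normalization, and it assumes the leaderboard "average" is a plain mean of per-dataset WERs, neither of which is confirmed by the page.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_length) at the word level."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """WER in percent over (reference, hypothesis) pairs of one dataset."""
    errors = sum(word_errors(r, h)[0] for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return 100.0 * errors / words


def mean_wer(per_dataset_wer):
    """Unweighted mean across datasets (assumed averaging scheme)."""
    return sum(per_dataset_wer) / len(per_dataset_wer)
```

Applying `corpus_wer` to offline and streaming outputs produced from one set of weights, with no retraining in between, is the comparison the falsification test calls for.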

Figures

Figures reproduced from arXiv: 2606.08486 by Shanil Puri, Shinji Watanabe, Subhabrata Mukherjee, Yun Tang.

**Figure 1.** TRADE architecture. A shared Conformer encoder feeds both an LLM path (cross-entropy loss) and a transducer path (transducer loss); the LLM hidden states serve as the transducer prediction network via the Decoder-to-Joint Adaptor.

**Figure 2.** Comparison of LLM tokens and verbalized tokens.

**Figure 3.** WER vs. Average Lagging (AL) (Ma et al., 2019) trade-off on LibriSpeech dev-other across six chunk sizes (labels in ms). AL measures how much later each token is emitted relative to an ideal same-pace policy; lower AL indicates lower latency.

**Figure 4.** Streaming latency metrics on LibriSpeech
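The Average Lagging metric named in the Figure 3 caption can be sketched from its original definition in Ma et al. (2019): average, over tokens up to the first one emitted after the whole source is consumed, the gap between the actual delay and the delay of an ideal policy that keeps pace with the source. The version below applies that formula to delays measured in milliseconds; speech toolkits use adapted variants, so treat this as an approximation rather than the paper's exact scorer.

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging following the text-based definition.

    delays[i]: amount of source consumed when target token i (0-based) is emitted.
    src_len: total source length, in the same unit as `delays`.
    tgt_len: number of target tokens.
    """
    total, count = 0.0, 0
    for i, delay in enumerate(delays):
        ideal = i * src_len / tgt_len  # same-pace policy for token i
        total += delay - ideal
        count += 1
        if delay >= src_len:  # stop at the first token that saw everything
            break
    return total / count


# Usage: four tokens over a 4000 ms utterance, emitted at 960 ms chunk ends.
al_ms = average_lagging([960, 1920, 2880, 4000], src_len=4000, tgt_len=4)
```

A fully offline system, which emits every token only after the last frame, scores an AL equal to the utterance length, which is why lower AL indicates lower latency.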

Original abstract

Speech Large Language Models (Speech LLMs) lack a principled mechanism for streaming inference: their label-synchronous generation has no acoustic-frame alignment, making real-time decoding and end-of-utterance detection difficult. We propose TRADE (TRansducer-Augmented DEcoder), which augments a multimodal LLM with a transducer branch that shares the audio encoder and uses the LLM's hidden states directly as the prediction network -- coupling frame-synchronous acoustic alignment with the LLM's linguistic reasoning. Three design choices make the system accurate, streamable, and long-form capable: (1) tightly coupled dual vocabularies -- a compact transducer vocabulary derived from the LLM vocabulary, enabling zero-cost score fusion; (2) chunk-synchronized streaming training with gradient stopping, eliminating the train-inference mismatch at offline-equivalent memory cost; and (3) Localized Decoder Audio Attention (LDAA), a causal sliding window that caps KV-cache memory independently of utterance length. A single TRADE checkpoint supports offline and streaming decoding across a continuous range of latency operating points. TRADE achieves 6.71% average WER on the Open ASR Leaderboard, while streaming recognition with a 960 ms chunk size reaches 8.40% from the same checkpoint. On long-form speech, it obtains 3.64% WER on TED-LIUM and 10.88% on Earnings-22 without external segmentation. TRADE provides sentence-end punctuation timestamps that, when combined with acoustic voice activity detection (VAD), improve end-of-utterance detection by +0.03 F1 over acoustic VAD alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TRADE adds a transducer branch reusing LLM states for streaming Speech LLMs with one checkpoint, but the evidence for no mismatch and continuous latency is thin without ablations.

The main point is that TRADE attaches a transducer to a multimodal LLM so the same checkpoint can do both offline and streaming ASR. It reuses the LLM hidden states as the transducer prediction network, adds dual-vocabulary fusion, chunk-synchronized training with gradient stopping, and LDAA to bound memory.
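To make the chunked setting concrete, the sketch below shows generic chunk bookkeeping: how an utterance splits into fixed-size chunks and how much audio is available when a given frame is processed. It is not the paper's training procedure; the 80 ms frame duration is an assumed value chosen only so that a 960 ms chunk maps to a whole number of frames, and all function names are hypothetical.

```python
def ms_to_frames(chunk_ms, frame_ms=80):
    """Chunk size in encoder frames; `frame_ms` is an assumed frame duration."""
    if chunk_ms % frame_ms:
        raise ValueError("chunk size must be a multiple of the frame duration")
    return chunk_ms // frame_ms


def chunk_spans(num_frames, chunk_frames):
    """Half-open (start, end) frame spans that cover the utterance in order."""
    return [
        (start, min(start + chunk_frames, num_frames))
        for start in range(0, num_frames, chunk_frames)
    ]


def visible_until(frame, chunk_frames, num_frames):
    """Exclusive end of the audio context available while processing `frame`.

    A frame sees everything up to the end of its own chunk and nothing beyond.
    """
    return min((frame // chunk_frames + 1) * chunk_frames, num_frames)
```

If training applies the same visibility rule that inference does, the model never learns to rely on audio it will not have at decode time, which is the mismatch the chunk-synchronized scheme is meant to remove.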

What is new is the specific coupling: the transducer shares the audio encoder and LLM states directly, the compact transducer vocab comes from the LLM vocab for zero-cost fusion, and the training uses gradient stopping to avoid mismatch at offline memory cost. LDAA caps the KV cache with a causal window. The paper reports 6.71% average WER on the Open ASR Leaderboard offline and 8.40% at 960 ms chunk size from the identical model, plus 3.64% on TED-LIUM and 10.88% on Earnings-22 without segmentation, and a small gain in end-of-utterance detection when punctuation timestamps are added to VAD.
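The phrase "zero-cost fusion" suggests that, because the compact transducer vocabulary is derived from the LLM vocabulary, combining the two scores reduces to an index lookup. A minimal sketch under that reading follows; the mapping table, the interpolation weight, and the log-linear combination are all assumptions rather than details taken from the paper.

```python
def fuse_scores(transducer_logp, llm_logp, trans_to_llm, weight=0.5):
    """Log-linear fusion over the compact transducer vocabulary.

    transducer_logp[i]: transducer log-prob of compact token i.
    llm_logp[j]: LLM log-prob of LLM token j.
    trans_to_llm[i]: LLM token id that compact token i was derived from.
    """
    return [
        score + weight * llm_logp[trans_to_llm[i]]
        for i, score in enumerate(transducer_logp)
    ]


# Usage with a two-token compact vocabulary mapped into a three-token LLM one.
fused = fuse_scores([-1.0, -2.0], [-0.5, -3.0, -1.5], trans_to_llm=[2, 0])
```

No re-tokenization or string matching is needed at decode time in this scheme, since every compact token already has a fixed counterpart in the LLM vocabulary.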

The work is useful for anyone who needs practical streaming on top of existing Speech LLMs. The architecture choices address the label-synchronous problem head-on and the numbers are on public benchmarks.

The soft spots are the missing experimental details. The abstract gives results but no ablations on gradient stopping, no mismatch metric between offline and streaming passes, and only two latency points rather than a sweep across chunk sizes. The concern about residual train-inference mismatch therefore stands: if gradient stopping blocks gradient flow through the shared states, the continuous-range claim is harder to accept on the current evidence. The full paper's experiments would need to include those controls to make the central claim solid.

This is for ASR engineers working on real-time deployment of large speech models. A reader who cares about streaming latency trade-offs could extract concrete design ideas. It is coherent enough on its own terms to deserve a serious referee who can check the ablations and reproducibility.

Referee Report

2 major / 2 minor

Summary. The paper proposes TRADE, a transducer-augmented decoder for Speech LLMs. It augments a multimodal LLM with a transducer branch that shares the audio encoder and uses LLM hidden states as the prediction network, combined with tightly coupled dual vocabularies for zero-cost fusion, chunk-synchronized streaming training with gradient stopping to eliminate train-inference mismatch, and Localized Decoder Audio Attention (LDAA) for bounded KV-cache memory. A single checkpoint is claimed to support both offline and streaming decoding over a continuous latency range, achieving 6.71% average WER on the Open ASR Leaderboard (offline) and 8.40% at 960 ms chunk size (streaming), plus strong long-form results on TED-LIUM (3.64%) and Earnings-22 (10.88%) without external segmentation, and improved end-of-utterance detection via punctuation timestamps.

Significance. If the single-checkpoint claim and mismatch elimination hold under rigorous verification, the work offers a principled integration of frame-synchronous transducer alignment with LLM linguistic reasoning, enabling flexible latency operating points without separate models or high memory overhead. This would be a meaningful advance for practical deployment of Speech LLMs in real-time and long-form ASR.

major comments (2)

[Abstract] The central claim that 'a single TRADE checkpoint supports offline and streaming decoding across a continuous range of latency operating points' rests on chunk-synchronized training with gradient stopping eliminating mismatch, yet only two operating points are reported (6.71% offline, 8.40% at 960 ms) with no results shown for intermediate chunk sizes and no ablation isolating the gradient-stopping component.
[Abstract] No explicit train-inference mismatch metric (e.g., divergence between offline and streaming forward passes on identical inputs) or ablation on the shared transducer branch using LLM hidden states is supplied, leaving the effectiveness of gradient stopping and dual-vocabulary fusion unverified despite being load-bearing for the single-checkpoint architecture.

minor comments (2)

[Abstract] The Open ASR Leaderboard WER comparison lacks explicit baseline models, test-set breakdown, or error analysis to contextualize the reported gains.
[Abstract] LDAA is described only at a high level; a concrete definition of the causal sliding window and its interaction with the transducer branch would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential significance of the single-checkpoint architecture. We address each major comment below and will incorporate revisions to provide stronger empirical support for the claims.

Referee: [Abstract] The central claim that 'a single TRADE checkpoint supports offline and streaming decoding across a continuous range of latency operating points' rests on chunk-synchronized training with gradient stopping eliminating mismatch, yet only two operating points are reported (6.71% offline, 8.40% at 960 ms) with no results shown for intermediate chunk sizes and no ablation isolating the gradient-stopping component.

Authors: We agree that the current reporting of only the offline and 960 ms points provides limited direct evidence for continuous latency coverage. The chunk-synchronized training with gradient stopping is designed to support arbitrary chunk sizes by aligning train and inference distributions. In revision we will add WER results for intermediate chunk sizes (320 ms, 480 ms, 640 ms) on the Open ASR Leaderboard and include an ablation that isolates gradient stopping by comparing the full method against a variant without gradient stopping.

Revision: yes

Referee: [Abstract] No explicit train-inference mismatch metric (e.g., divergence between offline and streaming forward passes on identical inputs) or ablation on the shared transducer branch using LLM hidden states is supplied, leaving the effectiveness of gradient stopping and dual-vocabulary fusion unverified despite being load-bearing for the single-checkpoint architecture.

Authors: We acknowledge that an explicit mismatch metric and targeted ablations on the shared transducer branch and dual-vocabulary fusion are absent. We will add (1) a quantitative mismatch metric (token-level prediction divergence and alignment error between offline and streaming forward passes on identical inputs), (2) an ablation replacing LLM hidden states with a dedicated prediction network, and (3) an ablation comparing tightly coupled dual-vocabulary fusion against independent scoring. These will appear in the revised manuscript.

Revision: yes
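The token-level prediction divergence mentioned in the rebuttal could take the form of a mean KL divergence between the offline and streaming output distributions at each token position. The sketch below is one possible instantiation and is not a metric defined by the paper; it assumes both passes yield aligned per-token probability vectors over the same vocabulary.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_token_divergence(offline_dists, streaming_dists):
    """Average per-token KL between offline and streaming predictions."""
    if len(offline_dists) != len(streaming_dists):
        raise ValueError("both passes must cover the same token positions")
    total = sum(kl_divergence(p, q) for p, q in zip(offline_dists, streaming_dists))
    return total / len(offline_dists)
```

A value near zero on identical inputs would support the claim that the streaming pass reproduces the offline predictions; a value that grows as chunks shrink would quantify the residual mismatch.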

Circularity Check

0 steps flagged

No significant circularity; architecture and results are independently constructed

The paper introduces TRADE as a new architectural augmentation to Speech LLMs, specifying three concrete design choices (dual vocabularies, chunk-synchronized training with gradient stopping, and LDAA) and reporting empirical WER numbers on external benchmarks (Open ASR Leaderboard, TED-LIUM, Earnings-22). No equations, derivations, or parameter-fitting steps are described that reduce the central claims (single-checkpoint offline/streaming support, continuous latency range) back to the inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing justifications. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the design relies on standard assumptions in transducer and LLM training that are not enumerated here.

Reference graph

Works this paper leans on

116 extracted references · 40 canonical work pages · 11 internal anchors

[1] N. Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In ACL.
[2] Ye Bai, Jingping Chen, Jitong Chen, and 1 others. 2024. Seed-ASR: Understanding Diverse Speech and Contexts with LLM-Based Speech Recognition. Preprint, arXiv:2407.04675. https://arxiv.org/abs/2407.04675
[3] Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, and Boris Ginsburg. 2024a. SALM: Speech-Augmented Language Model with In-Context Learning for Speech Recognition and Translation. In Proc. ICASSP. https://arxiv.org/abs/2310.09424
[4] Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, and Boris Ginsburg. 2024b. BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5. In Proc. SLT. https://arxiv.org/abs/2406.19954
[7] Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. Moshi: A Speech-Text Foundation Model for Real-Time Dialogue. Preprint, arXiv:2410.00037. https://arxiv.org/abs/2410.00037
[8] Keqi Deng, Wenxi Chen, Xie Chen, and Philip C. Woodland. 2025. SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation. In Proc. ACL. https://arxiv.org/abs/2504.15509
[9] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[10] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. In Proc. ICASSP. https://arxiv.org/abs/1303.5778
[11] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-Augmented Transformer for Speech Recognition. In Proc. Interspeech. https://arxiv.org/abs/2005.08100
[12] Ankit Gupta, George Saon, and Brian Kingsbury. 2024. Exploring the Limits of Decoder-Only Models Trained on Public Speech Recognition Corpora. In Proc. Interspeech. https://arxiv.org/abs/2402.00235
[13] François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation. In Proc. SPECOM. https://arxiv.org/abs/1805.04699
[14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In Proc. ICLR. https://arxiv.org/abs/2106.09685
[15] Nithin Rao Koluguri, Monica Sekoyan, Ante Jukić, Somshubra Majumdar, Vitaly Lavrukhin, Jagadeesh Balam, and Boris Ginsburg. 2025a. Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST. Preprint, arXiv:2509.14128. https://arxiv.org/abs/2509.14128
[16] Nithin Rao Koluguri, Monica Sekoyan, Gilad Zelenfroynd, Slava Meister, Shangshang Ding, Sergei Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yi Peng, Sara Papi, Marco Gaido, Adriano Brutti, and Boris Ginsburg. 2025b. https://arxiv.org/abs/2505.13404 Granary: Speech Recognition and Translation Dataset in 25 European Languages h...
[17] Nithin Rao Koluguri, Georgy Zelenfroind, Vitaly Lavrukhin, Jagadeesh Balam, and Boris Ginsburg. 2024. Investigating End-to-End ASR Architectures for Long Form Audio Transcription. In Proc. ICASSP. https://arxiv.org/abs/2309.09950
[18] Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, and Daniel Povey. 2022. Pruned RNN-T for Fast, Memory-Efficient ASR Training. In Proc. Interspeech. https://arxiv.org/abs/2206.13236
[19] Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, and Christian Fuegen. 2024. End-to-End Speech Recognition Contextualization with Large Language Models. In Proc. ICASSP, pages 12406-12410. https://arxiv.org/abs/2309.10917
[20] Dan Liu, Mengge Du, Xiaoxi Li, Ya Li, and Enhong Chen. 2021. Cross Attention Augmented Transducer Networks for Simultaneous Translation. In Proc. EMNLP. https://aclanthology.org/2021.emnlp-main.4
[21] Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework. In Proc. ACL. https://aclanthology.org/P19-1289
[22] Xutai Ma, Juan Pino, James Cross, Liezl Puzon, and Jiatao Gu. 2020a. Monotonic multihead attention. In ICLR.
[23] Xutai Ma, Mohammad Javad Salameh, Ljiljana Majstorovic, Elena Meylan, Roldano Cattoni, Mattia A. Di Gangi, Sara Papi, Luisa Bentivogli, Marcello Federico, and Philipp Koehn. 2020b. SimulEval: An Evaluation Toolkit for Simultaneous Translation. In Proc. EMNLP (Demo). https://aclanthology.org/2020.emnlp-demos.19
[24] Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, and Xie Chen. 2024. An Embarrassingly Simple Approach for LLM with Strong ASR Capacity. Preprint, arXiv:2402.08846. https://arxiv.org/abs/2402.08846
[25] Iain McCowan, Jean Carletta, Wessel Kraaij, Simone Ashby, Samuel Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, and 1 others. 2005. The AMI meeting corpus. In Proc. International Conference on Methods and Techniques in Behavioral Research.
[26] Takafumi Moriya, Masato Mimura, Tomohiro Tanaka, Hiroshi Sato, Ryo Masumura, and Atsunori Ogawa. 2024. All-in-One ASR: Unifying Encoder-Decoder Models of CTC, attention, and transducer in dual-mode ASR. Preprint, arXiv:2512.11543. https://arxiv.org/abs/2512.11543
[27] Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussà, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, and Emmanuel Dupoux. 2025. https://arxiv.org/abs/2402.05755 SpiRit-LM: Interleaved Spoken and Written Language Model. Transactions of the Associati...
[28] Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Nathaniel Macedo, and 1 others. 2021. https://arxiv.org/abs/2104.02014 SPGI Speech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Re...
[29] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In Proc. ICASSP.
[30] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proc. Interspeech. https://arxiv.org/abs/1904.08779
[31] Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. MLS: A Large-Scale Multilingual Dataset for Speech Research. In Proc. Interspeech. https://arxiv.org/abs/2012.03411
[32] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. In Proc. ICML. https://arxiv.org/abs/2212.04356
[33] Dima Rekesh, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, Henry Juang, Oleksii Hrinchuk, Ankur Kumar, and Boris Ginsburg. 2023. Fast conformer with linearly scalable attention for efficient speech recognition. In ASRU. https://arxiv.org/abs/2305.05084
[34] Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Liu, Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr Zelasko, and Miguel Jetté. 2022. Earnings-22: A Practical Benchmark for Accents in the Wild. Preprint, arXiv:2203.15591. https://arxiv.org/abs/2203.15591
[35] Miguel Del Rio, Peter Ha, Quinten McNamara, Corey Miller, and Shipra Chandra. 2021. Earnings-21: A Practical Benchmark for ASR in the Wild. Preprint, arXiv:2104.11348. https://arxiv.org/abs/2104.11348
[36] Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, and Chunyang Wu. 2024. Speech ReaLLM: Real-Time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time. In Proc. Interspeech. https://arxiv.org/abs/2406.09569
[37] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proc. ACL. https://arxiv.org/abs/1508.07909
[38] Silero Team. 2021. Silero VAD: Pre-trained Enterprise-Grade Voice Activity Detector. https://github.com/snakers4/silero-vad
[39] Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, and Sanchit Gandhi. 2025. https://arxiv.org/abs/2510.06961 Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation. Preprint,...
[40] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2024. SALMONN: Towards Generic Hearing Abilities for Large Language Models. In Proc. ICLR. https://arxiv.org/abs/2310.13289
[41] Yun Tang, Eesung Kim, and Vijendra Raj Apsingekar. 2025. Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data. In Proc. Interspeech. https://arxiv.org/abs/2506.19159
[42] Yun Tang, Anna Sun, Hirofumi Inaguma, Xinyue Chen, Ning Dong, Xutai Ma, Paden Tomasello, and Juan Pino. 2023. Hybrid Transducer and Attention Based Encoder-Decoder Modeling for Speech-to-Text Tasks. In Proc. ACL. https://arxiv.org/abs/2305.03101
[44] Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021. https://doi.org/10.18653/v1/2021.acl-long.80 https://aclanthology.org/2021.acl-long.80/ VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. I...
[45] Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul K. Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, and Yonghui Wu. 2023. https://arxiv.org/abs/2310.00230 SLM: Bridge the Thin Gap Between Speech and Text Foundation Models ....
[46] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/Attention Architecture for End-to-End Speech Recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240-1253. https://doi.org/10.1109/JSTSP.2017.2763455
[47] Felix Weninger, Marco Gaudesi, Md. Akmal Haidar, Nicola Ferri, Jesús Andrés-Ferrer, and Puming Zhan. 2022. Conformer with dual-mode chunked attention for joint online and offline ASR. In Interspeech.
[48] Zhifei Xie and Changqiao Wu. 2024. Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming. Preprint, arXiv:2408.16725. https://arxiv.org/abs/2408.16725
[49] Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, F. Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, and Xin Lei. 2020. Unified streaming and non-streaming two-pass end-to-end model for speech recognition. ArXiv, abs/2012.05481.
[50] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Proc. ICASSP.
[51] Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, and Daniel Povey. Proc. Interspeech.
[52] Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Khe Chai Sim, and Shankar Kumar. Proc. ICASSP.
[53] Faris Khalil Botros, Thibault de Boissiere, Ha Nguyen, and Imran Sheikh. Proc. Interspeech.
[54] Hainan Xu, Fangjun Kuang, Liyong Guo, Yifan Yang, Long Lin, Hao Wen, Hao Yao, and Daniel Povey. 2023.
[55] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Proc. Interspeech.
[56] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. Proc. Interspeech.
[57] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017.
[58] Yun Tang, Anna Sun, Hirofumi Inaguma, Xinyue Chen, Ning Dong, Xutai Ma, Paden Tomasello, and Juan Pino. Proc. ACL.
[59] Dan Liu, Mengge Du, Xiaoxi Li, Ya Li, and Enhong Chen. Proc. EMNLP.
[60] Yun Tang and Cindy Tseng. arXiv preprint arXiv:2509.15579.
[61] Yun Tang, Eesung Kim, and Vijendra Raj Apsingekar. Proc. Interspeech.
[62] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Proc. ICML.
[63] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Proc. ICASSP.
[64]

Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , title =

Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , title =. Proc. ICLR , year =
[65]

Junnan Li and Dongxu Li and Silvio Savarese and Steven Hoi , title =. Proc. ICML , year =
[66]

2015 , eprint =

Caglar Gulcehre and Orhan Firat and Kelvin Xu and Kyunghyun Cho and Loic Barrault and Huei-Chi Lin and Fethi Bougares and Holger Schwenk and Yoshua Bengio , title =. 2015 , eprint =

2015
[67]

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. In Proc. ICLR.
[68]

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023.
[69]

Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, and Zal. 2023.
[70]

Shengpeng Ji, Chaofan Tian, Minghui Fang, Jialong Zuo, Jiawei Chen, Zhengqi Wen, Baolong Bi, Zu-Yu Kan, Tao Jin, and Zhou Zhao. 2024.
[71]

Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang. In Proc. NeurIPS.
[72]

Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. In Proc. NeurIPS.
[73]

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. In Proc. ICASSP.
[74]

Vaibhav Srivastav, Steven Zheng, Eric Bezzam, and Eustache. 2025.
[75]

Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, and Xie Chen. 2024.
[76]

Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul K. Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, and Yonghui Wu. In Proc. ASRU.
[77]

Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, and Boris Ginsburg. In Proc. ICASSP.
[78]

Wenyi Yu, Changli Tang, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. In Proc. ICASSP.
[79]

Ye Bai, Jingping Chen, Jitong Chen, and others. 2024.
[80]

Francesco Verdini, Danni Liu, Jan Niehues, Marco Gaido, and Luisa Bentivogli. In Proc. Interspeech.
[81]

Tsz Kin Lam, Marco Gaido, Sara Papi, Luisa Bentivogli, and Barry Haddow. In Proc. NAACL.
[82]

Dominik Wagner, Alexander Churchill, Siddharth Sigtia, and Erik Marchi. In Proc. ICASSP.
[83]

Ankit Gupta, George Saon, and Brian Kingsbury. In Proc. Interspeech.


[1]

N. Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In ACL.

[2]

Ye Bai, Jingping Chen, Jitong Chen, and others. 2024. Seed-ASR: Understanding Diverse Speech and Contexts with LLM-Based Speech Recognition. Preprint, arXiv:2407.04675. https://arxiv.org/abs/2407.04675

[3]

Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, and Boris Ginsburg. 2024a. SALM: Speech-Augmented Language Model with In-Context Learning for Speech Recognition and Translation. In Proc. ICASSP. https://arxiv.org/abs/2310.09424

[4]

Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, and Boris Ginsburg. 2024b. BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5. In Proc. SLT. https://arxiv.org/abs/2406.19954

[5]

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. Moshi: A Speech-Text Foundation Model for Real-Time Dialogue. Preprint, arXiv:2410.00037. https://arxiv.org/abs/2410.00037

[6]

Keqi Deng, Wenxi Chen, Xie Chen, and Philip C. Woodland. 2025. SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation. In Proc. ACL. https://arxiv.org/abs/2504.15509

[7]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and others. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[8]

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. In Proc. ICASSP. https://arxiv.org/abs/1303.5778

[9]

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-Augmented Transformer for Speech Recognition. In Proc. Interspeech. https://arxiv.org/abs/2005.08100

[10]

Ankit Gupta, George Saon, and Brian Kingsbury. 2024. Exploring the Limits of Decoder-Only Models Trained on Public Speech Recognition Corpora. In Proc. Interspeech. https://arxiv.org/abs/2402.00235

[11]

François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève. 2018. TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation. In Proc. SPECOM. https://arxiv.org/abs/1805.04699

[12]

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In Proc. ICLR. https://arxiv.org/abs/2106.09685

[13]

Nithin Rao Koluguri, Monica Sekoyan, Ante Jukić, Somshubra Majumdar, Vitaly Lavrukhin, Jagadeesh Balam, and Boris Ginsburg. 2025a. Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST. Preprint, arXiv:2509.14128. https://arxiv.org/abs/2509.14128

[14]

Nithin Rao Koluguri, Monica Sekoyan, Gilad Zelenfroynd, Slava Meister, Shangshang Ding, Sergei Kostandian, He Huang, Nikolay Karpov, Jagadeesh Balam, Vitaly Lavrukhin, Yi Peng, Sara Papi, Marco Gaido, Adriano Brutti, and Boris Ginsburg. 2025b. Granary: Speech Recognition and Translation Dataset in 25 European Languages ... https://arxiv.org/abs/2505.13404

[15]

Nithin Rao Koluguri, Georgy Zelenfroind, Vitaly Lavrukhin, Jagadeesh Balam, and Boris Ginsburg. 2024. Investigating End-to-End ASR Architectures for Long Form Audio Transcription. In Proc. ICASSP. https://arxiv.org/abs/2309.09950

[16]

Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, and Daniel Povey. 2022. Pruned RNN-T for Fast, Memory-Efficient ASR Training. In Proc. Interspeech. https://arxiv.org/abs/2206.13236

[17]

Egor Lakomkin, Chunyang Wu, Yassir Fathullah, Ozlem Kalinli, Michael L. Seltzer, and Christian Fuegen. 2024. End-to-End Speech Recognition Contextualization with Large Language Models. In Proc. ICASSP, pages 12406-12410. https://arxiv.org/abs/2309.10917

[18]

Dan Liu, Mengge Du, Xiaoxi Li, Ya Li, and Enhong Chen. 2021. Cross Attention Augmented Transducer Networks for Simultaneous Translation. In Proc. EMNLP. https://aclanthology.org/2021.emnlp-main.4

[19]

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework. In Proc. ACL. https://aclanthology.org/P19-1289

[20]

Xutai Ma, Juan Pino, James Cross, Liezl Puzon, and Jiatao Gu. 2020a. Monotonic multihead attention. In ICLR.

[21]

Xutai Ma, Mohammad Javad Salameh, Ljiljana Majstorovic, Elena Meylan, Roldano Cattoni, Mattia A. Di Gangi, Sara Papi, Luisa Bentivogli, Marcello Federico, and Philipp Koehn. 2020b. SimulEval: An Evaluation Toolkit for Simultaneous Translation. In Proc. EMNLP (Demo). https://aclanthology.org/2020.emnlp-demos.19

[22]

Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, and Xie Chen. 2024. An Embarrassingly Simple Approach for LLM with Strong ASR Capacity. Preprint, arXiv:2402.08846. https://arxiv.org/abs/2402.08846

[23]

Iain McCowan, Jean Carletta, Wessel Kraaij, Simone Ashby, Samuel Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, and others. 2005. The AMI meeting corpus. In Proc. International Conference on Methods and Techniques in Behavioral Research.

[24]

Takafumi Moriya, Masato Mimura, Tomohiro Tanaka, Hiroshi Sato, Ryo Masumura, and Atsunori Ogawa. 2024. All-in-One ASR: Unifying Encoder-Decoder Models of CTC, attention, and transducer in dual-mode ASR. Preprint, arXiv:2512.11543. https://arxiv.org/abs/2512.11543

[25]

Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussà, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, and Emmanuel Dupoux. 2025. SpiRit-LM: Interleaved Spoken and Written Language Model. Transactions of the Associati... https://arxiv.org/abs/2402.05755

[26]

Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Nathaniel Macedo, and others. 2021. SPGI Speech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Re... https://arxiv.org/abs/2104.02014

[27]

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In Proc. ICASSP.

[28]

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proc. Interspeech. https://arxiv.org/abs/1904.08779

[29]

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. 2020. MLS: A Large-Scale Multilingual Dataset for Speech Research. In Proc. Interspeech. https://arxiv.org/abs/2012.03411

[30]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. In Proc. ICML. https://arxiv.org/abs/2212.04356

[31]

Dima Rekesh, Samuel Kriman, Somshubra Majumdar, Vahid Noroozi, Henry Juang, Oleksii Hrinchuk, Ankur Kumar, and Boris Ginsburg. 2023. Fast conformer with linearly scalable attention for efficient speech recognition. In ASRU. https://arxiv.org/abs/2305.05084

[32]

Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Liu, Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr Zelasko, and Miguel Jetté. 2022. Earnings-22: A Practical Benchmark for Accents in the Wild. Preprint, arXiv:2203.15591. https://arxiv.org/abs/2203.15591

[33]

Miguel Del Rio, Peter Ha, Quinten McNamara, Corey Miller, and Shipra Chandra. 2021. Earnings-21: A Practical Benchmark for ASR in the Wild. Preprint, arXiv:2104.11348. https://arxiv.org/abs/2104.11348

[34]

Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, and Chunyang Wu. 2024. Speech ReaLLM: Real-Time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time. In Proc. Interspeech. https://arxiv.org/abs/2406.09569

[35]

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proc. ACL. https://arxiv.org/abs/1508.07909

[36]

Silero Team. 2021. Silero VAD: Pre-trained Enterprise-Grade Voice Activity Detector. https://github.com/snakers4/silero-vad

[37]

Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Rao Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, and Sanchit Gandhi. 2025. Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation. Preprint,... https://arxiv.org/abs/2510.06961

[38]

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2024. SALMONN: Towards Generic Hearing Abilities for Large Language Models. In Proc. ICLR. https://arxiv.org/abs/2310.13289

[39]

Yun Tang, Eesung Kim, and Vijendra Raj Apsingekar. 2025. Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data. In Proc. Interspeech. https://arxiv.org/abs/2506.19159

[40]

Yun Tang, Anna Sun, Hirofumi Inaguma, Xinyue Chen, Ning Dong, Xutai Ma, Paden Tomasello, and Juan Pino. 2023. Hybrid Transducer and Attention Based Encoder-Decoder Modeling for Speech-to-Text Tasks. In Proc. ACL. https://arxiv.org/abs/2305.03101

[41]

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021. VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. I... https://aclanthology.org/2021.acl-long.80/ doi:10.18653/v1/2021.acl-long.80

[42]

Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul K. Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, and Yonghui Wu. 2023. SLM: Bridge the Thin Gap Between Speech and Text Foundation Models. ... https://arxiv.org/abs/2310.00230

[43]

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/Attention Architecture for End-to-End Speech Recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240-1253. https://doi.org/10.1109/JSTSP.2017.2763455

[44]

Felix Weninger, Marco Gaudesi, Md. Akmal Haidar, Nicola Ferri, Jesús Andrés-Ferrer, and Puming Zhan. 2022. Conformer with dual-mode chunked attention for joint online and offline ASR. In Interspeech.

[45]

Zhifei Xie and Changqiao Wu. 2024. Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming. Preprint, arXiv:2408.16725. https://arxiv.org/abs/2408.16725

[46]

Binbin Zhang, Di Wu, Zhuoyuan Yao, Xiong Wang, F. Yu, Chao Yang, Liyong Guo, Yaguang Hu, Lei Xie, and Xin Lei. 2020. Unified streaming and non-streaming two-pass end-to-end model for speech recognition. ArXiv, abs/2012.05481.

[47]

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. In Proc. ICASSP.

[48]

Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, and Daniel Povey. In Proc. Interspeech.

[49]

Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Khe Chai Sim, and Shankar Kumar. In Proc. ICASSP.
