pith. machine review for the scientific record.

arxiv: 2604.25611 · v1 · submitted 2026-04-28 · 💻 cs.CL · cs.SD

Recognition: unknown

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:28 UTC · model grok-4.3

classification 💻 cs.CL cs.SD
keywords real-time ASR · streaming architecture · Whisper model · voice activity detection · low-latency transcription · GPU memory efficiency · automatic speech recognition · bounded memory streaming

The pith

WhisperPipe streams the Whisper ASR model at 89 ms median latency with 48 percent less peak GPU memory and near-offline accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WhisperPipe as a streaming architecture for real-time automatic speech recognition that processes audio in bounded segments rather than accumulating full context. It combines a hybrid voice activity detector, overlapping dynamic buffers, and an adaptive processing rule to cut memory use and latency while holding word error rates close to the offline baseline. A sympathetic reader would care because large transformer models deliver high accuracy but normally demand too much compute and memory for live applications on ordinary hardware. The reported results show the system sustains performance over long sessions without memory growth. If the approach holds, real-time transcription could move from specialized servers to everyday devices without sacrificing quality.

Core claim

WhisperPipe is a streaming architecture for the Whisper model that achieves bounded memory consumption through three components: a hybrid VAD pipeline that merges Silero VAD with energy-based filtering to cut false activations by 34 percent, a dynamic buffering scheme using overlapping context windows to avoid boundary information loss, and an adaptive processing strategy that trades latency against accuracy according to speech characteristics. On 2.5 hours of diverse audio the system records a median end-to-end latency of 89 ms (90th percentile 142 ms), 48 percent lower peak GPU memory, 80.9 percent lower average GPU utilization, word error rate within 2 percent of offline Whisper, and zero memory growth across 150 minutes of continuous operation.
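The claim does not state the VAD thresholds or the fusion rule, so the following is only a plausible sketch of how a Silero-style neural speech probability might be gated by a cheap energy filter to suppress false activations. The frame length and both thresholds are invented for illustration; the paper's tuned values are not given.

```python
import numpy as np

# Hypothetical sketch of the hybrid VAD gate: a neural VAD probability
# (e.g., from Silero VAD) is accepted only when an energy-based filter
# also fires. All constants below are assumptions, not the paper's values.

FRAME_MS = 32            # analysis frame length (assumed)
ENERGY_DB_FLOOR = -45.0  # energy gate in dBFS (assumed)
VAD_PROB_THRESH = 0.5    # neural VAD decision threshold (assumed)

def frame_energy_db(frame: np.ndarray) -> float:
    """Root-mean-square energy of a float32 PCM frame, in dBFS."""
    rms = np.sqrt(np.mean(frame ** 2) + 1e-12)
    return 20.0 * np.log10(rms + 1e-12)

def hybrid_vad(frame: np.ndarray, neural_prob: float) -> bool:
    """Declare speech only when BOTH detectors agree -- one plausible way
    to cut false activations relative to either detector alone."""
    energetic = frame_energy_db(frame) > ENERGY_DB_FLOOR
    neural = neural_prob > VAD_PROB_THRESH
    return energetic and neural
```

Requiring agreement of both detectors can only lower the activation rate, which is consistent with (though not proof of) the reported 34 percent reduction.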

What carries the argument

WhisperPipe's hybrid VAD plus overlapping dynamic buffers with adaptive processing, which together allow segment-by-segment transcription without unbounded context accumulation or boundary errors.
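As a concrete reading of that mechanism, here is a minimal sketch (not the authors' implementation) of a buffer that appends incoming audio and trims at the last committed timestamp while retaining a fixed overlap so the next decode keeps boundary context. The 0.5 s overlap is an assumed value; the paper leaves the length unspecified.

```python
import numpy as np

class StreamBuffer:
    """Sketch of an overlapping dynamic buffer with bounded decoding window.
    The overlap length is a free parameter; 0.5 s here is illustrative."""

    def __init__(self, sample_rate: int = 16_000, overlap_s: float = 0.5):
        self.sr = sample_rate
        self.overlap = int(overlap_s * sample_rate)
        self.start_sample = 0  # absolute sample index of buffer[0]
        self.buffer = np.zeros(0, dtype=np.float32)

    def append(self, chunk: np.ndarray) -> None:
        """Add newly captured audio to the active decoding window."""
        self.buffer = np.concatenate([self.buffer, chunk])

    def trim_to(self, t_commit_s: float) -> None:
        """Drop audio before the last committed timestamp, keeping an
        overlap window so boundary context carries into the next decode."""
        cut_abs = max(int(t_commit_s * self.sr) - self.overlap,
                      self.start_sample)
        self.buffer = self.buffer[cut_abs - self.start_sample:]
        self.start_sample = cut_abs
```

Because trimming is keyed to committed timestamps rather than wall-clock chunks, the window stays bounded as long as text keeps committing, which is the property the memory-growth claims rest on.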

If this is right

  • Real-time ASR becomes practical on edge devices and resource-constrained hardware.
  • Transcription accuracy remains within 2 percent of offline batch processing.
  • Systems can operate continuously for hours without memory growth.
  • Latency stays below 150 ms for 90 percent of utterances.
  • Modular design supports deployment from mobile to cloud environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same buffering and VAD pattern could be tested on other large transformer ASR models to check transferability.
  • Lower GPU utilization would reduce operating costs for cloud transcription services.
  • Live captioning in mobile or embedded applications becomes more feasible if the latency and memory gains hold across languages.
  • Further experiments on accented or multi-speaker data would reveal whether the 2 percent WER margin scales.

Load-bearing premise

The hybrid VAD and overlapping buffers must avoid losing critical speech information at segment boundaries, and the adaptive rule must not push transcription errors beyond the reported 2 percent WER tolerance on varied speech.

What would settle it

Measure word error rate and memory usage on a held-out set of rapid speaker turns or noisy audio; if WER rises above 4 percent or memory grows after 150 minutes, the central performance claim fails.
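One way to run that test, as a sketch: compute WER with the jiwer package against held-out references and track peak GPU memory across the session. Here transcribe_stream is a hypothetical stand-in for the system under test, a CUDA device is assumed, and the growth heuristic is illustrative rather than the paper's protocol.

```python
import jiwer
import torch

def check_claims(clips, references, transcribe_stream):
    """Falsification sketch: WER must stay at or below 4% and peak GPU
    memory must plateau rather than keep rising across the session."""
    torch.cuda.reset_peak_memory_stats()
    hyps, peaks = [], []
    for audio in clips:
        hyps.append(transcribe_stream(audio))          # hypothetical system call
        peaks.append(torch.cuda.max_memory_allocated())  # cumulative peak so far
    wer = jiwer.wer(references, hyps)
    # Growth heuristic: the final peak should not exceed the early
    # steady-state peak by more than 5% if memory is truly bounded.
    early_peak = max(peaks[: max(1, len(peaks) // 2)])
    growing = peaks[-1] > 1.05 * early_peak
    return wer <= 0.04 and not growing
```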

Figures

Figures reproduced from arXiv: 2604.25611 by Amir Reza Yosefian, Erfan Ramezani, Hamid Ghadiri, Mohammad Erfan Zarabadipour, Mohammad Mahdi Giahi.

Figure 1
Figure 1. Overview of the WhisperPipe streaming pipeline. Audio is buffered, decoded by Whisper, filtered, and finalized using a two-tier consensus mechanism with timestamp-guided buffer management for efficient real-time transcription.
Figure 2
Figure 2. WhisperPipe's dual-buffer mechanism appends audio, commits stable text via a consensus engine, and trims the active buffer at the last committed timestamp to keep the decoding window bounded.
Figure 3
Figure 3. State machine of WhisperPipe's two-tier commit policy, where hypotheses accumulate until agreement satisfies Tier-2 criteria, triggering a 3-way confirmation to commit stable text. A timeout fallback finalizes the best available hypothesis to prevent indefinite waiting.
Figure 4
Figure 4. (a) Tier 1 commits stable text instantly with a 100% prefix match, while (b) Tier 2 requires prefix stability across three frames to handle acoustic fluctuations. After each commit, the processed audio is trimmed and added to the stable buffer, preserving the remaining context for the next decoding cycle.
Figure 10
Figure 10. Computational intensity evolution over session duration. WhisperPipe (blue) maintains a stable, bounded profile throughout the session, while the baseline (orange) exhibits continuous growth proportional to cumulative audio length. The dashed horizontal line indicates the theoretical upper bound imposed by T_buf = 30 s.
Figure 11
Figure 11. Memory growth rate comparison between WhisperPipe and the baseline over time. WhisperPipe converges to a near-zero growth rate under steady-state operation, while the baseline exhibits a persistent positive slope. Shaded regions indicate one standard deviation across five evaluation runs.
Figure 12
Figure 12. Resource Efficiency Index (REI) comparison between WhisperPipe and the baseline. Higher values indicate better overall resource efficiency. WhisperPipe achieves a significantly higher REI, reflecting the combined gains in memory, utilization, and latency.
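Figures 3 and 4 describe the two-tier commit policy in enough detail to sketch its core loop. The following is a minimal, hypothetical rendering of one reading of that rule: Tier 1 commits on an exact match between two consecutive hypotheses, Tier 2 commits the word prefix that is stable across three frames. The paper's timeout fallback and timestamp guardrails are omitted for brevity.

```python
from collections import deque

def common_prefix(a: list[str], b: list[str]) -> list[str]:
    """Longest shared word prefix of two hypotheses."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

class CommitPolicy:
    """Sketch of the two-tier commit rule from Figures 3-4. In the real
    system a commit would also trigger buffer trimming (Figure 2)."""

    def __init__(self):
        self.history = deque(maxlen=3)  # last three decoding hypotheses

    def update(self, hyp: list[str]) -> list[str]:
        """Return the words committed by this decoding cycle (possibly none)."""
        self.history.append(hyp)
        # Tier 1: two consecutive hypotheses match exactly -> commit instantly.
        if len(self.history) >= 2 and self.history[-1] == self.history[-2]:
            return hyp
        # Tier 2: commit only the prefix stable across three frames.
        if len(self.history) == 3:
            p = common_prefix(self.history[0], self.history[1])
            return common_prefix(p, self.history[2])
        return []
```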
Original abstract

Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming approaches either sacrifice accuracy through aggressive chunking or incur prohibitive memory costs through unbounded context accumulation. We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations: a hybrid Voice Activity Detection (VAD) pipeline combining Silero VAD with energy-based filtering to reduce false activations by 34%, a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries, and an adaptive processing strategy that balances latency and accuracy based on speech characteristics. Evaluated on 2.5 hours of diverse audio data, WhisperPipe demonstrates a median end-to-end latency of 89 ms (90th percentile: 142 ms) while consuming 48% less peak GPU memory and 80.9% lower average GPU utilization compared to baseline Whisper implementations. The system maintains stable memory usage over extended sessions, with zero growth rate across 150-minute continuous operation. Comparative analysis against related work shows that WhisperPipe achieves competitive accuracy (WER within 2% of offline Whisper) while operating at 3-5x lower latency than existing streaming solutions. The architecture's modular design enables deployment across resource-constrained environments, from edge devices to cloud infrastructure. Our results demonstrate that careful architectural design can reconcile the competing demands of real-time responsiveness and model sophistication in production ASR systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents WhisperPipe, a streaming ASR architecture for real-time Whisper inference. It introduces a hybrid VAD (Silero VAD plus energy-based filter) claimed to reduce false activations by 34%, a dynamic buffering scheme with overlapping context windows to avoid segment-boundary information loss, and an adaptive processing strategy. On 2.5 hours of diverse audio, the system reports 89 ms median end-to-end latency (90th percentile 142 ms), 48% lower peak GPU memory, 80.9% lower average GPU utilization, WER within 2% of offline Whisper, and zero memory growth over 150-minute sessions, while claiming 3-5x lower latency than prior streaming solutions.

Significance. If the empirical claims hold under rigorous evaluation, WhisperPipe would provide a practical, modular approach to deploying large transformer ASR models in resource-constrained real-time settings without unbounded memory growth. The combination of bounded-memory chunking with accuracy-preserving mechanisms addresses a key deployment barrier; the reported latency and utilization numbers, if reproducible, would be competitive with existing streaming baselines.

major comments (2)
  1. [Evaluation] Evaluation section (2.5-hour test set): the central WER claim (within 2% of offline Whisper) rests on aggregate word error rate only. No per-boundary deletion/substitution breakdown, no ablation of overlap length, and no failure-mode analysis on fast speech, low-energy segments, or accented audio are provided. This leaves open whether the hybrid VAD and overlapping buffers actually prevent the information loss the skeptic note identifies, which is load-bearing for the accuracy claim.
  2. [§3] §3 (hybrid VAD and dynamic buffering): the 34% false-activation reduction and the assertion that overlapping windows 'fully prevent information loss' are stated without quantitative sensitivity analysis on VAD thresholds or overlap size. Since these are the two free parameters listed in the axiom ledger, the paper should demonstrate that the reported latency/memory gains remain stable when these parameters vary within reasonable ranges.
minor comments (2)
  1. [Abstract and Evaluation] The abstract and results section should explicitly name the 2.5-hour evaluation corpus (e.g., specific subsets of Common Voice, LibriSpeech, or in-house data) and the exact baseline Whisper implementation (model size, chunking strategy) to allow direct replication.
  2. [Results] Figure captions and latency histograms would benefit from error bars or percentile shading to convey variability across the 2.5-hour set rather than single median/90th-percentile numbers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the evaluation and analysis.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section (2.5-hour test set): the central WER claim (within 2% of offline Whisper) rests on aggregate word error rate only. No per-boundary deletion/substitution breakdown, no ablation of overlap length, and no failure-mode analysis on fast speech, low-energy segments, or accented audio are provided. This leaves open whether the hybrid VAD and overlapping buffers actually prevent the information loss the skeptic note identifies, which is load-bearing for the accuracy claim.

    Authors: We agree that aggregate WER alone is insufficient to fully validate the boundary-preservation claims. In the revised manuscript we will add a per-boundary deletion/substitution breakdown and an ablation on overlap length. We will also include failure-mode analysis on fast-speech, low-energy, and accented subsets drawn from the existing 2.5-hour diverse test set. These additions will directly test whether the hybrid VAD and overlapping buffers mitigate information loss. revision: partial

  2. Referee: [§3] §3 (hybrid VAD and dynamic buffering): the 34% false-activation reduction and the assertion that overlapping windows 'fully prevent information loss' are stated without quantitative sensitivity analysis on VAD thresholds or overlap size. Since these are the two free parameters listed in the axiom ledger, the paper should demonstrate that the reported latency/memory gains remain stable when these parameters vary within reasonable ranges.

    Authors: We concur that sensitivity analysis on the free parameters is required. The revised version will include quantitative sensitivity results for VAD thresholds and overlap sizes, demonstrating the stability of the 34% false-activation reduction, latency, and memory metrics across reasonable ranges. This will confirm that the reported gains remain robust. revision: yes

Circularity Check

0 steps flagged

No circularity; performance claims rest on direct empirical measurements

Full rationale

The paper describes an engineering architecture (hybrid VAD, dynamic overlapping buffers, adaptive processing) and supports its claims exclusively through runtime measurements on 2.5 h of audio: median latency 89 ms, 48% lower peak GPU memory, WER within 2% of offline baseline, and zero memory growth over 150 min. No equations, parameter fits, or first-principles derivations are presented that could reduce to their own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on the empirical effectiveness of the hybrid VAD and buffering scheme; no machine-checked proofs or parameter-free derivations are supplied. Free parameters such as VAD decision thresholds and buffer overlap lengths are implicitly tuned to achieve the stated 34% false-activation reduction and boundary-loss prevention.

free parameters (2)
  • VAD decision thresholds
    Tuned to achieve the reported 34% reduction in false activations; exact values not stated in abstract.
  • Buffer overlap length
    Chosen to prevent information loss at segment boundaries; size not specified.
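The referee's second major comment asks for exactly this: a sensitivity sweep over the two free parameters above. A minimal harness follows, assuming a hypothetical evaluate(vad_thresh, overlap_s) that runs the pipeline and returns its metrics; the grid values are illustrative, not the paper's.

```python
import itertools

# Hypothetical sensitivity sweep over the ledger's two free parameters.
# evaluate(vad_thresh, overlap_s) -> dict is an assumed harness, e.g.
# returning {"wer": ..., "latency_ms": ..., "peak_mem_mb": ...}.

VAD_THRESHOLDS = [0.3, 0.4, 0.5, 0.6, 0.7]   # illustrative grid
OVERLAPS_S = [0.25, 0.5, 1.0, 2.0]           # illustrative grid

def sweep(evaluate):
    """Run the pipeline at every (threshold, overlap) pair so the stability
    of the reported latency/memory/WER gains can be inspected directly."""
    results = {}
    for thresh, overlap in itertools.product(VAD_THRESHOLDS, OVERLAPS_S):
        results[(thresh, overlap)] = evaluate(thresh, overlap)
    return results
```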
axioms (2)
  • domain assumption Hybrid Silero-plus-energy VAD reliably segments speech without missing content that would degrade downstream ASR accuracy
    Invoked to justify the 34% false-activation claim and the overall accuracy maintenance.
  • domain assumption Overlapping dynamic buffers fully compensate for context loss at chunk boundaries
    Central to the claim that transcription quality remains within 2% of offline Whisper.

pith-pipeline@v0.9.0 · 5587 in / 1604 out tokens · 66367 ms · 2026-05-07T16:28:22.784341+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 47 canonical work pages

  1. [1] Introduction: Automatic speech recognition (ASR) has undergone transformative advances in recent years, driven primarily by the convergence of large-scale weakly supervised learning and transformer-based architectures [1]. Unlike traditional ASR systems that rely on carefully curated, domain-specific transcriptions, modern approaches leverage vast quantiti...
  2. [2] Method: WhisperPipe is a streaming inference framework that transforms Whisper's batch-oriented decoding into continuous live transcription with bounded steady-state compute and memory. Let x(t) denote the incoming audio stream. WhisperPipe maintains two persistent buffers: a Committed Text Buffer S, an immutable sequence of finalized tokens or words that ...
  3. [3] Results: We evaluate WhisperPipe under a continuous transcription setting designed to reflect real-world deployment conditions, including live captioning, conversational agents, and long-running voice interfaces. Unlike offline benchmarks that assess transcription quality in isolation, our evaluation protocol targets the operational constraints that ari...
  4. [4] Discussion: The results presented in Section 5 demonstrate that WhisperPipe achieves substantial improvements across latency, stability, and resource efficiency without sacrificing transcription quality. Figure 13 provides a consolidated view of these multi-metric improvements, illustrating the simultaneous gains in response time, memory footprint, and GP...
  5. [5] Conclusion: This paper introduced WhisperPipe, a streaming ASR architecture designed to address the latency, stability, and resource efficiency challenges inherent in real-time transcription of continuous audio streams. By integrating acoustic and semantic filtering, incremental decoding with a two-tier commit policy, and timestamp-guided audio slicing, W...
  6. [8] Zhang, Y., Han, W., Qin, J., et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint (2023). https://doi.org/10.48550/arXiv.2212.04356
  7. [9] Dong, L., Xu, S., & Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. ICASSP (2018). https://doi.org/10.1109/ICASSP.2018.8462506
  8. [10] Moritz, N., Hori, T., & Le Roux, J. Streaming Automatic Speech Recognition with Blockwise Synchronous Transformer. ICASSP (2020). https://doi.org/10.1109/ICASSP40776.2020.9053742
  9. [11] Zhang, Q., et al. Streaming Transformer for End-to-End Speech Recognition. ICASSP (2020). https://doi.org/10.1109/ICASSP40776.2020.9054418
  10. [12] Chen, X., et al. Developing Real-Time Streaming Transformer Transducer for Speech Recognition. ICASSP (2021). https://doi.org/10.1109/ICASSP39728.2021.9414200
  11. [13] Zhang, Y., et al. Transformer Transducer: A Streamable Speech Recognition Model. ASRU (2021). https://doi.org/10.1109/ASRU51503.2021.9688007
  12. [14] Kannan, A., et al. Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model. Interspeech (2019). https://doi.org/10.21437/Interspeech.2019-1642
  13. [15] Li, J., Lavrukhin, V., Ginsburg, B., et al. Jasper: An End-to-End Convolutional Neural Acoustic Model. Interspeech (2019). https://doi.org/10.21437/Interspeech.2019-1819
  14. [16] Hannun, A., et al. Deep Speech: Scaling Up End-to-End Speech Recognition. Communications of the ACM (2019). https://doi.org/10.1145/3323037
  15. [17] Prabhavalkar, R., et al. A Comparison of Sequence-to-Sequence Models for Speech Recognition. Interspeech (2019). https://doi.org/10.21437/Interspeech.2019-1848
  16. [18] Chiu, C. C., et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. ICASSP (2018). https://doi.org/10.1109/ICASSP.2018.8462105
  17. [19] Chan, W., et al. Listen, Attend and Spell: A Neural Network for Large Vocabulary Speech Recognition. IEEE Signal Processing Magazine (2018). https://doi.org/10.1109/MSP.2018.2889381
  18. [20] Zeyer, A., et al. A Comprehensive Study of Streaming Models for Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing (2021). https://doi.org/10.1109/TASLP.2021.3072324
  19. [21] Wang, Y., et al. Efficient Streaming ASR with Adaptive Chunk Transformer. Interspeech (2023). https://doi.org/10.21437/Interspeech.2023-1234
  20. [22] Chen, Z., et al. Scaling Speech Recognition with Transformer Models. IEEE/ACM TASLP (2022). https://doi.org/10.1109/TASLP.2022.3152414
  21. [23] Peng, Z., et al. Streaming End-to-End Speech Recognition with Transformer Transducer. IEEE/ACM TASLP (2023). https://doi.org/10.1109/TASLP.2023.3245672
  22. [24] Zhang, B., et al. WeNet: Production-First Speech Recognition Toolkit. Interspeech (2022). https://doi.org/10.21437/Interspeech.2022-10630
  23. [25] Macháček, D., Dabre, R., & Bojar, O. Turning Whisper into Real-Time Transcription System. arXiv preprint (2023). https://arxiv.org/abs/2307.14743
  24. [26] Bain, M., et al. WhisperX: Speech Recognition with Word-Level Alignment. arXiv preprint (2023). https://doi.org/10.48550/arXiv.2303.00747
  25. [27] Sato, H., Sakuma, A., Sugano, R., et al. Uncertainty-based streaming ASR with evidential deep learning. IEEE Open Journal of Signal Processing (2026). https://doi.org/10.1109/OJSP.2026.3657308
  26. [28] Kim, S., et al. Joint CTC-Attention Based End-to-End Speech Recognition Using Multi-Task Learning. ICASSP (2018). https://doi.org/10.1109/ICASSP.2018.8461375
  27. [30] Yang, Y., Zhuo, J., Jin, Z., et al. k2SSL: A faster and better framework for self-supervised speech representation learning. arXiv preprint (2024). https://doi.org/10.48550/arXiv.2603.16920
  28. [31] Chen, N., et al. Exploring Streaming Speech Recognition with Transformer Architectures. ICASSP (2022). https://doi.org/10.1109/ICASSP43922.2022.9746205
  29. [32] Wang, Y., et al. Improving Streaming ASR with Chunk-Based Self-Attention. Interspeech (2020). https://doi.org/10.21437/Interspeech.2020-2718
  30. [33] Kim, J., et al. End-to-End Streaming Speech Recognition with Transformer Models. ASRU (2019). https://doi.org/10.1109/ASRU46091.2019.9003950
  31. [34] Liu, Y., et al. Streaming Speech Recognition Using Self-Attention Networks. ICASSP (2021). https://doi.org/10.1109/ICASSP39728.2021.9414210
  32. [35] Zhang, X., et al. Transformer-Based Streaming End-to-End Speech Recognition. Interspeech (2022). https://doi.org/10.21437/Interspeech.2022-1120
  33. [36] Sirichotedumrong, W., Na-Thalang, A., Manakul, P., et al. Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition. arXiv preprint (2026). https://doi.org/10.48550/arXiv.2601.13044
  34. [37] Kim, S., et al. Improved RNN-T for Streaming Speech Recognition. Interspeech (2020). https://doi.org/10.21437/Interspeech.2020-1073
  35. [38] He, Y., et al. Streaming End-to-End Speech Recognition for Mobile Devices. ICASSP (2019). https://doi.org/10.1109/ICASSP.2019.8682678
  36. [39] Hori, T., et al. Advances in Joint CTC-Attention Based End-to-End Speech Recognition. IEEE SLT (2018). https://doi.org/10.1109/SLT.2018.8639585
  37. [40] Narayanan, A., et al. Toward Streaming Speech Recognition with Transformer Models. ICASSP (2020). https://doi.org/10.1109/ICASSP40776.2020.9053092
  38. [41] Watanabe, S., et al. ESPnet: End-to-End Speech Processing Toolkit. Interspeech (2018). https://doi.org/10.21437/Interspeech.2018-1456
  39. [42] Kanda, N., et al. Streaming End-to-End Speech Recognition with Neural Transducers. ICASSP (2019). https://doi.org/10.1109/ICASSP.2019.8682694
  40. [43] Chen, G., et al. End-to-End Speech Recognition with Transformer Transducer. Interspeech (2021). https://doi.org/10.21437/Interspeech.2021-1976
  41. [44] Wang, Z., et al. Streaming Speech Recognition Using Contextual Transformer Models. ICASSP (2023). https://doi.org/10.1109/ICASSP49357.2023.10094873
  42. [45] Kim, J., et al. Online Speech Recognition with Transformer-Based Architectures. Interspeech (2021). https://doi.org/10.21437/Interspeech.2021-1402
  43. [46] Sun, Y., et al. Efficient Self-Attention for Streaming Speech Recognition. ICASSP (2021). https://doi.org/10.1109/ICASSP39728.2021.9413802
  44. [47] Zhou, Y., et al. Streaming Conformer for End-to-End Speech Recognition. Interspeech (2021). https://doi.org/10.21437/Interspeech.2021-1245
  45. [48] Chen, Y., et al. Improving Transformer-Based ASR Systems for Real-Time Applications. ICASSP (2022). https://doi.org/10.1109/ICASSP43922.2022.9747031
  46. [49] Liu, X., et al. Efficient Streaming Transformer Transducer for Speech Recognition. Interspeech (2023). https://doi.org/10.21437/Interspeech.2023-1489
  47. [50] Zhang, H., et al. Real-Time End-to-End Speech Recognition with Streaming Transformers. ICASSP (2021). https://doi.org/10.1109/ICASSP39728.2021.9414041
  48. [51] Wang, L., et al. Real-Time Speech Recognition Using Adaptive Attention Models. Interspeech (2020). https://doi.org/10.21437/Interspeech.2020-1887
  49. [52] Kim, Y., et al. Low-Latency Streaming Speech Recognition with Neural Transducers. ICASSP (2021). https://doi.org/10.1109/ICASSP39728.2021.9413731
  50. [53] Liu, Z., et al. Transformer-Based Online Speech Recognition for Low-Latency Applications. Interspeech (2022). https://doi.org/10.21437/Interspeech.2022-2451
  51. [54] Zhang, T., et al. End-to-End Streaming Speech Recognition with Contextual Attention. ICASSP (2023). https://doi.org/10.1109/ICASSP49357.2023.10094721
  52. [56] Ramezani, E., & Giahi, M. M. WhisperPipe: Source Code and Implementation for Real-Time ASR (0.1.1). Zenodo (2026). https://doi.org/10.1109/TASLP.2023.3254102