pith. machine review for the scientific record.

arxiv: 2604.14493 · v2 · submitted 2026-04-16 · 💻 cs.AI

Recognition: unknown

Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

David Fan, Kunal Vaishnavi, Meng Tang, Nenad Banfic, Rui Ren, Sam Kemp, Sayan Shaw, Sunghoon Choi

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:00 UTC · model grok-4.3

classification 💻 cs.AI
keywords streaming ASR · on-device inference · model quantization · word error rate · ONNX Runtime · Nemotron · low-latency · edge devices

The pith

Re-implementing and int4-quantizing Nemotron yields a 0.67 GB streaming ASR model with 8.20 percent WER and 0.56 second CPU latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates over fifty configurations of encoder-decoder, transducer, and LLM-based ASR systems across batch, chunked, and streaming modes to find the best candidate for resource-constrained English on-device use. It selects NVIDIA Nemotron Speech Streaming, reimplements its full streaming pipeline in ONNX Runtime, and tests multiple post-training quantization methods including importance-weighted k-quant plus graph fusion. These steps shrink the model from 2.47 GB to 0.67 GB while keeping word error rate within one percent of the original full-precision baseline. The resulting int4 k-quant version delivers 8.20 percent average streaming WER over eight benchmarks and runs faster than real time, which matters because it demonstrates a practical route to high-quality speech recognition on ordinary CPUs without GPUs.
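A streaming pipeline of the kind described consumes audio in fixed-size chunks while carrying earlier audio forward as left context. As a rough illustration of that framing logic (the chunk and context durations below are hypothetical, not the paper's actual configuration):

```python
import numpy as np

def chunk_stream(audio, sr=16000, chunk_s=0.56, left_context_s=1.12):
    """Yield (left_context, chunk) windows over a mono audio array,
    mimicking how a streaming encoder consumes fixed-size chunks
    while carrying earlier audio forward as left context."""
    chunk = int(chunk_s * sr)
    ctx = int(left_context_s * sr)
    for start in range(0, len(audio), chunk):
        lo = max(0, start - ctx)
        yield audio[lo:start], audio[start:start + chunk]

# 3 seconds of audio with 0.56 s chunks -> 6 chunks, the last one partial
audio = np.zeros(3 * 16000, dtype=np.float32)
chunks = list(chunk_stream(audio))
```

In a real deployment the encoder would also carry recurrent or attention cache state between chunks, which is exactly the state handling the review flags as sensitive in the ONNX re-implementation.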

Core claim

After systematic comparison the authors establish that NVIDIA Nemotron Speech Streaming, once reimplemented in ONNX Runtime and optimized with int4 k-quantization plus operator fusion, reduces model size to 0.67 GB and achieves an average streaming word error rate of 8.20 percent across eight standard benchmarks while maintaining 0.56 seconds algorithmic latency and running comfortably faster than real time on CPU.

What carries the argument

The int4 k-quant variant of the ONNX Runtime streaming inference pipeline, which applies importance-weighted quantization and graph-level operator fusion to the Nemotron transducer.

If this is right

  • High-accuracy streaming ASR becomes feasible on standard CPUs without GPU support.
  • Model memory footprint can be cut by more than 70 percent with less than 1 percent absolute WER increase.
  • The eight-benchmark average of 8.20 percent WER sets a concrete quality target for future on-device systems.
  • Quantization combined with operator fusion preserves transducer accuracy better than round-to-nearest alone.
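The last bullet contrasts round-to-nearest with importance-weighted schemes. A toy int4 group quantizer shows what plain round-to-nearest looks like (the group size is illustrative, and this is not the paper's k-quant implementation, which additionally weights rounding by parameter importance):

```python
import numpy as np

def quantize_int4_groups(w, group=32):
    """Symmetric int4 round-to-nearest quantization, one scale per group.
    Returns the dequantized weights, i.e. what the model computes with."""
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 grid: -7..7
    scale[scale == 0] = 1.0                             # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -7, 7)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
w_dq = quantize_int4_groups(w)
err = np.abs(w - w_dq).max()  # bounded by half the largest group scale
```

Per-group scaling keeps the worst-case rounding error at half a quantization step of the largest group, which is why outlier-aware grouping and importance weighting matter at 4 bits.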

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar re-implementation and quantization steps could be tested on other transducer or LLM-based ASR models to locate additional Pareto improvements.
  • The low algorithmic latency opens the possibility of tighter integration with on-device language models for end-to-end voice agents.
  • If the same pipeline scales to non-English languages, the approach would directly address multilingual on-device ASR gaps.

Load-bearing premise

The ONNX Runtime re-implementation exactly reproduces the original PyTorch model's streaming behavior under all tested conditions.

What would settle it

A side-by-side run of the original PyTorch model and the ONNX version on identical streaming audio: a word error rate gap larger than one percent absolute would undermine the premise, while a near-zero gap would confirm the re-implementation is faithful.
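Before any WER comparison, such a check can be run at the output level on unquantized models; a sketch of a logit-level parity test (the tolerance is arbitrary, and real runs would feed both backends the same chunked audio and cached state):

```python
import numpy as np

def implementations_match(logits_a, logits_b, atol=1e-3):
    """Parity check between two inference backends: per-frame argmax
    tokens must agree and raw logits must be numerically close."""
    same_tokens = np.array_equal(logits_a.argmax(-1), logits_b.argmax(-1))
    close = np.allclose(logits_a, logits_b, atol=atol)
    return same_tokens and close

# identical outputs pass; a large perturbation fails the closeness check
a = np.random.default_rng(1).normal(size=(100, 128)).astype(np.float32)
assert implementations_match(a, a + 1e-5)
assert not implementations_match(a, a + 0.5)
```

Token-level agreement is the stricter signal for transducers, since small logit drift that never flips an argmax leaves the decoded transcript unchanged.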

Figures

Figures reproduced from arXiv: 2604.14493 by David Fan, Kunal Vaishnavi, Meng Tang, Nenad Banfic, Rui Ren, Sam Kemp, Sayan Shaw, Sunghoon Choi.

Figure 1. Parakeet TDT-0.6B-v3: WER vs. total context window across 18 chunking config… [PITH_FULL_IMAGE:figures/full_fig_p005_1.png]
Figure 2. Nemotron-0.6B: Delay vs. WER trade-off across streaming configurations. The con… [PITH_FULL_IMAGE:figures/full_fig_p006_2.png]
Figure 3. Model size vs. WER for Nemotron quantization variants. The int4 k-quant variant… [PITH_FULL_IMAGE:figures/full_fig_p010_3.png]
Figure 4. CPU RTFx vs. WER. All ONNX variants achieve RTFx… [PITH_FULL_IMAGE:figures/full_fig_p011_4.png]
Figure 5. Effective latency vs. algorithmic delay for Nemotron-0.6B int4 k-quant on CPU. The… [PITH_FULL_IMAGE:figures/full_fig_p012_5.png]
read the original abstract

Deploying high-quality automatic speech recognition (ASR) on edge devices requires models that jointly optimize accuracy, latency, and memory footprint while operating entirely on CPU without GPU acceleration. We conduct a systematic empirical study of state-of-the-art ASR architectures, encompassing encoder-decoder, transducer, and LLM-based paradigms, evaluated across batch, chunked, and streaming inference modes. Through a comprehensive benchmark of over 50 configurations spanning OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR, we identify NVIDIA's Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. We then re-implement the complete streaming inference pipeline in ONNX Runtime and conduct a controlled evaluation of multiple post-training quantization strategies, including importance-weighted k-quant, mixed-precision schemes, and round-to-nearest quantization, combined with graph-level operator fusion. These optimizations reduce the model from 2.47 GB to as little as 0.67 GB while maintaining word error rate (WER) within 1% absolute of the full-precision PyTorch baseline. Our recommended configuration, the int4 k-quant variant, achieves 8.20% average streaming WER across eight standard benchmarks, running comfortably faster than real-time on CPU with 0.56 s algorithmic latency, establishing a new quality-efficiency Pareto point for on-device streaming ASR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper benchmarks over 50 configurations of ASR models (Whisper, Nemotron, Parakeet TDT, Canary, Conformer Transducer, Qwen3-ASR) in batch, chunked, and streaming modes on CPU. It selects NVIDIA Nemotron Speech Streaming as the strongest candidate, re-implements its full streaming inference pipeline in ONNX Runtime, applies post-training quantization (importance-weighted k-quant, mixed-precision, round-to-nearest) plus operator fusion, and reports that the int4 k-quant variant reduces size from 2.47 GB to 0.67 GB while achieving 8.20% average streaming WER across eight standard benchmarks, 0.56 s algorithmic latency, and performance within 1% absolute WER of the full-precision PyTorch baseline, establishing a new quality-efficiency Pareto frontier for on-device streaming ASR.

Significance. If the ONNX re-implementation faithfully reproduces PyTorch behavior, the work supplies a concrete, reproducible reference point for efficient on-device English streaming ASR with measured trade-offs in WER, size, and latency. The systematic architecture comparison and controlled quantization study would be useful for practitioners deploying ASR on edge hardware.

major comments (1)
  1. [ONNX re-implementation and quantization evaluation sections] The equivalence between the original PyTorch Nemotron streaming model and the authors' ONNX Runtime re-implementation is not demonstrated with side-by-side WER or token-level metrics on the eight benchmarks prior to quantization. Streaming transducers are sensitive to state carry-over, chunk overlap, and operator ordering; without this verification, the reported 1% WER delta cannot be confidently attributed to the int4 k-quant and fusion steps rather than implementation differences. This verification is load-bearing for the central claim in the abstract and results.
minor comments (3)
  1. [Results] The abstract and results do not report error bars, standard deviations, or statistical significance tests on the WER numbers across the eight benchmarks or across multiple runs.
  2. [Methods] Exact quantization parameters (e.g., calibration dataset size, importance weighting details, bit allocation in mixed-precision) and the precise definition of 'algorithmic latency' are not fully specified, limiting reproducibility.
  3. [Experimental setup] The selection criteria and characteristics of the eight 'standard benchmarks' are not detailed; it is unclear how representative they are of real-world streaming conditions with varying utterance lengths and noise.
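On minor point 2: in chunked streaming ASR, "algorithmic latency" is usually the structural delay implied by chunk length plus lookahead, independent of compute speed. A hedged illustration of the definition the referee asks the authors to pin down (the chunk and lookahead values here are hypothetical, chosen only so the sum matches the reported 0.56 s):

```python
def algorithmic_latency_s(chunk_frames, lookahead_frames, frame_shift_s=0.01):
    """Structural latency of a chunked streaming encoder: how far into the
    future a frame's output can depend on, independent of hardware speed.
    Effective latency adds per-chunk compute time on top of this."""
    return (chunk_frames + lookahead_frames) * frame_shift_s

# hypothetical configuration: 480 ms chunk + 80 ms lookahead = 0.56 s
lat = algorithmic_latency_s(chunk_frames=48, lookahead_frames=8)
```

Making the decomposition explicit would also clarify Figure 5's distinction between effective latency and algorithmic delay.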

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We have carefully considered the major concern regarding the verification of our ONNX re-implementation and provide our response below. We will make the necessary revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [ONNX re-implementation and quantization evaluation sections] The equivalence between the original PyTorch Nemotron streaming model and the authors' ONNX Runtime re-implementation is not demonstrated with side-by-side WER or token-level metrics on the eight benchmarks prior to quantization. Streaming transducers are sensitive to state carry-over, chunk overlap, and operator ordering; without this verification, the reported 1% WER delta cannot be confidently attributed to the int4 k-quant and fusion steps rather than implementation differences. This verification is load-bearing for the central claim in the abstract and results.

    Authors: We acknowledge that the equivalence between the PyTorch and ONNX implementations was not explicitly demonstrated with side-by-side metrics in the submitted manuscript. Our ONNX re-implementation was intended to faithfully reproduce the PyTorch streaming behavior, but to address this valid point, we will add a comparison table in the revised manuscript showing the WER results for the full-precision models on all eight benchmarks. This will allow readers to verify that any observed differences post-quantization are due to the quantization process rather than implementation discrepancies. We believe this addition will resolve the concern and support the central claims. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical benchmarking and optimization results

full rationale

The paper conducts an empirical study: it benchmarks over 50 configurations of existing ASR models (Whisper, Nemotron, etc.) across inference modes, selects Nemotron as strongest, re-implements its streaming pipeline in ONNX Runtime, applies post-training quantization variants, and reports measured WER, latency, and size on eight benchmarks. No equations, derivations, fitted parameters, or predictions are defined in terms of themselves. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. All headline claims (8.20% WER, 0.56 s latency, size reduction to 0.67 GB) are direct experimental outcomes, not reductions to inputs by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical measurements after applying standard quantization to a pre-trained model. No new physical entities or unproven mathematical axioms are introduced; the main assumptions concern faithful re-implementation and benchmark representativeness.

free parameters (1)
  • quantization scheme and bit-width
    The choice of int4 k-quant among several tested strategies is selected to achieve the reported size-accuracy trade-off.
axioms (1)
  • domain assumption: The ONNX Runtime streaming pipeline faithfully reproduces the original PyTorch model's inference outputs and latency characteristics.
    This assumption is required for the quantization results to be directly comparable to the full-precision baseline.

pith-pipeline@v0.9.0 · 5583 in / 1302 out tokens · 44358 ms · 2026-05-10T12:00:46.525405+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

15 extracted references · 5 canonical work pages · 3 internal anchors

  1. [1]

    Robust Speech Recognition via Large-Scale Weak Supervision

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, https://arxiv.org/abs/2212.04356, 2023

  2. [2]

    NVIDIA NeMo: a toolkit for building new AI models

    NVIDIA NeMo, “NVIDIA NeMo: a toolkit for building new AI models,” https://github.com/NVIDIA/NeMo, 2024

  3. [3]

    Qwen3-ASR Technical Report

    Qwen Team, “Qwen3-ASR Technical Report,” https://arxiv.org/abs/2601.21337, 2026

  4. [4]

    Olive: a hardware-aware model optimization tool for ONNX models

    Microsoft, “Olive: a hardware-aware model optimization tool for ONNX models,” https://github.com/microsoft/Olive, 2024

  5. [5]

    Faster-Whisper: faster inference for OpenAI’s Whisper using CTranslate2

    SYSTRAN, “Faster-Whisper: faster inference for OpenAI’s Whisper using CTranslate2,” https://github.com/SYSTRAN/faster-whisper, 2024

  6. [6]

    Parakeet TDT-0.6B-v3

    NVIDIA, “Parakeet TDT-0.6B-v3,” https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3, 2025

  7. [7]

    Conformer Transducer XL

    NVIDIA, “Conformer Transducer XL,” https://huggingface.co/nvidia/stt_en_conformer_transducer_xlarge, 2022

  8. [8]

    Nemotron Speech Streaming

    NVIDIA, “Nemotron Speech Streaming,” https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b, 2025

  9. [9]

    Open ASR Leaderboard

    Hugging Face, “Open ASR Leaderboard,” https://huggingface.co/spaces/hf-audio/open_asr_leaderboard, 2024

  10. [10]

    ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

    ONNX Runtime Contributors, “ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator,” https://onnxruntime.ai/, 2024

  11. [11]

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

    J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han, “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” https://arxiv.org/abs/2306.00978, 2023

  12. [12]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,” https://arxiv.org/abs/2210.17323, 2022

  13. [13]

    2-bit Conformer Quantization for Automatic Speech Recognition

    J. Yu, Y. Park, and S. Watanabe, “2-bit Conformer Quantization for Automatic Speech Recognition,” https://arxiv.org/abs/2305.16619, 2023

  14. [14]

    Voxtral

    Mistral AI Team, “Voxtral,” https://arxiv.org/abs/2507.13264, 2025

  15. [15]

    Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications

    M. Kudlur, E. King, J. Wang, and P. Warden, “Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications,” https://arxiv.org/abs/2602.12241, 2026