arxiv: 2602.11298 · v3 · submitted 2026-02-11 · 💻 cs.AI
Voxtral Realtime
show 159 more authors
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
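The abstract's "explicit alignment between audio and text streams" at a fixed delay can be pictured with a small sketch. It is an illustration only, not the authors' implementation: it assumes both streams sit on a shared frame grid, that the frame duration is 80 ms (a value not stated in the abstract), and that empty text frames hold a padding token. The names `PAD`, `delay_in_frames`, and `delay_text_stream` are hypothetical.

```python
# Illustrative sketch only. The 80 ms frame duration, the PAD token, and both
# function names are assumptions; none of them come from the abstract.
PAD = "<pad>"


def delay_in_frames(delay_ms: int, frame_ms: int = 80) -> int:
    """Convert a delay in milliseconds into a whole number of frames."""
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    if delay_ms < 0 or delay_ms % frame_ms != 0:
        raise ValueError("delay_ms must be a non-negative multiple of frame_ms")
    return delay_ms // frame_ms


def delay_text_stream(text_frames: list[str], delay_frames: int) -> list[str]:
    """Shift a frame-aligned text stream later in time by delay_frames.

    The output is delay_frames longer than the input, so no token is dropped;
    the leading positions are filled with PAD.
    """
    if delay_frames < 0:
        raise ValueError("delay_frames must be non-negative")
    return [PAD] * delay_frames + list(text_frames)


if __name__ == "__main__":
    # One text token per frame; PAD marks frames where no word starts.
    text = ["hello", PAD, "world", PAD]
    frames = delay_in_frames(480)  # the 480 ms delay quoted in the abstract
    print(frames)
    print(delay_text_stream(text, frames))
```

Under these assumptions, the 480 ms delay corresponds to six frames, so each text token is emitted six frames after the audio frame it is aligned with. A larger delay gives the model more audio context before it must commit to a token, at the cost of latency.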
Forward citations
Cited by 1 Pith paper
Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.
- Tadabur: A Large-Scale Quran Audio Dataset
cs.SD 2026-04 unverdicted novelty 7.0
Tadabur is a large-scale Quran audio dataset with over 1400 hours from 600+ reciters to support speech research and benchmarks.