Qwen3-ASR Technical Report
Pith reviewed 2026-05-13 13:55 UTC · model grok-4.3
The pith
Qwen3-ASR-1.7B matches proprietary APIs on multilingual speech recognition while the 0.6B version maximizes efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen3-ASR-1.7B achieves state-of-the-art results among open-source ASR models and remains competitive with the strongest proprietary APIs across 52 languages, while Qwen3-ASR-0.6B delivers the best accuracy-efficiency trade-off and Qwen3-ForcedAligner-0.6B outperforms prior forced-alignment systems in both accuracy and speed.
What carries the argument
The Qwen3-ASR models themselves: all-in-one speech recognition systems that directly leverage the audio understanding capabilities of the Qwen3-Omni foundation model together with large-scale speech training data.
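To make the transfer pattern concrete, here is a minimal sketch of reusing a pretrained audio-understanding encoder under a new ASR head. All module shapes and names below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

D, VOCAB = 256, 8000  # illustrative sizes, not Qwen3's

class AudioEncoder(nn.Module):
    """Stand-in for a foundation model's pretrained audio encoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, D)  # 80-dim log-mel frames -> hidden
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, mel):            # mel: [B, T, 80]
        return self.enc(self.proj(mel))

class ASRModel(nn.Module):
    """Foundation encoder reused; only the ASR head is new."""
    def __init__(self, pretrained_encoder):
        super().__init__()
        self.encoder = pretrained_encoder       # weights carried over
        self.lm_head = nn.Linear(D, VOCAB)      # trained on speech data

    def forward(self, mel):
        return self.lm_head(self.encoder(mel))  # [B, T, VOCAB] frame logits

encoder = AudioEncoder()  # in practice: load foundation-model weights here
asr = ASRModel(encoder)
print(asr(torch.randn(2, 100, 80)).shape)       # torch.Size([2, 100, 8000])
```

In the actual system the audio-understanding weights come from Qwen3-Omni and the whole stack is then trained on large-scale speech data; the sketch only shows where the transfer happens.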
If this is right
- Open-source ASR can reach parity with closed commercial systems for broad multilingual coverage without requiring users to pay for API access.
- Smaller models with sub-100ms first-token latency enable high-concurrency transcription workloads on modest hardware.
- Non-autoregressive timestamp prediction extends accurate forced alignment to more languages with lower computational cost than autoregressive alternatives (a minimal sketch of the single-pass idea follows this list).
- Releasing both ASR and alignment models under Apache 2.0 removes licensing barriers for downstream research and product integration.
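A minimal sketch of the non-autoregressive point above, assuming a cross-attention regression head (our construction, not Qwen3-ForcedAligner's published design): every text token gets its timestamps in a single forward pass, with no token-by-token decoding loop.

```python
import torch
import torch.nn as nn

class NARAligner(nn.Module):
    """Toy non-autoregressive aligner: one pass, all timestamps at once."""
    def __init__(self, d=256):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.head = nn.Linear(d, 2)    # per-token (start, end), e.g. seconds

    def forward(self, text_emb, audio_emb):
        # each text token attends over all audio frames simultaneously
        ctx, _ = self.xattn(text_emb, audio_emb, audio_emb)
        return self.head(ctx)          # [B, n_tokens, 2]; no decoding loop

aligner = NARAligner()
text  = torch.randn(1, 12, 256)    # embeddings of 12 text tokens
audio = torch.randn(1, 300, 256)   # 300 encoded audio frames
print(aligner(text, audio).shape)  # torch.Size([1, 12, 2])
```

An autoregressive aligner would instead run one decode step per token, so its latency scales with transcript length; the single-pass head is what buys the claimed efficiency.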
Where Pith is reading between the lines
- Wider adoption of these models could improve accessibility tools for speakers of lower-resource languages that currently receive poor support from proprietary services.
- Combining the ASR outputs with other Qwen3 components may enable end-to-end spoken dialogue systems that remain fully open.
- The emphasis on internal evaluation suggests that future ASR progress will depend more on representative test collections than on public leaderboard scores alone.
Load-bearing premise
Internal evaluations reveal meaningful real-world quality differences that standard open benchmarks do not capture.
What would settle it
Independent tests on diverse, noisy, real-world audio across multiple languages in which the 1.7B model falls below the top open-source ASR systems, or in which the 0.6B model loses its claimed efficiency advantage.
Original abstract
In this report, we introduce Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both of them leverage large-scale speech training data and the strong audio understanding ability of their foundation model Qwen3-Omni. We conduct comprehensive internal evaluation besides the open-sourced benchmarks as ASR models might differ little on open-sourced benchmark scores but exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-sourced ASR models and is competitive with the strongest proprietary APIs while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can achieve an average TTFT as low as 92ms and transcribe 2000 seconds speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM based NAR timestamp predictor that is able to align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest force alignment models and takes more advantages in efficiency and versatility. To further accelerate the community research of ASR and audio understanding, we release these models under the Apache 2.0 license.
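A back-of-envelope reading of the abstract's efficiency figures; the per-stream split below is our own arithmetic, not a number reported in the paper.

```python
audio_seconds = 2000   # speech transcribed per wall-clock second (abstract)
concurrency   = 128    # parallel streams (abstract)
ttft_ms       = 92     # average time to first token (abstract)

rtfx_total      = audio_seconds / 1.0        # 2000x faster than real time
rtfx_per_stream = rtfx_total / concurrency   # ~15.6x per stream
print(f"aggregate: {rtfx_total:.0f}x real time, TTFT {ttft_ms} ms")
print(f"implied per-stream: {rtfx_per_stream:.1f}x real time")
```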
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Qwen3-ASR family, comprising Qwen3-ASR-1.7B and Qwen3-ASR-0.6B models for multilingual ASR across 52 languages and dialects that build on the audio capabilities of the Qwen3-Omni foundation model, along with a Qwen3-ForcedAligner-0.6B non-autoregressive model for timestamp prediction in 11 languages. It asserts that the 1.7B model attains SOTA performance among open-source ASR systems and competes with leading proprietary APIs on the basis of comprehensive internal evaluations, that the 0.6B model provides the best accuracy-efficiency trade-off (with reported TTFT of 92 ms and throughput of 2000 seconds of speech per second at concurrency 128), and that the aligner outperforms the three strongest existing force-alignment models in timestamp accuracy while offering efficiency and versatility advantages. The models are released under Apache 2.0.
Significance. If the internal evaluation results prove representative and the comparisons are fair, the work would be significant by delivering competitive open-source multilingual ASR models with strong efficiency characteristics and by releasing them publicly, thereby lowering barriers to research in speech recognition and audio understanding.
major comments (2)
- [Abstract] The central SOTA and competitiveness claims for the 1.7B model (and the accuracy-efficiency trade-off for the 0.6B model) rest entirely on results from undisclosed 'comprehensive internal evaluation'; no test-set descriptions, language-specific metrics, error rates, comparison protocols against proprietary APIs, or quantitative tables appear in the manuscript, rendering the primary performance assertions unverifiable.
- [Abstract] The timestamp-accuracy superiority claimed for Qwen3-ForcedAligner-0.6B is stated without any reported metrics, baseline scores, or experimental details, which is load-bearing for the claim that it 'outperforms the three strongest force alignment models'.
minor comments (1)
- The manuscript would benefit from adding a dedicated results section or table that reports the internal evaluation numbers, hardware specifications, and exact comparison methodology to support the efficiency and accuracy claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater transparency in our evaluation results. We address the major comments point by point below and will revise the manuscript to incorporate additional details where feasible.
Point-by-point responses
- Referee: [Abstract] The central SOTA and competitiveness claims for the 1.7B model (and the accuracy-efficiency trade-off for the 0.6B model) rest entirely on results from undisclosed 'comprehensive internal evaluation'; no test-set descriptions, language-specific metrics, error rates, comparison protocols against proprietary APIs, or quantitative tables appear in the manuscript, rendering the primary performance assertions unverifiable.
  Authors: We agree that the manuscript would benefit from more explicit descriptions of the internal evaluations to improve verifiability. In the revised version, we will add a dedicated evaluation section that describes the test sets (including language coverage and sources where possible), language-specific metrics such as WER and CER, error-rate breakdowns, and the protocols used for comparisons against proprietary APIs. We will also include quantitative tables summarizing key results. Some internal datasets remain proprietary for commercial reasons, so we cannot release raw data or full test-set details, but we will provide sufficient methodological information to support the reported claims. Revision: yes
- Referee: [Abstract] The timestamp-accuracy superiority claimed for Qwen3-ForcedAligner-0.6B is stated without any reported metrics, baseline scores, or experimental details, which is load-bearing for the claim that it 'outperforms the three strongest force alignment models'.
  Authors: We acknowledge the absence of specific quantitative metrics and experimental details for the forced aligner in the current manuscript. In the revised version, we will add a new subsection under Experiments that reports timestamp accuracy metrics (e.g., mean alignment error or boundary precision), baseline scores from the three strongest existing force-alignment models, and full experimental setup details including evaluation datasets and protocols. This will directly substantiate the performance claims with numbers and comparisons. Revision: yes
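For illustration, a tiny sketch of the kind of boundary-error metric the rebuttal mentions (mean alignment error, plus boundary accuracy at a tolerance); the 50 ms tolerance and the data here are hypothetical.

```python
import numpy as np

ref  = np.array([[0.00, 0.41], [0.41, 0.90], [0.90, 1.35]])  # gold (start, end), s
pred = np.array([[0.02, 0.40], [0.43, 0.88], [0.95, 1.31]])  # predicted, s

errors   = np.abs(pred - ref)           # per-boundary absolute error
mean_err = errors.mean()                # mean alignment error
acc_50ms = (errors <= 0.05).mean()      # boundaries within 50 ms
print(f"mean alignment error: {mean_err * 1000:.1f} ms")  # ~26.7 ms
print(f"boundary accuracy @50 ms: {acc_50ms:.0%}")        # 100%
```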
Circularity Check
No circularity in derivation chain
Full rationale
The paper is an empirical technical report describing ASR model training, data scaling, and benchmark results with no mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction. Claims rest on reported performance numbers from open and internal evaluations rather than any self-definitional loop, renamed known result, or load-bearing self-citation chain. Reference to the prior Qwen3-Omni foundation model is ordinary transfer learning and does not create circularity under the specified patterns.
Axiom & Free-Parameter Ledger
free parameters (2)
- Model parameter counts
- Training data volume
axioms (1)
- Domain assumption: the Qwen3-Omni foundation model possesses strong audio understanding ability that transfers to ASR
Forward citations
Cited by 21 Pith papers
- TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
  TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
- FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
  FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
- AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR
  A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.
- Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
  Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement ...
- Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition
  LLM decoders in speech recognition show no racial bias amplification and fewer repetition hallucinations under degradation than Whisper, with audio encoder design mattering more than model scale for fairness and robustness.
- AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
  AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency wi...
- Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR
  Phoneme-based interfaces match or surpass projector-based ones for LLM ASR, especially in low-resource languages, and a BPE-phoneme hybrid offers additional improvements.
- Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
  A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
- VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
  VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
- When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
  Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
- LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
  LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestrati...
- ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
  ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
  A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
- Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization
  A three-stage synthetic data pipeline generates 8800 doctor-patient conversations totaling 1.3k hours of audio and LLM-produced SOAP notes, with evaluation showing cascaded transcription-then-summarization models outp...
- Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
  The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.
- Dolphin-CN-Dialect: Where Chinese Dialects Matter
  Dolphin-CN-Dialect is a compact ASR model that boosts Chinese dialect accuracy through balanced sampling of rare dialects and character-level tokenization while staying smaller than recent open-source competitors.
- Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding
  The 2026 PVUW Challenge introduces a new audio track and evaluates top multimodal methods on challenging video datasets for pixel-level understanding.
- NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
  NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.
- Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference
  A quantized int4 version of Nemotron ASR runs faster than real-time on CPU at 8.20% WER and 0.67 GB size, setting a new efficiency point for on-device streaming speech recognition.
- PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
  PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while fi...
- 2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA
  ASR-SaSaSa2VA turns audio into text via ASR then feeds it to pre-trained referring video segmentation models, achieving 80.7 and second place in the 5th PVUW MeViS-v2-Audio track.