PreferenceASR is a preference-aware ASR test set built from seven corpora that shows model rankings change when user output-style instructions are considered.
arXiv preprint arXiv:2509.14128
17 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 17
roles: background 1
polarities: background 1
representative citing papers
A fused self-supervised encoder and learned DP decoder for word alignment outperforms MFA on English datasets and generalizes to unseen languages.
Proposes cross-talk reduction task with CTRnet and pseudo-label far-field separation (PuLSS) to train on real close-talk/far-field pairs, achieving SOTA ASR on CHiME-6 and outperforming guided source separation.
A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.
Contextual Earnings-22 is a new benchmark dataset showing that scaled keyword prompting and boosting both deliver significantly better accuracy on custom vocabularies than standard academic tests.
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
Dual-reference benchmarking on atypical stuttered speech reveals disparities in ASR model performance and rankings between verbatim and intended transcriptions.
TRADE augments multimodal Speech LLMs with a transducer branch for streaming ASR, reporting 6.71% WER offline and 8.40% streaming on the Open ASR Leaderboard from one checkpoint.
Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.
CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
NPUsper reduces per-word latency, TTFT, and power for Whisper on mobile NPUs via online hallucination detection and K-step chunk graphs while preserving accuracy.
GlobeAudio is a new multilingual multicultural benchmark for naturalistic evaluation of large audio-language models, showing performance gaps especially for open-source models and low-resource languages.
Symphony is a medical-grade speech recognition system that decomposes transcription into specialized components and outperforms existing systems in clinical settings while matching them in general domains.
Classical codecs prove more robust to noise than neural codecs, speech enhancement significantly helps noise-affected codecs, and listening effort plus ASR-based metrics add useful nuance beyond basic intelligibility scores.
BUT's CHiME-9 MCoRec system conditions Parakeet-v2 ASR on AV-HuBERT visuals for 33.7% WER and uses Qwen3.5 LLM for hierarchical clustering to reach 0.97 F1, beating the baseline by 16.2% WER and 0.15 F1 on the development set.
A 1B-parameter multilingual offline model is adapted with AlignAtt policy for simultaneous speech translation and submitted to IWSLT 2026 for three language pairs.
citing papers explorer
- ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models