Common voice: A massively-multilingual speech corpus

9 Pith papers cite this work.

representative citing papers

High Fidelity Neural Audio Compression: EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.
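The gradient-normalizing loss balancer mentioned in the EnCodec summary can be illustrated generically: each loss's gradient is rescaled so its contribution to the update reflects a chosen weight rather than its raw magnitude. This is a minimal NumPy sketch of that idea, not the paper's implementation; the function and weight names are illustrative.

```python
import numpy as np

def balance_gradients(grads, weights, scale=1.0, eps=1e-12):
    """Rescale per-loss gradients so each contributes in proportion to its
    weight, regardless of raw magnitude (gradient-normalizing balancer sketch)."""
    total = sum(weights.values())
    balanced = {}
    for name, g in grads.items():
        g = np.asarray(g, dtype=float)
        norm = np.linalg.norm(g)
        # normalize to unit norm, then scale by the relative weight
        balanced[name] = scale * (weights[name] / total) * g / (norm + eps)
    return balanced

# toy example: two losses whose raw gradients differ by three orders of magnitude
grads = {"reconstruction": np.array([3.0, 4.0]),      # norm 5
         "adversarial":    np.array([0.003, 0.004])}  # norm 0.005
weights = {"reconstruction": 1.0, "adversarial": 3.0}
out = balance_gradients(grads, weights)
# after balancing, the norms follow the weights, not the raw magnitudes
```

Without balancing, the adversarial term here would be drowned out by the reconstruction term; after balancing, the relative contribution is set entirely by the weights.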
BlasBench supplies an Irish-aware normalizer and scoring harness that enables reproducible ASR comparisons and exposes a 33-43 point generalization gap for fine-tuned models versus 7-10 points for massively multilingual ones.
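BlasBench's actual Irish-aware normalizer is not reproduced here, but the reason a normalizer matters for reproducible ASR comparisons can be shown with a generic sketch: a simple lowercasing, punctuation-stripping normalizer and a standard edit-distance word error rate. The normalizer below is an assumption for illustration only; a real Irish-aware one would also handle language-specific forms.

```python
import re

def normalize(text):
    # generic normalizer sketch: lowercase, strip punctuation, collapse spaces
    # (an Irish-aware normalizer would handle much more than this)
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / len(r)

ref = "Dia duit, a chara."
hyp = "dia duit a chara"
raw = wer(ref, hyp)                         # case/punctuation counted as errors
norm = wer(normalize(ref), normalize(hyp))  # identical after normalization
```

Scoring the same hypothesis with and without normalization changes the measured WER drastically, which is why a shared normalizer is a precondition for fair cross-system comparisons.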
HARNESS (Lightweight Distilled Arabic Speech Foundation Models) introduces Arabic-centric speech foundation models that achieve high efficiency and performance through iterative self-distillation and PCA-based signal compression.
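PCA-based signal compression, as mentioned for HARNESS, can be illustrated generically: project feature frames onto their top principal components and reconstruct from the reduced representation. This NumPy sketch shows the general technique under assumed shapes and synthetic data; it is not the paper's pipeline.

```python
import numpy as np

def pca_compress(frames, k):
    """Compress (T, D) feature frames to k principal components.
    Returns the compressed codes, the data mean, and the projection basis."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # SVD of the centered data: rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                 # (k, D)
    codes = centered @ basis.T     # (T, k)
    return codes, mean, basis

def pca_reconstruct(codes, mean, basis):
    return codes @ basis + mean

rng = np.random.default_rng(0)
# synthetic (T=200, D=32) frames that actually live in a 4-dim subspace
latent = rng.normal(size=(200, 4))
mixer = rng.normal(size=(4, 32))
frames = latent @ mixer
codes, mean, basis = pca_compress(frames, k=4)
recon = pca_reconstruct(codes, mean, basis)
err = np.abs(frames - recon).max()   # near zero: 4 components suffice here
```

The compression ratio is D/k per frame (8x in this toy example); on real speech features the reconstruction error grows gracefully as k shrinks, which is what makes PCA a cheap compression step.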
Incidental multilingualism from uneven web training makes LLMs unequal, brittle, and opaque across languages.
Combining LLM-based elderly-contextual paraphrasing with TTS synthesis using elderly speakers reduces word error rates in elderly ASR by up to 58% over standard Whisper baselines on English and Korean datasets.
The IQRA 2026 challenge on Arabic mispronunciation detection reports a 0.28 F1-score gain from new authentic human error data and diverse modeling approaches including self-supervised and audio-language models.
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
In-Sync shows that lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.
CNNs using MFCC features achieve 91.79% accuracy for keyword spotting in Hindi speech on a 40,000-sample dataset.
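The MFCC front end behind the keyword-spotting result above can be sketched in plain NumPy: framing with a window, power spectrum, mel filterbank, log compression, and a DCT. This is a simplified illustration with assumed parameters (sample rate, filter counts); the paper's exact settings are not given here, and real systems usually rely on a library implementation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    # frame the signal with a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2       # (T, n_fft//2+1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T     # (T, n_mels)
    logmel = np.log(mel + 1e-10)
    # DCT-II to decorrelate; keep the first n_mfcc coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), n + 0.5) / n_mels)
    return logmel @ dct.T                                 # (T, n_mfcc)

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of A4 at 16 kHz
feats = mfcc(tone)   # (frames, 13) feature matrix, a typical CNN input
```

Each one-second utterance becomes a small 2-D feature map (frames by coefficients), which is what makes a compact CNN a natural classifier for keyword spotting.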