Common voice: A massively-multilingual speech corpus

9 Pith papers cite this work.

representative citing papers

High Fidelity Neural Audio Compression: EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.
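The gradient-normalizing loss balancer mentioned in the EnCodec summary can be illustrated generically: each loss's gradient is rescaled so its contribution to the update reflects a chosen weight rather than its raw magnitude. This is a minimal NumPy sketch of that idea, not the paper's implementation; the function and weight names are illustrative.

```python
import numpy as np

def balance_gradients(grads, weights, scale=1.0, eps=1e-12):
    """Rescale per-loss gradients so each contributes in proportion to its
    weight, regardless of raw magnitude (gradient-normalizing balancer sketch)."""
    total = sum(weights.values())
    balanced = {}
    for name, g in grads.items():
        g = np.asarray(g, dtype=float)
        norm = np.linalg.norm(g)
        # normalize to unit norm, then scale by the relative weight
        balanced[name] = scale * (weights[name] / total) * g / (norm + eps)
    return balanced

# toy example: two losses whose raw gradients differ by three orders of magnitude
grads = {"reconstruction": np.array([3.0, 4.0]),      # norm 5
         "adversarial":    np.array([0.003, 0.004])}  # norm 0.005
weights = {"reconstruction": 1.0, "adversarial": 3.0}
out = balance_gradients(grads, weights)
# after balancing, the norms follow the weights, not the raw magnitudes
```

Without balancing, the adversarial term here would be drowned out by the reconstruction term; after balancing, the relative contribution is set entirely by the weights.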
BlasBench supplies an Irish-aware normalizer and scoring harness that enables reproducible ASR comparisons and exposes a 33-43 point generalization gap for fine-tuned models versus 7-10 points for massively multilingual ones.
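BlasBench's actual Irish-aware normalizer is not reproduced here, but the reason a normalizer matters for reproducible ASR comparisons can be shown with a generic sketch: a simple lowercasing, punctuation-stripping normalizer and a standard edit-distance word error rate. The normalizer below is an assumption for illustration only; a real Irish-aware one would also handle language-specific forms.

```python
import re

def normalize(text):
    # generic normalizer sketch: lowercase, strip punctuation, collapse spaces
    # (an Irish-aware normalizer would handle much more than this)
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / len(r)

ref = "Dia duit, a chara."
hyp = "dia duit a chara"
raw = wer(ref, hyp)                         # case/punctuation counted as errors
norm = wer(normalize(ref), normalize(hyp))  # identical after normalization
```

Scoring the same hypothesis with and without normalization changes the measured WER drastically, which is why a shared normalizer is a precondition for fair cross-system comparisons.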
HARNESS (Lightweight Distilled Arabic Speech Foundation Models) introduces Arabic-centric speech foundation models that achieve high efficiency and performance through iterative self-distillation and PCA-based signal compression.
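PCA-based signal compression, as mentioned for HARNESS, can be illustrated generically: project feature frames onto their top principal components and reconstruct from the reduced representation. This NumPy sketch shows the general technique under assumed shapes and synthetic data; it is not the paper's pipeline.

```python
import numpy as np

def pca_compress(frames, k):
    """Compress (T, D) feature frames to k principal components.
    Returns the compressed codes, the data mean, and the projection basis."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # SVD of the centered data: rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                 # (k, D)
    codes = centered @ basis.T     # (T, k)
    return codes, mean, basis

def pca_reconstruct(codes, mean, basis):
    return codes @ basis + mean

rng = np.random.default_rng(0)
# synthetic (T=200, D=32) frames that actually live in a 4-dim subspace
latent = rng.normal(size=(200, 4))
mixer = rng.normal(size=(4, 32))
frames = latent @ mixer
codes, mean, basis = pca_compress(frames, k=4)
recon = pca_reconstruct(codes, mean, basis)
err = np.abs(frames - recon).max()   # near zero: 4 components suffice here
```

The compression ratio is D/k per frame (8x in this toy example); on real speech features the reconstruction error grows gracefully as k shrinks, which is what makes PCA a cheap compression step.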
Incidental multilingualism from uneven web training makes LLMs unequal, brittle, and opaque across languages.
Combining LLM-based elderly-contextual paraphrasing with TTS synthesis using elderly speakers reduces word error rates in elderly ASR by up to 58% over standard Whisper baselines on English and Korean datasets.
The IQRA 2026 challenge on Arabic mispronunciation detection reports a 0.28 F1-score gain from new authentic human error data and diverse modeling approaches including self-supervised and audio-language models.
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
In-Sync shows that lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.
CNNs using MFCC features achieve 91.79% accuracy for keyword spotting in Hindi speech on a 40,000-sample dataset.
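The MFCC front end behind the keyword-spotting result above can be sketched in plain NumPy: framing with a window, power spectrum, mel filterbank, log compression, and a DCT. This is a simplified illustration with assumed parameters (sample rate, filter counts); the paper's exact settings are not given here, and real systems usually rely on a library implementation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    # frame the signal with a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2       # (T, n_fft//2+1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T     # (T, n_mels)
    logmel = np.log(mel + 1e-10)
    # DCT-II to decorrelate; keep the first n_mfcc coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), n + 0.5) / n_mels)
    return logmel @ dct.T                                 # (T, n_mfcc)

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of A4 at 16 kHz
feats = mfcc(tone)   # (frames, 13) feature matrix, a typical CNN input
```

Each one-second utterance becomes a small 2-D feature map (frames by coefficients), which is what makes a compact CNN a natural classifier for keyword spotting.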