MUSAN: A Music, Speech, and Noise Corpus
This report introduces a new corpus of music, speech, and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. Our corpus is released under a flexible Creative Commons license. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises. We demonstrate use of this corpus for music/speech discrimination on Broadcast news and VAD for speaker identification.
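A common way to use a corpus like this for training noise-robust VAD or speaker-ID models is additive noise augmentation: a noise clip is scaled and mixed into a speech clip at a chosen signal-to-noise ratio. The sketch below is illustrative only and is not from the paper; the `mix_at_snr` helper and its parameters are assumptions, operating on raw sample arrays (e.g. as loaded from MUSAN WAV files).

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the result has the target SNR in dB.

    Illustrative augmentation helper; assumes mono float arrays at a common
    sample rate (e.g. decoded from MUSAN's noise/ and speech/ partitions).
    """
    # Tile or truncate the noise clip to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that 10*log10(P_speech / P_noise_scaled) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent clips
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

In a typical pipeline this mixing is applied on the fly during training, sampling a random MUSAN noise clip and a random SNR (often in the 0–20 dB range) per utterance.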
Forward citations
Cited by 13 Pith papers
- Mechanistic Interpretability of ASR models using Sparse Autoencoders
  Sparse autoencoders applied to Whisper ASR reveal monosemantic features across linguistic boundaries and demonstrate cross-lingual feature steering.
- Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition
  LLM decoders in speech recognition show no racial bias amplification and fewer repetition hallucinations under degradation than Whisper, with audio encoder design mattering more than model scale for fairness and robustness.
- The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
  FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.
- SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
  SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.
- Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
  HyPeR is a hybrid perception-reasoning framework that uses a new hierarchical PAQA dataset and PAUSE tokens to improve large audio language models' handling of multi-speaker and ambiguous audio.
- MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-shot Keyword Spotting
  MALEFA reaches 90% accuracy and a 0.007% false alarm rate on AMI for zero-shot KWS via cross-attention and multi-granularity contrastive learning while running efficiently on constrained hardware.
- PhiNet: Speaker Verification with Phonetic Interpretability
  PhiNet adds phonetic interpretability to speaker verification while matching the accuracy of standard black-box models on VoxCeleb, SITW, and LibriSpeech.
- RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
  The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.
- Diffusion Reconstruction towards Generalizable Audio Deepfake Detection
  Diffusion reconstruction creates hard samples for audio deepfake detection training, and when paired with feature aggregation and RACL, it reduces average EER versus baselines.
- Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification
  Dual-LoRA with a language-anchored adversary achieves 0.91% EER on the TidyVoice benchmark for cross-lingual speaker verification by targeting true linguistic cues while preserving speaker discriminability.
- UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition
  Feeding noisy and enhanced speech together into a speaker encoder, with EMA adaptation from clean pre-training, improves recognition accuracy under noise.
- Enhancing Speaker Verification with Whispered Speech via Post-Processing
  Post-processing with an encoder-decoder model yields a 22% relative EER reduction on normal-vs-whispered trials and 1.88% EER on whispered-vs-whispered, outperforming ReDimNet-B2.
- Spoken Language Identification with Pre-trained Models and Margin Loss
  Pre-trained ECAPA-TDNN with margin losses reaches 85.95% macro and 90.96% micro accuracy on language identification, plus 17.08% EER on verification, beating the official baseline by 45.7%, 15.2%, and 50.8% respectively.