MUSAN: A Music, Speech, and Noise Corpus
This report introduces a new corpus of music, speech, and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. Our corpus is released under a flexible Creative Commons license. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises. We demonstrate use of this corpus for music/speech discrimination on Broadcast news and VAD for speaker identification.
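A common way to use a corpus like this for training noise-robust VAD or speaker-ID models is additive noise augmentation: a noise clip is scaled and mixed into a speech clip at a chosen signal-to-noise ratio. The sketch below is illustrative only and is not from the paper; the `mix_at_snr` helper and its parameters are assumptions, operating on raw sample arrays (e.g. as loaded from MUSAN WAV files).

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the result has the target SNR in dB.

    Illustrative augmentation helper; assumes mono float arrays at a common
    sample rate (e.g. decoded from MUSAN's noise/ and speech/ partitions).
    """
    # Tile or truncate the noise clip to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that 10*log10(P_speech / P_noise_scaled) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent clips
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

In a typical pipeline this mixing is applied on the fly during training, sampling a random MUSAN noise clip and a random SNR (often in the 0–20 dB range) per utterance.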
Forward citations
Cited by 13 Pith papers
- Mechanistic Interpretability of ASR models using Sparse Autoencoders
  Sparse autoencoders applied to Whisper ASR reveal monosemantic features across linguistic boundaries and demonstrate cross-lingual feature steering.
- Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition
  LLM decoders in speech recognition show no racial bias amplification and fewer repetition hallucinations under degradation than Whisper, with audio encoder design mattering more than model scale for fairness and robustness.
- The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
  FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.
- SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
  SpeakerLLM unifies speaker profiling, recording-condition understanding, and structured verification reasoning in an audio-LLM via a hierarchical tokenizer and decision traces.
- Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
  HyPeR is a hybrid perception-reasoning framework that uses a new hierarchical PAQA dataset and PAUSE tokens to improve large audio language models' handling of multi-speaker and ambiguous audio.
- MALEFA: Multi-grAnularity Learning and Effective False Alarm Suppression for Zero-shot Keyword Spotting
  MALEFA reaches 90% accuracy and a 0.007% false alarm rate on AMI for zero-shot KWS via cross-attention and multi-granularity contrastive learning while running efficiently on constrained hardware.
- PhiNet: Speaker Verification with Phonetic Interpretability
  PhiNet adds phonetic interpretability to speaker verification while matching the accuracy of standard black-box models on VoxCeleb, SITW, and LibriSpeech.
- RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
  The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.
- Diffusion Reconstruction towards Generalizable Audio Deepfake Detection
  Diffusion reconstruction creates hard samples for audio deepfake detection training, and when paired with feature aggregation and RACL, it reduces average EER versus baselines.
- Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification
  Dual-LoRA with a language-anchored adversary achieves 0.91% EER on the TidyVoice benchmark for cross-lingual speaker verification by targeting true linguistic cues while preserving speaker discriminability.
- UNet-Based Fusion and Exponential Moving Average Adaptation for Noise-Robust Speaker Recognition
  Feeding noisy and enhanced speech together into a speaker encoder, with EMA adaptation from clean pre-training, improves recognition accuracy under noise.
- Enhancing Speaker Verification with Whispered Speech via Post-Processing
  Post-processing with an encoder-decoder model yields a 22% relative EER reduction on normal-vs-whispered trials and 1.88% EER on whispered-vs-whispered, outperforming ReDimNet-B2.
- Spoken Language Identification with Pre-trained Models and Margin Loss
  Pre-trained ECAPA-TDNN with margin losses reaches 85.95% macro and 90.96% micro accuracy on language identification, plus 17.08% EER on verification, beating the official baseline by 45.7%, 15.2%, and 50.8% respectively.