WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
9 Pith papers cite this work.
Citation summary
Years: 2026 (9)
Roles: dataset (1)
Polarities: use dataset (1)
citing papers explorer
- Joint Fullband-Subband Modeling for High-Resolution SingFake Detection
  A joint fullband-subband model using high-resolution 44.1 kHz audio outperforms standard 16 kHz detectors for singing voice deepfake detection by exploiting spectrum-specific synthesis artifacts.
- Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection
  MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.
- Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
  Fine-tuned speech representation models with hierarchical classification outperform multimodal LLMs on pediatric speech sound disorder tasks.
- Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR
  Mixed batching with only 10% target-domain speech achieves word error rates matching or exceeding conventional full-dataset ASR fine-tuning in LLM-based models.
- Contrastive Regularization for Accent-Robust ASR
  Supervised contrastive learning as an auxiliary loss during CTC fine-tuning improves accent robustness in ASR, yielding up to 29% relative WER reduction on unseen accents.
- Learning to Attend to Depression-Related Patterns: An Adaptive Cross-Modal Gating Network for Depression Detection
  An adaptive cross-modal gating network improves depression detection from speech by selectively weighting sparse relevant segments across acoustic and textual modalities.
- Exploring Speech Foundation Models for Speaker Diarization Across Lifespan
  Cross-lifespan evaluation shows that adult-trained speech foundation models degrade on child and older-adult data; joint multi-age training and targeted adaptation improve robustness, especially with the Whisper encoder.
- Unmixing The Crowd: Learning Persistent Speaker Representations from Mixture-Derived Multi-Speaker Embeddings
- WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models