WavLM: Large-scale self-supervised pre-training for full stack speech processing

2022

9 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

Joint Fullband-Subband Modeling for High-Resolution SingFake Detection

cs.SD · 2026-04-06 · unverdicted · novelty 7.0

A joint fullband-subband model using high-resolution 44.1 kHz audio outperforms standard 16 kHz detectors for singing voice deepfake detection by exploiting spectrum-specific synthesis artifacts.

Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

cs.SD · 2026-03-05 · unverdicted · novelty 7.0

MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.

Multimodal LLMs are not all you need for Pediatric Speech Language Pathology

cs.CL · 2026-04-29 · unverdicted · novelty 5.0

Fine-tuned speech representation models with hierarchical classification outperform multimodal LLMs on pediatric speech sound disorder tasks.

Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

cs.CL · 2026-04-07 · unverdicted · novelty 5.0

Mixed batching with only 10% target-domain speech achieves word error rates matching or exceeding conventional full-dataset ASR fine-tuning in LLM-based models.

Contrastive Regularization for Accent-Robust ASR

cs.SD · 2026-05-05 · unverdicted · novelty 4.0

Supervised contrastive learning as an auxiliary loss during CTC fine-tuning improves accent robustness in ASR, yielding up to 29% relative WER reduction on unseen accents.

Learning to Attend to Depression-Related Patterns: An Adaptive Cross-Modal Gating Network for Depression Detection

cs.SD · 2026-04-11 · unverdicted · novelty 4.0

An adaptive cross-modal gating network improves depression detection from speech by selectively weighting sparse relevant segments across acoustic and textual modalities.

Exploring Speech Foundation Models for Speaker Diarization Across Lifespan

eess.AS · 2026-04-06 · unverdicted · novelty 4.0 · 2 refs

Cross-lifespan evaluation shows that adult-trained speech foundation models degrade on child and older-adult data; joint multi-age training and targeted adaptation improve robustness, especially with the Whisper encoder.

Unmixing The Crowd: Learning Persistent Speaker Representations from Mixture-Derived Multi-Speaker Embeddings

eess.AS · 2026-04-03

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

cs.CL · 2026-03-17

citing papers explorer

Showing 9 of 9 citing papers.

Joint Fullband-Subband Modeling for High-Resolution SingFake Detection

cs.SD · 2026-04-06 · unverdicted · none · ref 30

Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

cs.SD · 2026-03-05 · unverdicted · none · ref 44

Multimodal LLMs are not all you need for Pediatric Speech Language Pathology

cs.CL · 2026-04-29 · unverdicted · none · ref 29

Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

cs.CL · 2026-04-07 · unverdicted · none · ref 24

Contrastive Regularization for Accent-Robust ASR

cs.SD · 2026-05-05 · unverdicted · none · ref 9

Learning to Attend to Depression-Related Patterns: An Adaptive Cross-Modal Gating Network for Depression Detection

cs.SD · 2026-04-11 · unverdicted · none · ref 14

Exploring Speech Foundation Models for Speaker Diarization Across Lifespan

eess.AS · 2026-04-06 · unverdicted · none · ref 9 · 2 links

Unmixing The Crowd: Learning Persistent Speaker Representations from Mixture-Derived Multi-Speaker Embeddings

eess.AS · 2026-04-03 · unreviewed · ref 18

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

cs.CL · 2026-03-17 · unreviewed · ref 37
