wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, Michael Auli · 2020

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.

The Indra Representation Hypothesis for Multimodal Alignment

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal robustness when instantiated with angular distance.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

emg2speech: Synthesizing speech from electromyography using self-supervised speech models

cs.SD · 2025-10-28 · conditional · novelty 6.0

EMG signals from orofacial muscles are mapped via linear transformation into self-supervised speech representation space to enable direct audio synthesis, shown on an ALS patient during silent articulation.

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

cs.CL · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.

Scaling Properties of Continuous Diffusion Spoken Language Models

cs.CL · 2026-04-27 · unverdicted · novelty 5.0

Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

citing papers explorer

Showing 6 of 6 citing papers.

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
cs.CV · 2026-05-13 · unverdicted · none · ref 3
SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.

The Indra Representation Hypothesis for Multimodal Alignment
cs.CV · 2026-04-06 · unverdicted · none · ref 2
Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal robustness when instantiated with angular distance.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG · 2026-05-01 · unverdicted · none · ref 1
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

emg2speech: Synthesizing speech from electromyography using self-supervised speech models
cs.SD · 2025-10-28 · conditional · none · ref 18
EMG signals from orofacial muscles are mapped via linear transformation into self-supervised speech representation space to enable direct audio synthesis, shown on an ALS patient during silent articulation.

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
cs.CL · 2026-05-07 · unverdicted · none · ref 26 · 2 links
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.

Scaling Properties of Continuous Diffusion Spoken Language Models
cs.CL · 2026-04-27 · unverdicted · none · ref 2
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
