pith. machine review for the scientific record.

arxiv: 2603.01482 · v1 · submitted 2026-03-02 · 📡 eess.AS · cs.AI · cs.LG · eess.SP

Recognition: unknown

A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

Authors on Pith: no claims yet
classification 📡 eess.AS · cs.AI · cs.LG · eess.SP
keywords: models · audio · benchmark · deepfake · detection · discriminative · speech · generative

Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.
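The SUPERB evaluation convention the abstract invokes keeps the upstream SSL model frozen and trains only a learnable weighted sum over its hidden layers plus a lightweight downstream head. A minimal PyTorch sketch of that pattern for a bonafide/spoof probe is below; the mean-pooling head and the random tensor standing in for the frozen encoder's hidden states are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """SUPERB-style learnable softmax-weighted sum over upstream hidden layers."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):  # (num_layers, batch, time, dim)
        w = torch.softmax(self.weights, dim=0)
        return (w.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)

class SpoofProbe(nn.Module):
    """Lightweight downstream head: layer mixing, temporal mean pooling,
    then a linear bonafide-vs-spoof classifier. The upstream stays frozen."""
    def __init__(self, num_layers, dim):
        super().__init__()
        self.layer_sum = WeightedLayerSum(num_layers)
        self.head = nn.Linear(dim, 2)  # two classes: bonafide, spoof

    def forward(self, hidden_states):
        x = self.layer_sum(hidden_states)  # (batch, time, dim)
        x = x.mean(dim=1)                  # pool over time frames
        return self.head(x)                # (batch, 2) logits

# Toy example: random tensors stand in for a frozen SSL encoder's
# 13 layers of hidden states (e.g. a base-sized transformer).
num_layers, batch, time, dim = 13, 4, 50, 768
states = torch.randn(num_layers, batch, time, dim)
probe = SpoofProbe(num_layers, dim)
logits = probe(states)
print(logits.shape)  # torch.Size([4, 2])
```

Keeping the upstream frozen is what makes comparisons across 20 models fair: every encoder is probed through the same small trainable head.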

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

    eess.AS · 2026-04 · unverdicted · novelty 4.0

    Cosine similarity in SupCon with a delayed negative queue on wav2vec2 XLS-R yields the lowest equal error rates for deepfake audio detection on in-the-wild and pooled evaluations.