pith. machine review for the scientific record.

arxiv: 2604.26057 · v1 · submitted 2026-04-28 · 📡 eess.AS · cs.LG

Recognition: unknown

Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 13:44 UTC · model grok-4.3

classification 📡 eess.AS cs.LG
keywords supervised contrastive learning · deepfake audio detection · similarity measures · negative scaling · equal error rate · wav2vec2 · ASVspoof

The pith

Cosine similarity with a delayed negative queue in supervised contrastive learning yields the lowest equal error rates for deepfake audio detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled study on supervised contrastive learning applied to audio deepfake detection with a wav2vec2 XLS-R encoder. It isolates the effects of two similarity functions inside the contrastive loss and the use of a warm-started global cross-batch queue for negative scaling. Cosine similarity paired with the delayed queue records the best in-the-wild EER of 8.29 percent and pooled EER of 4.44 percent, while angular similarity achieves competitive results without the queue. A reader would care because these modest changes to the contrastive objective appear to improve detection reliability on both standard benchmarks and in-the-wild data without enlarging the model or dataset.
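For concreteness, the two similarity functions under comparison can be sketched as follows. This is an editorial illustration, not the authors' implementation; it assumes ℓ2-normalized embeddings, so the cosine is a plain dot product and the angular (geodesic) similarity is the negated arc length on the unit hypersphere.

```python
import torch

def cosine_similarity(zi, zj):
    # For l2-normalized embeddings the cosine is just the dot product.
    return (zi * zj).sum(-1)

def angular_similarity(zi, zj, eps=1e-7):
    # Hyperspherical (geodesic) similarity: the negated angle between the
    # embeddings on the unit sphere, so larger values mean "more similar".
    # Clamping guards acos against floating-point values just outside [-1, 1].
    cos = (zi * zj).sum(-1).clamp(-1 + eps, 1 - eps)
    return -torch.acos(cos)
```

Either function can be slotted into the SupCon logits before temperature scaling; per the paper's discussion, the optimal temperature differs substantially between the two.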

Core claim

After stage-one fine-tuning of the encoder and projection head with the supervised contrastive objective on ASVspoof 2019 LA, followed by stage-two training of a linear classifier with binary cross-entropy, cosine similarity combined with a delayed queue produces superior equal error rates on ASVspoof 2019 evaluation, ASVspoof 2021 DF and LA, and in-the-wild test sets compared with angular similarity or absence of the queue.

What carries the argument

The supervised contrastive loss computed with either cosine or hyperspherical angular similarity, augmented by a warm-started global cross-batch queue that supplies and scales negative samples.
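As an editorial sketch of the mechanism (not the authors' code), the queue-augmented SupCon objective can be written as follows; it assumes cosine similarity, and follows the figure captions in letting the queue contribute negatives only to the denominator.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, queue=None, tau=0.3):
    """Supervised contrastive loss over a batch of embeddings z (B x D).

    Sketch under assumptions: cosine similarity with temperature tau; an
    optional queue of stale embeddings adds extra negatives to the
    denominator only, mirroring the delayed global FIFO queue.
    """
    z = F.normalize(z, dim=1)
    logits = z @ z.T / tau                              # in-batch pairs
    if queue is not None:
        logits = torch.cat([logits, z @ F.normalize(queue, dim=1).T / tau], dim=1)

    B, K = z.size(0), logits.size(1)
    self_mask = torch.zeros(B, K, dtype=torch.bool)
    self_mask[:, :B] = torch.eye(B, dtype=torch.bool)
    pos_mask = torch.zeros(B, K, dtype=torch.bool)
    pos_mask[:, :B] = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask &= ~self_mask                              # exclude the anchor itself

    # Softmax over all candidates except the anchor; queue columns are
    # never positives here, they only enlarge the denominator.
    logits = logits.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    n_pos = pos_mask.sum(1).clamp(min=1)                # guard lone-class anchors
    return -(log_prob.masked_fill(~pos_mask, 0.0).sum(1) / n_pos).mean()
```

Adding queued negatives strictly enlarges the denominator, so the loss sees more contrast per anchor without growing the batch, which is the efficiency argument the review summarizes.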

If this is right

  • Cosine SupCon with a delayed queue reduces dependence on very large negative sets while still improving detection.
  • Angular similarity supports strong performance even when queued negatives are unavailable.
  • The two-stage separation of contrastive representation learning from linear classification remains effective across the tested similarity variants.
  • Gains appear consistently on both ASVspoof 2021 and in-the-wild evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same similarity and queue ablations could be tested on other audio classification tasks that use contrastive pre-training.
  • Lower reliance on negative-sample volume may reduce training memory and compute costs in larger-scale audio models.
  • Parallel controlled studies on similarity choice might reveal similar patterns in contrastive learning for image or video deepfake detection.

Load-bearing premise

Observed EER differences arise chiefly from the similarity function and negative-scaling choices rather than from interactions with other training details or dataset specifics.

What would settle it

Re-training the identical pipeline while changing only the similarity function and queue delay, then checking whether the reported EER gaps disappear.

Figures

Figures reproduced from arXiv: 2604.26057 by Hafiz Malik, Hashim Ali, Jaskirat Sudan, Surya Subramani.

Figure 1. Model architecture.
Figure 2. t-SNE visualization of ITW embeddings across temperatures (no queue). Rows correspond to cosine and geodesic similarity; columns to τ ∈ {0.07, 0.1, 0.3, 0.6}.
Figure 3. t-SNE visualization of ITW embeddings across queue sizes. Rows correspond to cosine (τ = 0.30) and geodesic (τ = 0.07); columns to |Q| ∈ {0, 256, 1024, 4096}.
Figure 4. EER across datasets for cosine vs. geodesic and different queue sizes.
Original abstract

Supervised contrastive learning (SupCon) is widely used to shape representations, but has seen limited targeted study for audio deepfake detection. Existing work typically combines contrastive terms with broader pipelines; however, the focus on SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS-R (300M) that varies (i) similarity in SupCon (cosine vs angular similarity derived from the hyperspherical angle) and (ii) negative scaling using a warm-started global cross-batch queue. Stage 1 fine-tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on ASV19 eval plus ITW and ASVspoof 2021 DF/LA, Cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44), while angular similarity performs strongly without queued negatives (ITW 8.70), indicating reduced reliance on large negative sets.
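The equal error rate quoted throughout is the operating point where false acceptance and false rejection rates coincide. A minimal sketch of how an EER might be computed from detection scores (a plain threshold sweep; evaluation toolkits interpolate the ROC more carefully):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Equal error rate for a detector.

    labels: 1 = bona fide, 0 = spoof; higher score = more likely bona fide.
    Returns (FAR + FRR) / 2 at the threshold where the two rates are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = np.inf, 1.0
    for t in np.r_[np.sort(np.unique(scores)), np.inf]:
        far = np.mean(scores[labels == 0] >= t)   # spoof accepted
        frr = np.mean(scores[labels == 1] < t)    # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)
```

On this scale the paper's 8.29% ITW figure means that, at the balanced threshold, roughly one in twelve trials is misclassified in each direction.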

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a controlled study on supervised contrastive learning (SupCon) for deepfake audio detection with wav2vec2 XLS-R (300M). It varies similarity (cosine vs. angular, derived from hyperspherical angle) and negative scaling (delayed global cross-batch queue vs. none) during Stage-1 SupCon fine-tuning of the encoder and projection head; Stage 2 freezes the model and trains a linear head with BCE. Trained on ASVspoof 2019 LA and evaluated on ASVspoof 2019 eval, ITW, and ASVspoof 2021 DF/LA, the results indicate that cosine SupCon with the delayed queue yields the lowest ITW EER (8.29%) and pooled EER (4.44), while angular similarity performs competitively without queued negatives (ITW EER 8.70%), suggesting reduced reliance on large negative sets.

Significance. If the attribution to similarity choice and queue-based scaling holds after proper controls, the work supplies targeted, practical guidance on SupCon design for audio deepfake detection. The concrete EER numbers and the efficiency observation for angular similarity could inform representation-learning pipelines that must operate with limited negative samples or compute, especially when building on large pre-trained models such as XLS-R.

major comments (2)
  1. [Abstract / §4 (Experiments)] Abstract and experimental description: the reported EER ranking (cosine+queue at 8.29% ITW vs. angular+no-queue at 8.70%) is presented as evidence that the two varied axes drive performance, yet the manuscript provides no explicit statement or ablation confirming that temperature, projection-head dimension, learning-rate schedule, augmentations, and queue hyperparameters were held fixed across runs. Without such isolation the performance gap cannot be confidently assigned to similarity and negative scaling rather than unablated interactions.
  2. [Abstract / Results tables] Results presentation: the abstract and any accompanying tables report point EER values (e.g., 8.29%, 4.44, 8.70) with no error bars, standard deviations across random seeds, or statistical significance tests. This omission weakens the claim that one configuration is reliably superior and makes it impossible to judge whether observed differences exceed typical training variance.
minor comments (2)
  1. [Abstract] The abstract would be clearer if it explicitly named the training set (ASVspoof 2019 LA) and all evaluation sets in a single sentence.
  2. [Results] A consolidated table listing EER for every similarity/queue combination on every test set would improve readability and allow direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below.

Point-by-point responses
  1. Referee: [Abstract / §4 (Experiments)] Abstract and experimental description: the reported EER ranking (cosine+queue at 8.29% ITW vs. angular+no-queue at 8.70%) is presented as evidence that the two varied axes drive performance, yet the manuscript provides no explicit statement or ablation confirming that temperature, projection-head dimension, learning-rate schedule, augmentations, and queue hyperparameters were held fixed across runs. Without such isolation the performance gap cannot be confidently assigned to similarity and negative scaling rather than unablated interactions.

    Authors: We agree that making the controlled nature of the study more explicit would strengthen the paper. Although the manuscript describes the work as a 'controlled study' varying only the two specified factors, we will add a clear statement in the revised Section 4 confirming that temperature, projection-head dimension, learning-rate schedule, augmentations, and queue hyperparameters were held fixed across all runs. This will better isolate the effects of similarity choice and negative scaling. revision: yes

  2. Referee: [Abstract / Results tables] Results presentation: the abstract and any accompanying tables report point EER values (e.g., 8.29%, 4.44, 8.70) with no error bars, standard deviations across random seeds, or statistical significance tests. This omission weakens the claim that one configuration is reliably superior and makes it impossible to judge whether observed differences exceed typical training variance.

    Authors: We recognize that including measures of variability would enhance confidence in the results. Due to the high computational cost associated with fine-tuning the large XLS-R model, each configuration was evaluated with a single training run. The performance gaps (e.g., 0.41% EER difference on ITW) are substantial relative to expected variance in such tasks, but we will include a note in the revised manuscript acknowledging the single-run limitation and the absence of statistical significance testing. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical comparisons rest on direct experiments

Full rationale

The paper reports results from a controlled experimental study on wav2vec2 XLS-R using SupCon in stage 1 followed by BCE in stage 2. It varies only the similarity function (cosine vs. angular) and negative scaling (delayed queue vs. none), trains on ASVspoof 2019 LA, and measures EER on the ASV19 eval, ITW, and ASVspoof 2021 test sets. All reported numbers (e.g., 8.29% ITW EER for cosine+queue) are obtained by training and testing; no equations, predictions, or first-principles derivations are presented that could reduce to the inputs by construction. No self-citations appear in the provided text, and no fitted parameters are relabeled as predictions. The work is self-contained against external benchmarks via explicit dataset splits and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is a purely empirical ablation study. It relies on standard supervised learning assumptions (i.i.d. data, representativeness of ASVspoof corpora) and common ML hyperparameters but introduces no new free parameters, axioms, or invented entities beyond those already standard in contrastive learning pipelines.

pith-pipeline@v0.9.0 · 5503 in / 1083 out tokens · 57542 ms · 2026-05-07T13:44:51.087442+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

    Introduction. Recent advances in neural text-to-speech (TTS) and voice conversion (VC) systems have drastically improved the realism of synthetic speech. These modern generative systems are capable of synthesizing speech that is perceptually indistinguishable from genuine recordings. While these technologies enable beneficial applications including p...

  2. [2]

    Related Work. Recent work has used supervised contrastive learning (SupCon) to improve robustness, but typically by engineering hard negatives and structuring mini-batches. Trident of Poseidon introduces a triad training strategy that combines supervised contrastive learning with hard negative mining through audio re-synthesis and proactive batch sampling t...

  3. [3]

    The encoder architecture, projection dimension, pooling strategy, Stage 2 classifier, optimizer, and augmentation policy are fixed across all experiments

    Method. We study supervised contrastive learning for deepfake speech detection in a controlled two-stage XLS-R pipeline. The encoder architecture, projection dimension, pooling strategy, Stage 2 classifier, optimizer, and augmentation policy are fixed across all experiments. We vary only two factors in Stage 1: the similarity function used inside SupCon (S...

  4. [4]

    Datasets and evaluation protocol. We train all models on the ASVspoof 2019 Logical Access (LA) train split and select checkpoints using the official dev split [3]

    Experimental Setup. 4.1. Datasets and evaluation protocol. We train all models on the ASVspoof 2019 Logical Access (LA) train split and select checkpoints using the official dev split [3]. No target-domain data from ASVspoof 2021 or ITW is used during training or model selection. In-domain performance is reported on ASVspoof 2019 LA eval. To assess cross-data...

  5. [5]

    Relative to the BCE baseline (pooled EER 7.27), SupCon improves cross-dataset performance when τ is tuned, but the optimal τ differs by similarity

    Results. To isolate the effect of the similarity geometry and the temperature τ, we first perform a temperature sweep for cosine- and geodesic-based SupCon (Table 1). Relative to the BCE baseline (pooled EER 7.27), SupCon improves cross-dataset performance when τ is tuned, but the optimal τ differs by similarity. For cosine, τ = 0.30 yields the best pooled ...

  6. [6]

    Discussion. 6.1. Effect of temperature and similarity choice. Table 1 shows that the optimal temperature differs substantially between the two similarity functions: cosine peaks at τ = 0.30 while geodesic peaks at τ = 0.07, a fourfold difference. This is consistent with the gradient dynamics argument in Section 3.3: geodesic's linear dependence on θij makes the l...

  7. [7]

    Conclusion and Future Work. We presented a controlled study of two SupCon design choices, similarity function and negative scaling, for cross-dataset deepfake audio detection with XLS-R. Our results show that similarity choice and temperature are coupled: geodesic similarity achieves strong OOD performance at a much lower temperature than cosine, con...

  8. [8]

    ADD 2022: the first audio deep synthesis detection challenge,

    J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, X. Zhang, Y. Bai, C. Fan, S. Liang, S. Wang, S. Zhang, X. Yan, L. Xu, Z. Wen, H. Li, Z. Lian, and B. Liu, “ADD 2022: the first audio deep synthesis detection challenge,” 2024. [Online]. Available: https://arxiv.org/abs/2202.08433

  9. [9]

    Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,

    H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” 2022. [Online]. Available: https://arxiv.org/abs/2202.12233

  10. [10]

    ASVspoof 2019: Future horizons in spoofed and fake audio detection,

    M. Todisco et al., “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” in Interspeech, 2019

  11. [11]

    ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection,

    J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans et al., “ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection,” in ASVspoof 2021 Workshop - Automatic Speaker Verification and Spoofing Countermeasures Challenge, 2021

  12. [12]

    Is audio spoof detection robust to laundering attacks?

    H. Ali, S. Subramani, S. Sudhir, R. Varahamurthy, and H. Malik, “Is audio spoof detection robust to laundering attacks?” in Proceedings of the 2024 ACM Workshop on Information Hiding and Multimedia Security, 2024, pp. 283–288

  13. [13]

    A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

    H. Ali, N. S. Adupa, S. Subramani, and H. Malik, “A SUPERB-style benchmark of self-supervised speech models for audio deepfake detection,” 2026. [Online]. Available: https://arxiv.org/abs/2603.01482

  14. [14]

    Spoofed speech detection with a focus on speaker embedding

    H. M. Tran, D. Guennec, P. Martin, A. Sini, D. Lolive, A. Delhay, and P.-F. Marteau, “Spoofed speech detection with a focus on speaker embedding,” in Interspeech, 2024

  15. [15]

    Trident of Poseidon: A generalized approach for detecting deepfake voices,

    T.-P. Doan, H. Dinh-Xuan, T. Ryu, I. Kim, W. Lee, K. Hong, and S. Jung, “Trident of Poseidon: A generalized approach for detecting deepfake voices,” in Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 2222–2235. [Online]. Available: http...

  16. [16]

    Balance, multiple augmentation, and re-synthesis: A triad training strategy for enhanced audio deepfake detection,

    P. Doan, L. Nguyen Vu, K. Hong, and S. Jung, “Balance, multiple augmentation, and re-synthesis: A triad training strategy for enhanced audio deepfake detection,” Sep. 2024, pp. 2105–2109

  17. [17]

    CLAD: Robust audio deepfake detection against manipulation attacks with contrastive learning,

    H. Wu, J. Chen, R. Du, C. Wu, K. He, X. Shang, H. Ren, and G. Xu, “CLAD: Robust audio deepfake detection against manipulation attacks with contrastive learning,” 2024. [Online]. Available: https://arxiv.org/abs/2404.15854

  18. [18]

    Supervised contrastive learning,

    P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” 2021. [Online]. Available: https://arxiv.org/abs/2004.11362

  19. [19]

    A simple framework for contrastive learning of visual representations,

    T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” 2020. [Online]. Available: https://arxiv.org/abs/2002.05709

  21. [21]

    Momentum contrast for unsupervised visual representation learning

    K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” 2020. [Online]. Available: https://arxiv.org/abs/1911.05722

  22. [22]

    Representation Learning with Contrastive Predictive Coding

    A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” 2019. [Online]. Available: https://arxiv.org/abs/1807.03748

  23. [23]

    Contrastive learning with hard negative samples,

    J. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka, “Contrastive learning with hard negative samples,” 2021. [Online]. Available: https://arxiv.org/abs/2010.04592

  24. [24]

    Cross-batch memory for embedding learning,

    X. Wang, H. Zhang, W. Huang, and M. R. Scott, “Cross-batch memory for embedding learning,” 2020. [Online]. Available: https://arxiv.org/abs/1912.06798

  25. [25]

    Does Audio Deepfake Detection Generalize?

    N. M. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger, “Does Audio Deepfake Detection Generalize?” Apr. 2022, arXiv:2203.16263 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2203.16263

  26. [26]

    A gradient accumulation method for dense retriever under memory constraint,

    J. Kim, Y. Lee, and P. Kang, “A gradient accumulation method for dense retriever under memory constraint,” 2024. [Online]. Available: https://arxiv.org/abs/2406.12356

  27. [27]

    RawBoost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,

    H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “RawBoost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6382–6386

  28. [28]

    XLS-R: Self-supervised cross-lingual speech representation learning at scale,

    A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” 2021. [Online]. Available: https://arxiv.org/abs/2111.09296