Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning

Cristian-Teodor Neamtu; Dan Oneata; Dragos Burileanu; Horia Cucu; Serban Mihalache; Stefan Smeu

arxiv: 2606.10758 · v1 · pith:BPGIB6WGnew · submitted 2026-06-09 · 📡 eess.AS

Pith reviewed 2026-06-27 11:46 UTC · model grok-4.3

classification 📡 eess.AS

keywords TTS source attribution · open-set detection · Proxy-Anchor loss · Wav2Vec2-BERT · audio forensics · OOD detection · metric learning · MLAAD dataset

The pith

Proxy-Anchor metric learning attributes audio to 110 known TTS classes at 99.76 percent accuracy while flagging unseen systems with an FPR@95 as low as 2.04 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a metric learning approach can trace which text-to-speech system produced a given audio clip and flag clips from systems it has never seen. It builds an embedding space from Wav2Vec2-BERT features using the Proxy-Anchor loss, after first merging versions of the same TTS architecture into single training classes. If this holds, audio forensics gains the ability to both name the source among known generators and reliably mark unknown ones as out-of-distribution. The results are reported on MLAAD v9, a dataset of 140 systems across 51 languages, with OOD accuracy nearly doubled on the official MLAAD v5 splits.

Core claim

The central claim is that Proxy-Anchor loss applied to Wav2Vec2-BERT embeddings, combined with architecture merging for class design and post-hoc OOD scoring, forms an effective framework for closed-set TTS source attribution and open-set detection of unseen systems. The evidence is 99.76 percent accuracy on 110 in-distribution classes and an FPR@95 as low as 2.04 percent on the MLAAD v9 dataset, plus nearly doubled OOD accuracy on the MLAAD v5 splits.

What carries the argument

Proxy-Anchor loss operating on Wav2Vec2-BERT embeddings, with architecture merging to define unified classes and post-hoc scoring for out-of-distribution detection.
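
As a rough illustration of the loss named above, the sketch below computes the Proxy-Anchor objective of Kim et al. for a single batch in plain Python. It is not the authors' code: the function names, the list-based vectors, and the defaults `alpha=32` (scale) and `delta=0.1` (margin) are assumptions for illustration, and a real pipeline would train learnable proxies over pooled Wav2Vec2-BERT embeddings in a tensor library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, delta=0.1):
    """Proxy-Anchor loss for one batch (illustrative, not the paper's code).

    embeddings: list of vectors, one per sample
    labels:     class index of each sample
    proxies:    one vector per class (learnable in a real model)
    """
    pos_term = 0.0
    neg_term = 0.0
    proxies_with_positives = 0
    for c, proxy in enumerate(proxies):
        sims = [cosine(x, proxy) for x in embeddings]
        pos = [s for s, y in zip(sims, labels) if y == c]
        neg = [s for s, y in zip(sims, labels) if y != c]
        if pos:
            # Pull same-class samples toward the proxy.
            proxies_with_positives += 1
            pos_term += math.log1p(sum(math.exp(-alpha * (s - delta)) for s in pos))
        # Push other-class samples away from the proxy.
        neg_term += math.log1p(sum(math.exp(alpha * (s + delta)) for s in neg))
    return pos_term / max(proxies_with_positives, 1) + neg_term / len(proxies)
```

In this sketch, a batch whose embeddings sit on their own class proxies gives a lower loss than the same batch with its labels swapped, which is the behavior the loss is meant to reward.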

If this is right

Source attribution becomes feasible for more than 100 distinct TTS architectures in closed-set conditions.
Unknown TTS systems can be flagged without retraining when they appear in new audio.
The same embedding space supports both identification of known sources and rejection of unknown ones.
Architecture merging lowers inter-class confusion across multilingual data.
Performance gains on prior dataset splits indicate the framework improves on existing methods for open-set detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same loss and embedding pipeline could be tested on other audio generation tasks such as voice conversion or music synthesis.
Adding more diverse or adversarial TTS examples would test whether the separation between known and unknown remains stable at scale.
The architecture merging step suggests a general preprocessing tactic for any domain where multiple versions of the same generator exist.
Real-time forensic pipelines could incorporate the OOD score as an alert threshold before attempting attribution.
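
One way to picture the gate-then-attribute idea from the last item is a nearest-proxy rule with a rejection threshold. The sketch assumes the OOD score is the maximum cosine similarity to any class proxy. The text here does not state the paper's scoring function, so that choice, the function names, and the threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def attribute_or_reject(embedding, proxies, class_names, threshold):
    """Return (label, score); label is None when the clip is flagged as OOD.

    The score is the highest cosine similarity to any class proxy
    (an assumed scoring rule, not necessarily the paper's).
    """
    sims = [cosine(embedding, p) for p in proxies]
    best = max(range(len(sims)), key=lambda i: sims[i])
    if sims[best] < threshold:
        return None, sims[best]  # raise the alert before any attribution
    return class_names[best], sims[best]
```

The threshold would normally be set on held-out in-distribution data, for example at the operating point where 95 percent of known-system clips are accepted.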

Load-bearing premise

Merging TTS versions into architecture-based classes reduces confusion without discarding useful differences, and the learned embeddings plus scoring can separate known systems from truly unseen ones.
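
To make the merging premise concrete, here is a minimal sketch of one possible grouping rule. The abstract does not give the actual criteria, so the version-suffix pattern and the system names used with it are purely hypothetical, and the real strategy may rely on manual curation instead.

```python
import re
from collections import defaultdict

# Assumed convention: versions are marked by a trailing "-v2", "_v1.1", " v3", etc.
VERSION_SUFFIX = re.compile(r"[-_ ]v\d+(\.\d+)*$")


def architecture_of(system_name):
    """Map a system name to an architecture-level class label."""
    return VERSION_SUFFIX.sub("", system_name.strip().lower())


def merge_by_architecture(system_names):
    """Group system names that share an architecture label."""
    groups = defaultdict(list)
    for name in system_names:
        groups[architecture_of(name)].append(name)
    return dict(groups)
```

Under this rule, two made-up entries such as `demo_tts-v1` and `demo_tts-v2.1` would share the class `demo_tts`, while a name that merely ends in a digit is left alone. Whether such grouping discards useful differences is exactly what the premise leaves open.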

What would settle it

Testing the trained model on audio from a fresh set of TTS systems absent from both training and the MLAAD collection, then measuring whether the OOD-detection FPR@95 stays under 5 percent.
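
For readers unfamiliar with the metric, the sketch below shows one common way to compute FPR@95. It assumes that higher scores mean "more in-distribution", that in-distribution clips are the positive class, and that the threshold is the one retaining 95 percent of them. The paper's exact definition is not given in this text, so treat these conventions and the function name as assumptions.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as known at the threshold
    that keeps `tpr` of the in-distribution samples.

    Higher score = more in-distribution (assumed convention).
    """
    ranked = sorted(id_scores, reverse=True)
    # Number of in-distribution samples that must score at or above the threshold.
    k = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

Running this on scores from systems outside MLAAD, with the threshold fixed on known systems, would be one direct form of the check described above.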

Figures

Figures reproduced from arXiv: 2606.10758 by Cristian-Teodor Neamtu, Dan Oneata, Dragos Burileanu, Horia Cucu, Serban Mihalache, Stefan Smeu.

Original abstract

The proliferation of text-to-speech (TTS) systems capable of generating realistic synthetic speech poses growing challenges for audio forensics. While binary deepfake detection has received considerable attention, source tracing (i.e., identifying which TTS system produced a given audio sample) remains underexplored, particularly in open-set scenarios where unknown systems may be encountered. We propose a metric learning framework based on the Proxy-Anchor loss function that operates on Wav2Vec2-BERT embeddings to learn a discriminative embedding space for TTS source attribution and out-of-distribution (OOD) detection of unseen systems. We evaluate it on the MLAAD v9 dataset spanning 140 TTS systems across 51 languages, and introduce an architecture merging strategy that groups TTS system versions into unified classes, reducing inter-class confusion. Our system achieves 99.76% accuracy on 110 in-distribution classes and a False Positive Rate (FPR@95) as low as 2.04% for OOD detection. Also, for a fair comparison against the current state of the art, we further evaluate it on the MLAAD v5 official dataset splits, improving the OOD accuracy by almost doubling it. These results demonstrate that Proxy-Anchor metric learning, combined with architecture-aware class design and post-hoc OOD scoring, provides an effective framework for forensic TTS source tracing in both closed-set and open-set settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Applies Proxy-Anchor loss to Wav2Vec2-BERT embeddings for open-set TTS attribution and reports strong numbers, but the architecture-merging step lacks supporting ablations.

The paper's core move is to take the established Proxy-Anchor loss, run it on Wav2Vec2-BERT embeddings, and add a preprocessing step that collapses TTS system versions into architecture-level classes. On the MLAAD v9 set this yields 99.76% closed-set accuracy over 110 classes and an FPR@95 of 2.04% for OOD detection; the same pipeline also improves OOD accuracy on the MLAAD v5 splits. That combination and the scale of the evaluation are new relative to the cited prior work.

The architecture-merging step is presented as reducing inter-class confusion without much loss of signal. No ablation is described that measures intra- versus inter-architecture variance in the embedding space, nor is there a direct comparison of performance on the original 140-system partition versus the merged 110-class version. Without those checks it is difficult to know whether the reported gains come from the loss and embeddings or from the label-space simplification itself.
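
The missing variance check could look roughly like the following sketch, which measures how tightly samples cluster around their class centroid versus how far the centroids spread. The function names and the squared-Euclidean choice are assumptions. Running it once with per-system labels and once with merged labels on the same embeddings would give the comparison asked for.

```python
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def intra_inter_variance(embeddings, labels):
    """Return (intra, inter).

    intra: mean squared distance of samples to their own class centroid
    inter: mean squared distance of class centroids to their overall mean
    """
    by_class = defaultdict(list)
    for x, y in zip(embeddings, labels):
        by_class[y].append(x)
    centroids = {y: centroid(xs) for y, xs in by_class.items()}
    intra = sum(sq_dist(x, centroids[y]) for x, y in zip(embeddings, labels)) / len(embeddings)
    grand = centroid(list(centroids.values()))
    inter = sum(sq_dist(c, grand) for c in centroids.values()) / len(centroids)
    return intra, inter
```

A low intra value alongside a high inter value after merging would support the claim that merging reduces confusion. A sharp rise in intra would suggest that distinct systems were folded together.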

The abstract gives no information on statistical significance, error bars, data splits, or the exact baselines used for the “almost doubling” claim. Those omissions make the numerical results hard to interpret at face value.

The work is aimed at audio-forensics researchers who already track metric-learning methods and need a concrete open-set baseline on public TTS data. It is coherent on its own terms and engages the relevant literature, so it is worth sending to referees even though the merging claim and experimental details will need tightening.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes a Proxy-Anchor metric learning framework operating on Wav2Vec2-BERT embeddings for closed-set TTS source attribution and open-set OOD detection of unseen systems. It introduces an architecture merging strategy that groups TTS system versions into unified classes and, on the MLAAD v9 dataset (140 TTS systems), reports 99.76% closed-set accuracy on 110 in-distribution classes and 2.04% FPR@95 for OOD detection, along with nearly doubled OOD accuracy on the MLAAD v5 official splits relative to prior work.

Significance. If the central empirical claims hold after addressing the noted gaps, the work would provide a concrete, reproducible advance in open-set forensic attribution of synthetic speech by showing that proxy-anchor embeddings combined with architecture-aware class design can achieve high closed-set accuracy while maintaining low false-positive rates on unseen TTS systems. The use of public datasets and official splits is a positive factor for verifiability.

major comments (2)

[Abstract and §4] The headline performance figures (99.76% accuracy on 110 classes, FPR@95 = 2.04%) rest on the architecture merging step, yet no ablation isolating the effect of merging (e.g., 140-class vs. 110-class embedding geometry or intra- vs. inter-architecture variance in the Wav2Vec2-BERT space) is reported. Without this check, the claim that merging reduces confusion without discarding discriminative cues remains unverified, even though it is load-bearing for both the closed-set and OOD results.
[§4.2 or equivalent experimental subsection] The reported near-doubling of OOD accuracy on MLAAD v5 splits is presented without the exact baseline method name, its numerical score, the precise OOD metric definition, or statistical significance testing across data splits. This prevents direct assessment of whether the improvement is attributable to the proposed framework rather than post-hoc tuning of the free parameters (Proxy-Anchor margin/scale and merging thresholds).

minor comments (3)

[Abstract] The statement 'improving the OOD accuracy by almost doubling it' should be replaced by an explicit numerical comparison (baseline value and proposed value) for precision.
[Throughout] No error bars, number of random seeds, or cross-validation details are mentioned for the reported accuracy and FPR figures. Reporting these is standard for empirical ML claims, even if they are not load-bearing for the core argument.
[Method] The precise definition of the post-hoc OOD scoring function (e.g., distance threshold or density estimator) should be stated explicitly, with an equation reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback highlights opportunities to strengthen the experimental validation of the architecture merging strategy and the OOD comparison on MLAAD v5. We address each point below and commit to revisions that will make the claims more verifiable without altering the core contributions.

Referee: [Abstract and §4] The headline performance figures (99.76% accuracy on 110 classes, FPR@95 = 2.04%) rest on the architecture merging step, yet no ablation isolating the effect of merging (e.g., 140-class vs. 110-class embedding geometry or intra- vs. inter-architecture variance in the Wav2Vec2-BERT space) is reported. Without this check, the claim that merging reduces confusion without discarding discriminative cues remains unverified, even though it is load-bearing for both the closed-set and OOD results.

Authors: We agree that an ablation isolating the merging effect would make the contribution of this design choice more transparent. In the revised manuscript we will add a targeted analysis comparing the 140-class and 110-class configurations on the same Wav2Vec2-BERT embeddings, reporting intra- versus inter-architecture variance and nearest-neighbor confusion matrices. This will directly test whether merging reduces inter-class overlap while retaining discriminative information.
revision: yes

Referee: [§4.2 or equivalent experimental subsection] The reported near-doubling of OOD accuracy on MLAAD v5 splits is presented without the exact baseline method name, its numerical score, the precise OOD metric definition, or statistical significance testing across data splits. This prevents direct assessment of whether the improvement is attributable to the proposed framework rather than post-hoc tuning of the free parameters (Proxy-Anchor margin/scale and merging thresholds).

Authors: We will expand §4.2 to name the exact baseline, quote its published OOD accuracy, specify the metric definition used for the doubling claim, and add results with standard deviation and significance testing across the official MLAAD v5 splits. These additions will allow readers to confirm that the observed improvement is attributable to the Proxy-Anchor framework rather than parameter selection.
revision: yes
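
The promised significance testing could take several forms. The sketch below shows one of them, an exact paired sign-flip permutation test on per-split score differences (proposed minus baseline). The function name is hypothetical and no per-split numbers appear in this text, so the inputs would have to come from the revised experiments.

```python
from itertools import product


def paired_permutation_pvalue(diffs):
    """Two-sided exact sign-flip test on paired differences.

    diffs: per-split (proposed - baseline) scores.
    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
        total += 1
    return hits / total
```

With only a handful of official splits the smallest attainable p-value is coarse (2 / 2**n), so reporting the per-split scores themselves alongside the test would remain useful.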

Circularity Check

0 steps flagged

No circularity: empirical ML pipeline on public dataset

The paper reports closed-set accuracy (99.76% on 110 merged classes) and OOD FPR@95 (2.04%) from training a Proxy-Anchor model on Wav2Vec2-BERT embeddings of the MLAAD v9 dataset, plus a comparison on MLAAD v5 splits. No equations, derivations, or fitted-parameter predictions appear in the provided text; architecture merging is presented as an empirical preprocessing choice whose impact is measured directly on held-out data rather than defined into the metrics. Proxy-Anchor loss is a standard external reference, not a self-citation whose uniqueness theorem bears the central claim. The reported numbers are therefore falsifiable experimental outcomes, not quantities forced by construction from the paper's own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Standard supervised metric-learning setup relying on pre-trained embeddings and loss hyperparameters tuned to the target dataset; no new physical entities introduced.

free parameters (2)

Proxy-Anchor loss margin and scale parameters
Typical hyperparameters in metric learning that are fitted or chosen on validation data to achieve the reported separation.
Architecture merging thresholds for grouping TTS versions
Ad-hoc grouping rule introduced to reduce inter-class confusion; exact criteria not specified in the abstract.

axioms (1)

domain assumption · Wav2Vec2-BERT embeddings contain sufficient information to discriminate TTS system identity
The framework presupposes that features from this particular pre-trained model are suitable for the attribution task.

pith-pipeline@v0.9.1-grok · 5800 in / 1274 out tokens · 36401 ms · 2026-06-27T11:46:32.302787+00:00 · methodology
