
arxiv: 2606.10758 · v1 · pith:BPGIB6WGnew · submitted 2026-06-09 · 📡 eess.AS

Anchoring the Unknown: Open-Set Model Attribution via Proxy-Anchor Learning

Pith reviewed 2026-06-27 11:46 UTC · model grok-4.3

classification 📡 eess.AS
keywords TTS source attribution · open-set detection · Proxy-Anchor loss · Wav2Vec2-BERT · audio forensics · OOD detection · metric learning · MLAAD dataset

The pith

Proxy-Anchor metric learning attributes audio to 110 known TTS architecture classes at 99.76 percent accuracy while flagging unseen systems at an FPR@95 as low as 2.04 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a metric learning approach can trace which text-to-speech system produced a given audio clip and can flag clips from systems it has never seen. It builds an embedding space from Wav2Vec2-BERT features using the Proxy-Anchor loss, after first merging versions of the same TTS architecture into single training classes. If this holds, audio forensics gains the ability to both name the source among known generators and reliably mark unknown ones as out-of-distribution. The results are reported on MLAAD v9, a dataset of 140 systems across 51 languages, with nearly doubled OOD accuracy on the official splits of the earlier MLAAD v5.

Core claim

The central claim is that Proxy-Anchor loss applied to Wav2Vec2-BERT embeddings, combined with architecture merging for class design and post-hoc OOD scoring, forms an effective framework for closed-set TTS source attribution and open-set detection of unseen systems, demonstrated by 99.76 percent accuracy on 110 in-distribution classes and an FPR@95 of 2.04 percent on the MLAAD v9 dataset, plus nearly doubled OOD accuracy on MLAAD v5 splits.

What carries the argument

Proxy-Anchor loss operating on Wav2Vec2-BERT embeddings, with architecture merging to define unified classes and post-hoc scoring for out-of-distribution detection.
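To make the loss concrete, here is a minimal pure-Python sketch of the Proxy-Anchor objective as published by Kim et al. (2020), computed for a single batch. It is not the authors' implementation: the function names are hypothetical, the inputs are plain lists standing in for Wav2Vec2-BERT embeddings, and the default scale `alpha=32` and margin `delta=0.1` are the values commonly used with this loss, not values confirmed for this paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, delta=0.1):
    """Proxy-Anchor loss for one batch.

    embeddings: list of vectors; labels: class index per vector;
    proxies: one learnable vector per class (fixed here for illustration).
    alpha (scale) and delta (margin) are assumed defaults.
    """
    pos_terms = []   # one term per proxy that has positives in the batch
    neg_total = 0.0  # summed over all proxies
    for c, proxy in enumerate(proxies):
        pos_sum, neg_sum, has_pos = 0.0, 0.0, False
        for x, y in zip(embeddings, labels):
            s = cosine(x, proxy)
            if y == c:
                # Pull same-class samples toward the proxy.
                pos_sum += math.exp(-alpha * (s - delta))
                has_pos = True
            else:
                # Push other-class samples away from the proxy.
                neg_sum += math.exp(alpha * (s + delta))
        if has_pos:
            pos_terms.append(math.log1p(pos_sum))
        neg_total += math.log1p(neg_sum)
    pos_loss = sum(pos_terms) / len(pos_terms) if pos_terms else 0.0
    neg_loss = neg_total / len(proxies)
    return pos_loss + neg_loss
```

Each proxy acts as an anchor for its whole class, so the loss is lowest when every embedding sits close to its own class proxy and far from all others. In training, the proxies would be learned jointly with the embedding network; this sketch only evaluates the loss value.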

If this is right

  • Source attribution becomes feasible for more than 100 distinct TTS architectures in closed-set conditions.
  • Unknown TTS systems can be flagged without retraining when they appear in new audio.
  • The same embedding space supports both identification of known sources and rejection of unknown ones.
  • Architecture merging lowers inter-class confusion across multilingual data.
  • Performance gains on prior dataset splits indicate the framework improves on existing methods for open-set detection.
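The idea that one embedding space serves both identification and rejection can be sketched as a single decision rule. The snippet below is an illustration under stated assumptions, not the paper's scoring function, which this page does not specify: it assumes the OOD score is the maximum cosine similarity to any class proxy, and `attribute_or_reject`, the class names, and the threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attribute_or_reject(embedding, proxies, class_names, threshold):
    """Return (label, score): the nearest known class, or "unknown".

    Assumption: the post-hoc OOD score is the highest cosine similarity
    between the embedding and any class proxy. A score below `threshold`
    is treated as an unseen system.
    """
    sims = [cosine(embedding, p) for p in proxies]
    best = max(range(len(sims)), key=sims.__getitem__)
    if sims[best] < threshold:
        return "unknown", sims[best]
    return class_names[best], sims[best]
```

Because rejection is applied after training, the threshold can be tuned on held-out data without retraining the embedding model, which is what makes flagging new systems cheap.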

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loss and embedding pipeline could be tested on other audio generation tasks such as voice conversion or music synthesis.
  • Adding more diverse or adversarial TTS examples would test whether the separation between known and unknown remains stable at scale.
  • The architecture merging step suggests a general preprocessing tactic for any domain where multiple versions of the same generator exist.
  • Real-time forensic pipelines could incorporate the OOD score as an alert threshold before attempting attribution.

Load-bearing premise

Merging TTS versions into architecture-based classes reduces confusion without discarding useful differences, and the learned embeddings plus scoring can separate known systems from truly unseen ones.
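The merging step can be pictured as a mapping from system identifiers to architecture classes. The sketch below is purely illustrative: the paper's actual merging criteria are not given on this page, so the version-suffix rule, the function names, and the example system names are all assumptions.

```python
import re
from collections import defaultdict

# Hypothetical rule: a trailing "-v2", "_v1.1", or "-3" marks a version.
_VERSION_SUFFIX = re.compile(r"[-_ ]v?\d+(\.\d+)*$", re.IGNORECASE)


def architecture_of(system_name):
    """Strip an assumed trailing version suffix and normalise case."""
    return _VERSION_SUFFIX.sub("", system_name.strip()).lower()


def merge_by_architecture(system_names):
    """Group system names into architecture-level classes."""
    groups = defaultdict(list)
    for name in system_names:
        groups[architecture_of(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Made-up system names, not entries from MLAAD.
    systems = ["Alpha-TTS-v1", "alpha-tts-v2", "beta-voice"]
    for arch, members in merge_by_architecture(systems).items():
        print(arch, members)
```

A rule like this trades resolution for separability: versions that differ only slightly stop competing as separate classes, but any real difference between them is no longer visible to the classifier, which is exactly the premise under question.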

What would settle it

Testing the trained model on audio from a fresh set of TTS systems absent from both training and the MLAAD collection, then measuring whether the FPR@95 for out-of-distribution detection stays under 5 percent.
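For readers who want to run such a check, here is a minimal sketch of FPR@95 under the common OOD-detection convention, which this page does not confirm is the paper's exact definition: in-distribution samples are the positive class, higher scores mean "more in-distribution", the threshold is set so that 95 percent of in-distribution samples are accepted, and the metric is the share of unseen-system samples that are still accepted. The function name is hypothetical.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate on OOD samples at a fixed ID true positive rate.

    Assumed convention: higher score = more likely in-distribution,
    and a sample is accepted as known when score >= threshold.
    """
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted to reach the target TPR.
    # The small epsilon guards against floating-point round-up.
    k = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    false_pos = sum(1 for s in ood_scores if s >= threshold)
    return false_pos / len(ood_scores)
```

With scores from the trained model, `id_scores` would come from held-out clips of known systems and `ood_scores` from the fresh, unseen systems; the test described above passes if the returned value stays below 0.05.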

Figures

Figures reproduced from arXiv: 2606.10758 by Cristian-Teodor Neamtu, Dan Oneata, Dragos Burileanu, Horia Cucu, Serban Mihalache, Stefan Smeu.

Figure 1: t-distributed Stochastic Neighbor Embedding (t-SNE) visualization of embedding spaces.

Original abstract

The proliferation of text-to-speech (TTS) systems capable of generating realistic synthetic speech poses growing challenges for audio forensics. While binary deepfake detection has received considerable attention, source tracing (i.e., identifying which TTS system produced a given audio sample) remains underexplored, particularly in open-set scenarios where unknown systems may be encountered. We propose a metric learning framework based on the Proxy-Anchor loss function that operates on Wav2Vec2-BERT embeddings to learn a discriminative embedding space for TTS source attribution and out-of-distribution (OOD) detection of unseen systems. We evaluate it on the MLAAD v9 dataset spanning 140 TTS systems across 51 languages, and introduce an architecture merging strategy that groups TTS system versions into unified classes, reducing inter-class confusion. Our system achieves 99.76% accuracy on 110 in-distribution classes and a False Positive Rate (FPR@95) as low as 2.04% for OOD detection. Also, for a fair comparison against the current state of the art, we further evaluate it on the MLAAD v5 official dataset splits, improving the OOD accuracy by almost doubling it. These results demonstrate that Proxy-Anchor metric learning, combined with architecture-aware class design and post-hoc OOD scoring, provides an effective framework for forensic TTS source tracing in both closed-set and open-set settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes a Proxy-Anchor metric learning framework operating on Wav2Vec2-BERT embeddings for closed-set TTS source attribution and open-set OOD detection of unseen systems. It introduces an architecture merging strategy that consolidates 140 TTS systems into 110 classes on the MLAAD v9 dataset, reporting 99.76% closed-set accuracy and 2.04% FPR@95 for OOD detection, along with nearly doubled OOD accuracy on the MLAAD v5 official splits relative to prior work.

Significance. If the central empirical claims hold after addressing the noted gaps, the work would provide a concrete, reproducible advance in open-set forensic attribution of synthetic speech by showing that proxy-anchor embeddings combined with architecture-aware class design can achieve high closed-set accuracy while maintaining low false-positive rates on unseen TTS systems. The use of public datasets and official splits is a positive factor for verifiability.

major comments (2)
  1. [Abstract and §4] The headline performance figures (99.76% accuracy on 110 classes, FPR@95 = 2.04%) rest on the architecture merging step, yet no ablation isolating the effect of merging (e.g., 140-class vs. 110-class embedding geometry or intra- vs. inter-architecture variance in the Wav2Vec2-BERT space) is reported; without this check the claim that merging reduces confusion without discarding discriminative cues remains unverified and load-bearing for both the closed-set and OOD results.
  2. [§4.2 or equivalent experimental subsection] The reported doubling of OOD accuracy on MLAAD v5 splits is presented without the exact baseline method name, its numerical score, the precise OOD metric definition, or statistical significance testing across data splits; this prevents direct assessment of whether the improvement is attributable to the proposed framework rather than post-hoc tuning of the free parameters (Proxy-Anchor margin/scale and merging thresholds).
minor comments (3)
  1. [Abstract] The statement 'improving the OOD accuracy by almost doubling it' should be replaced by an explicit numerical comparison (baseline value and proposed value) for precision.
  2. Throughout: No error bars, number of random seeds, or cross-validation details are given for the reported accuracy and FPR figures. Reporting them is standard for empirical ML claims, even if they are not load-bearing for the core result.
  3. [Method] The precise definition of the post-hoc OOD scoring function (e.g., distance threshold or density estimator) should be stated explicitly with an equation reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback highlights opportunities to strengthen the experimental validation of the architecture merging strategy and the OOD comparison on MLAAD v5. We address each point below and commit to revisions that will make the claims more verifiable without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract and §4] The headline performance figures (99.76% accuracy on 110 classes, FPR@95 = 2.04%) rest on the architecture merging step, yet no ablation isolating the effect of merging (e.g., 140-class vs. 110-class embedding geometry or intra- vs. inter-architecture variance in the Wav2Vec2-BERT space) is reported; without this check the claim that merging reduces confusion without discarding discriminative cues remains unverified and load-bearing for both the closed-set and OOD results.

    Authors: We agree that an ablation isolating the merging effect would make the contribution of this design choice more transparent. In the revised manuscript we will add a targeted analysis comparing the 140-class and 110-class configurations on the same Wav2Vec2-BERT embeddings, reporting intra- versus inter-architecture variance and nearest-neighbor confusion matrices. This will directly verify that merging reduces inter-class overlap while retaining discriminative information. revision: yes

  2. Referee: [§4.2 or equivalent experimental subsection] The reported doubling of OOD accuracy on MLAAD v5 splits is presented without the exact baseline method name, its numerical score, the precise OOD metric definition, or statistical significance testing across data splits; this prevents direct assessment of whether the improvement is attributable to the proposed framework rather than post-hoc tuning of the free parameters (Proxy-Anchor margin/scale and merging thresholds).

    Authors: We will expand §4.2 to name the exact baseline, quote its published OOD accuracy, specify the metric definition used for the doubling claim, and add results with standard deviation and significance testing across the official MLAAD v5 splits. These additions will allow readers to confirm that the observed improvement is attributable to the Proxy-Anchor framework rather than parameter selection. revision: yes
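The merging ablation promised in the first response could start from a comparison as simple as the one sketched here. This is an assumed, simplified stand-in for the "intra- versus inter-architecture variance" analysis, not the authors' procedure: it reports mean pairwise cosine similarity within and across classes, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_inter_similarity(embeddings, labels):
    """Mean pairwise cosine similarity within classes and across classes."""
    intra, inter = [], []
    for (x, lx), (y, ly) in combinations(zip(embeddings, labels), 2):
        (intra if lx == ly else inter).append(cosine(x, y))

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(intra), mean(inter)
```

Running it twice on the same embeddings, once with the 140 system-level labels and once with the 110 merged labels, would show whether merging widens the gap between the two averages or merely hides confusable pairs inside one class.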

Circularity Check

0 steps flagged

No circularity: empirical ML pipeline on public dataset

Full rationale

The paper reports closed-set accuracy (99.76% on 110 merged classes) and OOD FPR@95 (2.04%) from training a Proxy-Anchor model on Wav2Vec2-BERT embeddings of the MLAAD v9 dataset, plus a comparison on MLAAD v5 splits. No equations, derivations, or fitted-parameter predictions appear in the provided text; architecture merging is presented as an empirical preprocessing choice whose impact is measured directly on held-out data rather than defined into the metrics. Proxy-Anchor loss is a standard external reference, not a self-citation whose uniqueness theorem bears the central claim. The reported numbers are therefore falsifiable experimental outcomes, not quantities forced by construction from the paper's own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Standard supervised metric-learning setup relying on pre-trained embeddings and loss hyperparameters tuned to the target dataset; no new physical entities introduced.

free parameters (2)
  • Proxy-Anchor loss margin and scale parameters
    Typical hyperparameters in metric learning that are fitted or chosen on validation data to achieve the reported separation.
  • Architecture merging thresholds for grouping TTS versions
    Ad-hoc grouping rule introduced to reduce inter-class confusion; exact criteria not specified in abstract.
axioms (1)
  • Domain assumption: Wav2Vec2-BERT embeddings contain sufficient information to discriminate TTS system identity
    The framework presupposes that features from this particular pre-trained model are suitable for the attribution task.

pith-pipeline@v0.9.1-grok · 5800 in / 1274 out tokens · 36401 ms · 2026-06-27T11:46:32.302787+00:00 · methodology

