arxiv: 2606.11666 · v1 · pith:M5SXBU3Rnew · submitted 2026-06-10 · 💻 cs.SD

The Hidden Cost of Pairwise Verification in Synthetic Speech Source Tracing

Pith reviewed 2026-06-27 08:35 UTC · model grok-4.3

classification 💻 cs.SD
keywords: synthetic speech · source tracing · pairwise verification · global anchoring · equal error rate · embedding space · open-set identification

The pith

Global anchoring achieves lower error than pairwise verification for synthetic speech source tracing because it shapes embedding directions differently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether framing open-set source tracing as a verification problem justifies using pairwise metric-learning objectives from biometrics. Under matched backbones, data, and epoch budgets, with MLAAD as the in-domain set and STOPA as the out-of-domain set, global anchoring reaches 8.61% in-domain EER while pairwise variants reach 12-15% EER even after rival mining and XLS-R finetuning. The paper attributes this to pairwise training concentrating variance into fewer directions, which loses resolution among closely related generators. When the authors impose an artificial bottleneck on the global baseline, it remains competitive, and the k99 analysis of the embedding space supports the view that the objective itself, not dimensionality alone, drives the difference.
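To make the two families of objectives concrete, here is a minimal NumPy sketch. It assumes global anchoring means softmax cross-entropy over generator classes on a linear head, and that the pairwise objective is binary cross-entropy on a scalar score for an embedding pair. The function names, the scaled-cosine scorer (a stand-in for the paper's fusion module), and the toy inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def global_ce_loss(embeddings, labels, weight, bias):
    """Softmax cross-entropy over generator classes (global anchoring)."""
    logits = embeddings @ weight + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def pairwise_bce_loss(emb_a, emb_b, same_source, scale=1.0, offset=0.0):
    """Binary cross-entropy on a scaled cosine score for each pair.

    The scaled-cosine scorer is an assumed stand-in for a learned fusion module.
    """
    cos = (emb_a * emb_b).sum(axis=1) / (
        np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))
    z = scale * cos + offset
    # Stable BCE-with-logits: max(z, 0) - z * y + log(1 + exp(-|z|)).
    losses = np.maximum(z, 0) - z * same_source + np.log1p(np.exp(-np.abs(z)))
    return float(losses.mean())


rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))   # hypothetical pooled embeddings
y = np.array([0, 1, 2, 1])    # hypothetical generator labels
ce = global_ce_loss(h, y, np.zeros((8, 3)), np.zeros(3))
bce = pairwise_bce_loss(h, h, np.ones(4))
```

The global loss ties every embedding to a shared set of class anchors (the columns of `weight`), whereas the pairwise loss only sees relative similarity between two samples at a time.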

Core claim

Under matched backbones and a fixed data and epoch budget on MLAAD and STOPA, global anchoring yields lower in-domain error (8.61% EER) than pairwise variants (12-15% EER). Because pairwise objectives optimize similarity directly, they concentrate variance into fewer embedding directions, reducing resolution among closely related generators. Imposing a similar bottleneck on the globally supervised baseline leaves it competitive, and embedding-space analysis with k99 shows the gap arises from how the pairwise objective shapes the retained directions rather than from dimensionality reduction alone.
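The error figures above are equal error rates. As a minimal sketch of how an EER can be read off a set of verification scores, the following assumes that a higher score means "same generator"; the function name `equal_error_rate` and the toy scores are hypothetical and are not taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    Assumes a higher score means "same generator". Returns the average of
    the false-acceptance and false-rejection rates at the threshold where
    the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical, for illustration only).
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
```

In this toy example one target and one non-target trial overlap, so the two error rates meet at one error in four on each side.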

What carries the argument

The pairwise verification objective versus global anchoring, together with k99 analysis of retained variance directions in the embedding space.
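This page does not define k99. The sketch below assumes it is the smallest number of principal directions whose cumulative variance reaches 99% of the total, which is consistent with the cumulative variance analysis in Figure 2. The function name `k_at_variance` and the synthetic embeddings are illustrative assumptions.

```python
import numpy as np


def k_at_variance(embeddings, fraction=0.99):
    """Smallest number of principal directions whose cumulative variance
    reaches `fraction` of the total (assumed meaning of k99)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to
    # the variance captured along each principal direction.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    cumulative = np.cumsum(variances) / variances.sum()
    # Small tolerance guards against floating-point rounding at the boundary.
    return int(np.searchsorted(cumulative, fraction - 1e-12) + 1)


rng = np.random.default_rng(0)
# Hypothetical embeddings: one set spreads variance over 16 directions,
# the other concentrates it in 2 directions plus faint noise.
spread = rng.normal(size=(2000, 16))
collapsed = np.hstack([rng.normal(size=(2000, 2)),
                       1e-3 * rng.normal(size=(2000, 14))])
k_spread = k_at_variance(spread)
k_collapsed = k_at_variance(collapsed)
```

A lower value for the collapsed set reflects variance concentrated into fewer directions, which is the kind of pattern the paper associates with pairwise training.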

If this is right

  • Pairwise objectives reduce the ability to distinguish closely related synthetic speech generators even when advanced backbones and mining are used.
  • Global anchoring preserves more useful embedding directions under the same training constraints.
  • Imposing dimensionality bottlenecks on global models does not erase their advantage over pairwise ones.
  • The performance difference appears in in-domain results and is not explained by dimensionality reduction alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tasks that require fine distinctions among similar classes may benefit from global supervision over standard pairwise metric learning.
  • Similar objective comparisons could be run on other open-set attribution problems such as image or video generator tracing.
  • Hybrid objectives that combine global and pairwise terms might retain the strengths of both.
  • Varying the data or compute budget could test whether the observed gap shrinks under different optimization regimes.

Load-bearing premise

The fixed data and epoch budget plus matched backbones create equivalent optimization difficulty for global versus pairwise objectives.

What would settle it

Training pairwise models for many more epochs or with a larger data budget until their EER matches or beats the global baseline would show the gap was due to unequal training regimes rather than the objective itself.
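One way to make "trained long enough" checkable is a simple plateau test on validation EER. The sketch below is an assumption about how such a check could look; the function name `has_plateaued`, the window, the tolerance, and the example curves are all hypothetical and not taken from the paper.

```python
def has_plateaued(val_eers, window=5, tolerance=0.1):
    """True if the best validation EER improved by less than `tolerance`
    (absolute percentage points) over the last `window` epochs."""
    if len(val_eers) <= window:
        return False  # not enough history to compare against
    best_before = min(val_eers[:-window])
    best_recent = min(val_eers[-window:])
    return best_before - best_recent < tolerance


# Placeholder curves, NOT results from the paper.
flat_curve = [20.0, 15.0, 13.0, 12.5, 12.45, 12.44, 12.43, 12.43, 12.42]
falling_curve = [20.0, 18.0, 16.0, 14.0, 12.0, 10.0]
flat_done = has_plateaued(flat_curve, window=3)
falling_done = has_plateaued(falling_curve, window=3)
```

If a pairwise run still fails this kind of check at the fixed budget, the comparison with the global baseline would be confounded by under-training.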

Figures

Figures reproduced from arXiv: 2606.11666 by Anton Firc, Kamil Malinka, Vojtěch Staněk, Zbyněk Lička.

Figure 1. DET curves for the best Global and Pairwise systems on MLAAD and STOPA, highlighting behavior at strict low-FPR operating points. Additional pairwise ablations. Beyond the reported BCE pairwise variants, we tested additional objectives and initialization variants, including raw cosine similarity with a margin loss and initializing pairwise training from global (CE) baselines instead of training fro…
Figure 2. Cumulative variance analysis demonstrating dimensionality collapse.
Figure 3. Score CDFs on MLAAD for Global (CE) embneck13 (top) and Pairwise Rival + XLS-R finetune (bottom). We plot target CDF and non-target 1 − CDF. Diamonds mark means; vertical lines mark 95% and 99% quantiles. on the acoustic similarity of the sources. 1. Shared Limitations (Digital Twins). For pairs sharing identical architectures and training data (e.g., VITS vs. VITS-Neon), both objectives exhibit high erro…
Original abstract

Open-set source tracing is increasingly framed as a verification problem, motivating the use of pairwise metric-learning objectives from biometrics. We thus compare global anchoring and pairwise verification under matched backbones and a fixed data and epoch budget on MLAAD (in-domain) and STOPA (out-of-domain). In our runs, global anchoring yields lower in-domain error (8.61% EER) than pairwise variants (12-15% EER), even with rival mining and XLS-R finetuning. Because pairwise objectives optimize similarity directly, they concentrate variance into fewer embedding directions, reducing resolution among closely related generators. To test if this drives the drop, we impose a similar bottleneck to the globally supervised baseline, yet the baseline remains competitive. Together with an embedding-space analysis ($k_{99}$), these results suggest that the gap is not explained by dimensionality alone, but rather by the pairwise objective's shaping of the retained directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that global anchoring yields lower in-domain EER (8.61%) than pairwise verification objectives (12-15% EER) for open-set synthetic speech source tracing on MLAAD (in-domain) and STOPA (out-of-domain), even under matched backbones, fixed data/epoch budgets, rival mining, and XLS-R finetuning. It attributes the gap not to dimensionality but to pairwise objectives shaping retained embedding directions (reducing resolution among similar generators), supported by a bottleneck ablation on the global baseline and an embedding-space k99 analysis.

Significance. If robust, the result cautions against direct transfer of pairwise metric-learning objectives from biometrics to source tracing, as they may concentrate variance and limit discriminative power. Strengths include the empirical isolation of dimensionality via bottleneck test and the independent k99 measurement on learned embeddings; the work is purely empirical with no circular or self-referential derivations.

major comments (1)
  1. [Abstract, paragraph on experimental setup] The central claim that the EER gap (and k99 results) stems from the pairwise objective's shaping of retained directions requires that the fixed data/epoch budget and matched backbones produce comparably converged models. Pairwise verification with rival mining is known to be sensitive to margin, temperature, and mining frequency; without training curves, validation monitoring, or evidence that extra epochs/retuning would not close the gap, the attribution to objective shape rather than optimization difficulty remains untested. The bottleneck ablation addresses dimensionality but not this equivalence.
minor comments (2)
  1. Notation: 'k99' should be rendered as k_{99} for clarity in the embedding-space analysis description.
  2. The abstract refers to 'in our runs' but provides no details on number of runs, random seeds, or statistical significance testing for the reported EER differences.
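As a minimal sketch of the per-seed reporting that the second minor comment asks for, the following summarizes EER across seeds with a mean and sample standard deviation. The function name `summarize_runs` is hypothetical, and the numbers are placeholders rather than results from the paper.

```python
import statistics


def summarize_runs(eers):
    """Mean and sample standard deviation of EER across seeds."""
    return statistics.mean(eers), statistics.stdev(eers)


# Placeholder values, NOT results from the paper.
placeholder_global = [10.0, 11.0, 12.0]
placeholder_pairwise = [14.0, 15.0, 16.0]
g_mean, g_std = summarize_runs(placeholder_global)
p_mean, p_std = summarize_runs(placeholder_pairwise)
```

Reporting the spread alongside the mean lets a reader judge whether a gap between objectives exceeds seed-to-seed variation.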

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on convergence equivalence. We address the concern directly below and agree that additional evidence would strengthen the attribution of the EER gap to objective shape rather than optimization differences.

read point-by-point responses
  1. Referee: the central claim that the EER gap (and k99 results) stems from the pairwise objective's shaping of retained directions requires that the fixed data/epoch budget and matched backbones produce comparably converged models. Pairwise verification with rival mining is known to be sensitive to margin, temperature, and mining frequency; without training curves, validation monitoring, or evidence that extra epochs/retuning would not close the gap, the attribution to objective shape rather than optimization difficulty remains untested. The bottleneck ablation addresses dimensionality but not this equivalence.

    Authors: We agree that the fixed budget alone does not prove comparable convergence and that sensitivity of pairwise verification to hyperparameters could affect results. In our experiments the epoch count was selected after preliminary runs in which validation EER for both objectives stabilized, but we did not report the curves. In revision we will add training and validation loss/EER curves for all methods under the reported budget, together with the hyperparameter search details (margin, temperature, mining frequency) used for the pairwise variants. If the curves confirm that both families reach plateau performance, this will support the claim that the gap arises from retained embedding directions rather than under-optimization. The bottleneck ablation will remain as evidence that dimensionality is not the sole factor.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical comparison with independent measurements

full rationale

The paper reports experimental EER results (8.61% vs 12-15%) and k99 embedding analysis under matched backbones and fixed budgets. No derivation, equation, or self-citation chain reduces any reported quantity to a fitted parameter or input by construction. The bottleneck ablation and out-of-domain tests are external measurements on the learned embeddings, not tautological. The central claim rests on observed performance gaps rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the chosen datasets and fixed training budget are representative of the open-set source-tracing problem and that EER is an appropriate scalar summary; no new mathematical axioms or invented entities are introduced. The only free parameters are the standard training hyperparameters implicit in any neural-network experiment.


