SpAArSIST: Sparsified AASIST for Efficient and Reliable Anti-Spoofing

Anton Firc; Kamil Malinka; Martin Perešíni; Vojtěch Staněk; Zbyněk Lička

arxiv: 2606.11674 · v1 · pith:AC2UWLBEnew · submitted 2026-06-10 · 💻 cs.SD · cs.LG

Pith reviewed 2026-06-27 08:33 UTC · model grok-4.3

keywords: anti-spoofing, graph pooling, AASIST, sparsification, efficiency, robustness, SSL-based detection, model compression

The pith

Replacing learned pooling in AASIST with magnitude-based scoring and separate training and inference pooling ratios reduces backend compute by 20.7 percent while improving out-of-domain robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the standard AASIST graph pooling backend contains redundant learned operations that can be replaced by simpler explicit rules. Using magnitude-based node scoring, mean aggregation, and distinct pooling ratios for training and inference produces a smaller, faster model. This version maintains or improves detection accuracy on the ASVspoof5 benchmark and delivers clearer gains on the more challenging In-the-Wild dataset. The authors also supply a single composite score that balances accuracy, calibration, and compute cost to guide deployment choices.

Core claim

SpAArSIST replaces the learned pooling and stack-node attention inside the AASIST backend with magnitude-based node scoring, mean aggregation, and two separate graph pooling ratios (k_tr during training, k_inf at inference). The top-ranked configuration lowers backend MACs from 195.045 M to 154.706 M and parameters from 611.8 k to 586.4 k, while lowering equal-error rate on In-the-Wild from 4.64 percent to 2.82 percent and minDCF from 0.133 to 0.078, remaining competitive on ASVspoof5.
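The relative changes behind these figures can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the numbers quoted above; the helper name `relative_reduction` and the dictionary labels are illustrative choices, not taken from the paper.

```python
def relative_reduction(before, after):
    """Fractional reduction when a quantity drops from `before` to `after`."""
    return (before - after) / before


# (baseline AASIST, SpAArSIST rank 1), values as quoted in the core claim.
reported = {
    "backend MACs (M)": (195.045, 154.706),
    "parameters (k)": (611.8, 586.4),
    "In-the-Wild EER (%)": (4.64, 2.82),
    "In-the-Wild minDCF": (0.133, 0.078),
}

reductions = {
    name: relative_reduction(before, after)
    for name, (before, after) in reported.items()
}

for name, fraction in reductions.items():
    print(f"{name}: {100 * fraction:.2f}% lower")
```

With the rounded values quoted here, the MAC reduction comes out at about 20.7 percent and the parameter reduction at roughly 4.15 percent, which is consistent with the reported 4.1 percent given that the parameter counts are themselves rounded to one decimal place. In relative terms, the In-the-Wild EER and minDCF fall by roughly 39 and 41 percent.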

What carries the argument

magnitude-based node scoring together with mean aggregation and separate training versus inference pooling ratios (k_tr, k_inf)
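A minimal sketch of how these three pieces could fit together is given below. It is not the authors' implementation. It assumes that "magnitude" means the L2 norm of a node's feature vector, that the number of kept nodes is the pooling ratio times the node count rounded to the nearest integer, and that the kept nodes are averaged directly. The function name `magnitude_topk_mean` is hypothetical, and the defaults 0.3 and 0.1 are the ratios quoted for the top-ranked systems in the Results excerpt. In the actual backend, pooling and readout sit at different points in the graph; here they are folded into one function for brevity.

```python
import math


def magnitude_topk_mean(nodes, k_tr=0.3, k_inf=0.1, training=False):
    """Keep the highest-magnitude nodes, then mean-aggregate them.

    nodes: list of equal-length feature vectors, one per graph node.
    Returns (kept_indices, pooled_vector).
    """
    # Separate pooling ratios: a looser one in training, a tighter one at inference.
    ratio = k_tr if training else k_inf
    n_keep = max(1, round(ratio * len(nodes)))  # assumption: round to nearest

    # Assumption: node score = L2 norm of the feature vector (no learned weights).
    scores = [math.sqrt(sum(x * x for x in node)) for node in nodes]

    # Rank node indices by descending score and keep the first n_keep.
    ranked = sorted(range(len(nodes)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])

    # Mean aggregation over the kept nodes replaces attention-based readout.
    dim = len(nodes[0])
    pooled = [sum(nodes[i][d] for i in kept) / len(kept) for d in range(dim)]
    return kept, pooled


# Toy graph with ten 2-dimensional nodes (illustrative values only).
nodes = [[0.1, 0.0], [3.0, 4.0], [0.5, 0.5], [1.0, 0.0], [0.0, 2.0],
         [0.2, 0.1], [6.0, 8.0], [0.3, 0.3], [1.0, 1.0], [0.0, 0.1]]

kept_train, pooled_train = magnitude_topk_mean(nodes, training=True)
kept_infer, pooled_infer = magnitude_topk_mean(nodes, training=False)
```

On this toy input, three nodes survive in training mode and a single node survives at inference, which illustrates why the inference-time graph is cheaper to process than the training-time one.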

If this is right

The sparsified backend requires 20.7 percent fewer multiply-accumulate operations and 4.1 percent fewer parameters.
Detection performance on unseen real-world recordings improves rather than degrades.
A single composite score now exists that ranks models by joint accuracy, calibration, and compute cost.
The same explicit replacement pattern can be applied to other graph-based audio backends that currently rely on learned pooling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sparsification pattern could be tested on graph backends used for speaker verification or emotion recognition to check whether efficiency gains transfer.
Real-time voice-assistant pipelines could adopt the lighter model to lower latency and power draw on edge devices without retraining the upstream SSL encoder.
If the magnitude-based rule proves stable across future attack types, it reduces the need to re-learn pooling weights whenever the training distribution shifts.

Load-bearing premise

That swapping learned pooling and attention for magnitude scoring, mean aggregation, and fixed separate ratios will not create new failure modes on data distributions never seen during training.

What would settle it

A new out-of-domain spoofing corpus on which the sparsified model records higher EER or minDCF than the original AASIST backend would falsify the robustness improvement.
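For readers less familiar with the metric, EER is the operating point at which the false-acceptance rate and the false-rejection rate are equal. The sketch below approximates it by sweeping a threshold over all observed scores. It assumes that a higher score means "more likely bona fide"; the function name `equal_error_rate` and the toy scores are illustrative only and do not come from the paper. Official challenge tooling may interpolate between thresholds differently.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over every observed score.

    Assumption: higher score = more likely bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: a bona fide trial scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed trial scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of the two rates at the closest crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one bona fide trial and one spoofed trial are on the wrong side.
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.6]
toy_eer = equal_error_rate(bonafide, spoof)

# Perfectly separable scores should give an EER of zero.
clean_eer = equal_error_rate([0.9, 0.8], [0.1, 0.2])
```

Running the same function on the scores of both backends over a new out-of-domain corpus would give the comparison described above.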

Original abstract

We present SpAArSIST, a deployment-oriented refinement of the widely used AASIST graph pooling backend for self-supervised learning (SSL) based anti-spoofing. Motivated by redundant operations in public implementations, we replace learned pooling and stack-node attention with explicit, lightweight choices: separate train and inference graph pooling ratios $(k_{\mathrm{tr}},k_{\mathrm{inf}})$, magnitude-based node scoring, and mean aggregation of graph nodes. The best overall configuration (rank 1) cuts backend compute by 20.7% (195.045M $\rightarrow$ 154.706M MACs) and model size by 4.1% (611.8k $\rightarrow$ 586.4k params), while improving out-of-domain robustness on In-the-Wild to 2.82% EER and 0.078 minDCF (from 4.64% and 0.133) and remaining competitive on ASVspoof5. We further provide a composite selection score that summarizes accuracy, calibration, and compute to support balanced deployment-oriented model choice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpAArSIST is a practical sparsification of AASIST that swaps learned pooling for magnitude scoring and separate train/inference ratios, delivering a reported 20.7% cut in backend MACs and a lower In-the-Wild EER.

The core contribution is a set of explicit substitutions in the AASIST graph backend: magnitude-based node scoring instead of learned pooling, mean aggregation, and distinct k_tr and k_inf ratios. The best configuration they report trims backend compute from 195M to 155M MACs and trims parameters slightly while improving EER on In-the-Wild from 4.64% to 2.82% and staying competitive on ASVspoof5. They also supply a composite score that folds accuracy, calibration, and compute together.

The work is straightforward engineering. It starts from observed redundancy in public AASIST code and replaces the learned pieces with lightweight rules that are easy to implement and tune. The numbers on MACs, parameter count, and the two evaluation sets are concrete enough to be useful for anyone sizing a deployment.

The main soft spot is verification of the central assumption. The robustness gain on out-of-domain data is the strongest claim, yet it rests on the magnitude heuristic and mean aggregation preserving the nodes that actually matter for spoof detection. Without node-selection comparisons or ablations that isolate the effect of the substitutions, it is possible the gains trace to other unmentioned changes in training or front-end handling. The abstract gives no error bars or statistical tests, so the size of the improvement is hard to judge precisely.

This paper is for practitioners who already run AASIST or similar graph backends and need lower inference cost without a full redesign. It is not a broad methodological advance, but the efficiency and OOD numbers are the kind of thing that matters in voice-security pipelines.

I would send it to peer review. The claims are testable, the motivation is clear, and the deployment angle is worth referee scrutiny even if the final verdict depends on the methods and ablation sections.

Referee Report

2 major / 1 minor

Summary. The paper introduces SpAArSIST, a sparsified refinement of the AASIST graph pooling backend for SSL-based anti-spoofing. Motivated by redundancy in public implementations, it replaces learned pooling and stack-node attention with magnitude-based node scoring, mean aggregation, and separate train/inference pooling ratios (k_tr, k_inf). The best configuration reduces backend MACs by 20.7% (195.045M to 154.706M) and parameters by 4.1% (611.8k to 586.4k), while improving In-the-Wild EER to 2.82% and minDCF to 0.078 (from 4.64% and 0.133) and remaining competitive on ASVspoof5. A composite selection score combining accuracy, calibration, and compute is also proposed.

Significance. If the central performance claims hold after verification, the work supplies a practical, deployment-oriented backend that lowers compute while strengthening out-of-domain robustness, with explicit lightweight substitutions that aid reproducibility. The composite score for balanced model selection is a useful addition for real-world anti-spoofing systems.

major comments (2)

[§3 (Methods)] The central claim that the substitutions (magnitude-based scoring, mean aggregation, separate k_tr/k_inf) preserve or improve spoof detection on OOD data rests on an unverified assumption. No direct comparison is provided between nodes selected by the original learned pooling versus the magnitude heuristic, particularly on In-the-Wild distributions where low-magnitude nodes may carry task-relevant information.
[§4 (Experiments)] Table reporting the rank-1 configuration and In-the-Wild results: the EER/minDCF gains (2.82%/0.078 vs 4.64%/0.133) lack error bars, multiple runs, or statistical significance tests, and no ablation isolates the contribution of each substitution, weakening support for attributing the 20.7% MAC reduction and robustness improvement specifically to the proposed changes.
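The node-selection comparison requested in the first major comment could be as simple as measuring the overlap between the two sets of kept node indices for each utterance. The sketch below uses the Jaccard index for this. The function name `selection_overlap` and the index lists are hypothetical placeholders, not outputs of either model.

```python
def selection_overlap(kept_a, kept_b):
    """Jaccard index between two sets of kept node indices (1.0 = identical)."""
    a, b = set(kept_a), set(kept_b)
    union = a | b
    # Two empty selections are treated as identical by convention.
    return len(a & b) / len(union) if union else 1.0


# Hypothetical kept-node indices for one utterance (placeholders, not real data).
learned_pooling_kept = [1, 4, 6]
magnitude_kept = [1, 6, 8]

overlap = selection_overlap(learned_pooling_kept, magnitude_kept)
```

Averaging this value separately over ASVspoof5 and In-the-Wild utterances would show whether the magnitude heuristic discards nodes that the learned pooling kept, and whether that differs out of domain.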

minor comments (1)

[Abstract] The composite selection score is referenced but its precise weighting formula and normalization are not shown in the provided abstract or summary; include the explicit definition in the main text for reproducibility.
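To make concrete what such a definition needs to pin down, here is one possible form of a composite score: a weighted sum of min-max normalized, lower-is-better metrics. This is an assumed formula for illustration and is not the paper's score, whose weighting and normalization are not given on this page. The function name, metric keys, and equal weights are all hypothetical. No calibration metric is quoted here, so none is included; it would enter as one more key.

```python
def composite_scores(systems, weights):
    """Weighted sum of min-max normalized metrics; lower is better throughout."""
    names = list(weights)
    lo = {m: min(s[m] for s in systems.values()) for m in names}
    hi = {m: max(s[m] for s in systems.values()) for m in names}
    total_weight = sum(weights.values())

    scores = {}
    for label, metrics in systems.items():
        acc = 0.0
        for m in names:
            span = hi[m] - lo[m]
            # A metric on which all systems tie contributes nothing.
            norm = 0.0 if span == 0 else (metrics[m] - lo[m]) / span
            acc += weights[m] * norm
        scores[label] = acc / total_weight
    return scores


# Metric values as quoted in the summary; keys and weights are assumptions.
systems = {
    "AASIST baseline": {"itw_eer": 4.64, "itw_min_dcf": 0.133, "backend_mmacs": 195.045},
    "SpAArSIST rank 1": {"itw_eer": 2.82, "itw_min_dcf": 0.078, "backend_mmacs": 154.706},
}
weights = {"itw_eer": 1.0, "itw_min_dcf": 1.0, "backend_mmacs": 1.0}

scores = composite_scores(systems, weights)
best = min(scores, key=scores.get)
```

With only two systems, min-max normalization degenerates to 0 and 1 per metric, so the ranking is trivial. The choice of weights and normalization starts to matter once several configurations trade accuracy against compute, which is why the explicit definition is needed.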

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review of our manuscript on SpAArSIST. We address each major comment below with clarifications and indicate planned revisions where they strengthen the work without altering core claims.

Referee: [§3 (Methods)] The central claim that the substitutions (magnitude-based scoring, mean aggregation, separate k_tr/k_inf) preserve or improve spoof detection on OOD data rests on an unverified assumption. No direct comparison is provided between nodes selected by the original learned pooling versus the magnitude heuristic, particularly on In-the-Wild distributions where low-magnitude nodes may carry task-relevant information.

Authors: The end-to-end results on In-the-Wild data verify the claim: SpAArSIST improves OOD robustness over the original AASIST (EER 2.82% vs 4.64%, minDCF 0.078 vs 0.133). This shows the substitutions preserve and enhance task-relevant information for spoof detection. The magnitude heuristic is motivated by observed redundancy in learned pooling; prioritizing high-magnitude nodes yields better robustness, even if some low-magnitude nodes carry information. A direct node-selection comparison would add interpretability but is not required to support the performance-based validation of the lightweight alternative. revision: no
Referee: [§4 (Experiments)] Table reporting the rank-1 configuration and In-the-Wild results: the EER/minDCF gains (2.82%/0.078 vs 4.64%/0.133) lack error bars, multiple runs, or statistical significance tests, and no ablation isolates the contribution of each substitution, weakening support for attributing the 20.7% MAC reduction and robustness improvement specifically to the proposed changes.

Authors: We agree that error bars, multiple runs, and component ablations would improve rigor. The reported gains come from the best configuration found via hyperparameter search, with consistent improvements across metrics and the proposed composite score. In revision we will add multiple-run results with standard deviations and a dedicated ablation isolating magnitude scoring, mean aggregation, and separate k_tr/k_inf ratios to better attribute the 20.7% MAC reduction and robustness gains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical sparsification with direct measurements

The paper describes an empirical refinement of the AASIST backend by substituting learned pooling and attention with magnitude-based scoring, mean aggregation, and separate (k_tr, k_inf) ratios. All reported gains (MACs, parameter count, EER, minDCF) are obtained from explicit experimental comparisons against the baseline on ASVspoof5 and In-the-Wild data. No equations, fitted parameters, or self-citations are invoked to derive the performance numbers by construction; the substitutions are motivated by redundancy observations and validated through measurement. The derivation chain is therefore self-contained and externally falsifiable via the reported benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the assumption that explicit lightweight operations can substitute for learned pooling without performance loss. No new physical entities or unstated mathematical axioms are introduced beyond standard graph neural network assumptions.

free parameters (1)

k_tr and k_inf
Explicit graph pooling ratios chosen separately for training and inference; values not reported in abstract but treated as design choices.

Reference graph

Works this paper leans on

30 extracted references · 2 canonical work pages

[1]

Introduction. Speech spoofing and deepfake generation methods continue to improve in quality and diversity, increasing the demand for robust, deployable spoofing countermeasures [1, 2, 3]. Modern detection pipelines typically combine a pretrained self-supervised learning front-end (e.g., XLS-R [4], Wav2Vec2.0 [5], or WavLM [6]) with a learnable pooling back...
[2]

Related Work. Anti-spoofing countermeasures: Early countermeasures relied on handcrafted acoustic features and shallow classifiers [17]. Deep learning approaches later improved separation by learning robust representations directly from waveforms or spectrograms, often aided by data augmentation [18] and challenge protocols [19, 20, 21]. Deep learning...

Pith/arXiv arXiv 2026
[3]

For an input utterance, the processing can be summarized as: 1. Front-end feature extraction: a pretrained SSL encoder produces frame-level representations

Methodology. AASIST [8] follows a graph-based pooling backend on top of an SSL front-end. For an input utterance, the processing can be summarized as: 1. Front-end feature extraction: a pretrained SSL encoder produces frame-level representations. 2. Graph construction: the utterance is mapped to graph nodes (frames or regions), typically forming parallel spectr...
[4]

lower is better

Experiment Setup. 4.1. Architectural Backbone and Feature Pipeline. Our experiments utilize a unified self-supervised learning (SSL) front-end. We employ the Wav2Vec2.0 XLS-R (300M) front-end to generate frame-level representations. These features are subsequently integrated by an AASIST/SpAArSIST pooling layer, resulting in a fixed-size utterance embedding...
[5]

The top-ranked systems pair magnitude-based scoring with more aggressive pruning (k_tr = 0.3, k_inf = 0.1), remaining competitive on ASVspoof 5 while improving robustness on ITW

Results. Table 1 reports representative AASIST/SpAArSIST configurations, metrics, and composite scores. The top-ranked systems pair magnitude-based scoring with more aggressive pruning (k_tr = 0.3, k_inf = 0.1), remaining competitive on ASVspoof 5 while improving robustness on ITW. Table 2 compares pooling backends under the same XLS-R front end and training ...

arXiv 2094
[6]

Conclusion. We proposed SpAArSIST, a deployment-oriented simplification of the AASIST graph backend in an XLS-R pipeline. Our results indicate that several commonly used graph components are not essential for robust spoofing detection: the stack-node attention behaves close to mean aggregation, and temperature tuning does not yield consistent gains, so the at...
[7]

Acknowledgments. This work was partially supported by the Brno University of Technology (internal project FIT-S-23-8151) and the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254)
[8]

The authors reviewed and edited the output as needed and take full responsibility for the publication’s content

Generative AI Use Disclosure. During the preparation of this work, the authors used Generative AI Models (specifically Google Gemini, ChatGPT, and Grammarly) for language editing and text refinement. The authors reviewed and edited the output as needed and take full responsibility for the publication’s content
[9]

The dawn of a text-dependent society: deepfakes as a threat to speech verification systems,

A. Firc and K. Malinka, “The dawn of a text-dependent society: deepfakes as a threat to speech verification systems,” ser. SAC ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 1646–1655. [Online]. Available: https://doi.org/10.1145/3477314.3507013

work page doi:10.1145/3477314.3507013 2022
[10]

Resilience of voice assistants to synthetic speech,

K. Malinka, A. Firc, P. Kaška, T. Lapšanský, O. Šandor, and I. Homoliak, “Resilience of voice assistants to synthetic speech,” in Computer Security – ESORICS 2024, J. Garcia-Alfaro, R. Kozik, M. Choraś, and S. Katsikas, Eds. Cham: Springer Nature Switzerland, 2024, pp. 66–84

2024
[11]

Assessing the human ability to recognize synthetic speech in ordinary conversation,

D. Prudký, A. Firc, and K. Malinka, “Assessing the human ability to recognize synthetic speech in ordinary conversation,” in 2023 International Conference of the Biometrics Special Interest Group (BIOSIG), 2023, pp. 1–5

2023
[12]

Xls-r: Self-supervised cross-lingual speech representation learning at scale,

A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “Xls-r: Self-supervised cross-lingual speech representation learning at scale,” 2021. [Online]. Available: https://arxiv.org/abs/2111.09296

arXiv 2021
[13]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” 2020. [Online]. Available: https://arxiv.org/abs/2006.11477

arXiv 2020
[14]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, July 2022. [Online]....

work page doi:10.1109/jstsp.2022.3188113 2022
[15]

Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,

J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6367–6371

2022
[16]

Automatic Speaker Verification Spoofing and Deepfake Detection Using Wav2vec 2.0 and Data Augmentation,

H. Tak, M. Todisco, X. Wang, J. weon Jung, J. Yamagishi, and N. Evans, “Automatic Speaker Verification Spoofing and Deepfake Detection Using Wav2vec 2.0 and Data Augmentation,” in The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 112–119

2022
[17]

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification,

J. Peng, O. Plchot, T. Stafylakis, L. Mosner, L. Burget, and J. Cernocky, “An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification,” 2022. [Online]. Available: https://arxiv.org/abs/2210.01273

arXiv 2022
[18]

Ca-mhfa: A context-aware multi-head factorized attentive pooling for ssl-based speaker verification,

J. Peng, L. Mošner, L. Zhang, O. Plchot, T. Stafylakis, L. Burget, and J. Černocký, “Ca-mhfa: A context-aware multi-head factorized attentive pooling for ssl-based speaker verification,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025
[19]

Audio deepfake detection with self-supervised xls-r and sls classifier,

Q. Zhang, S. Wen, and T. Hu, “Audio deepfake detection with self-supervised xls-r and sls classifier,” ser. MM ’24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 6765–6773. [Online]. Available: https://doi.org/10.1145/3664647.3681345

arXiv 2024
[20]

BUT systems and analyses for the ASVspoof 5 Challenge,

J. Rohdin, L. Zhang, P. Oldřich, V. Staněk, D. Mihola, J. Peng, T. Stafylakis, D. Beveraki, A. Silnova, J. Brukner, and L. Burget, “BUT systems and analyses for the ASVspoof 5 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 24–31

2024
[21]

Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,

A. Firc, K. Malinka, and P. Hanáček, “Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,” Cybersecurity, vol. 8, no. 1, p. 50, Aug 2025. [Online]. Available: https://doi.org/10.1186/s42400-024-00346-1

2025
[22]

Asvspoof 5: crowdsourced speech data, deepfakes, and adversarial attacks at scale,

X. Wang, H. Delgado, H. Tak, J. weon Jung, H. jin Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. H. Kinnunen, N. Evans, K. A. Lee, and J. Yamagishi, “Asvspoof 5: crowdsourced speech data, deepfakes, and adversarial attacks at scale,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 1–8

2024
[23]

AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge,

K. Borodin, V. Kudryavtsev, D. Korzh, A. Efimenko, G. Mkrtchian, M. Gorodnichev, and O. Y. Rogov, “AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 48–55

2024
[24]

Towards scalable aasist: Refining graph attention for speech deepfake detection,

I. Viakhirev, D. Sirota, A. Smirnov, and K. Borodin, “Towards scalable aasist: Refining graph attention for speech deepfake detection,” 2025. [Online]. Available: https://arxiv.org/abs/2507.11777

2025
[25]

Deepfakes as a threat to a speaker and facial recognition: an overview of tools and attack vectors,

A. Firc, K. Malinka, and P. Hanáček, “Deepfakes as a threat to a speaker and facial recognition: an overview of tools and attack vectors,” Heliyon, vol. 9, no. 4, pp. 1–33, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2405844023022971

2023
[26]

Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,

H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6382–6386

2022
[27]

Asvspoof 5: Design, collection and validation of resources for spoofing, deepfake, and adversarial attack detection using crowdsourced speech,

X. Wang, H. Delgado, H. Tak, J. weon Jung, H. jin Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen, N. Evans, K. A. Lee, J. Yamagishi, M. Jeong, G. Zhu, Y. Zang, Y. Zhang, S. Maiti, F. Lux, N. Müller, W. Zhang, C. Sun, S. Hou, S. Lyu, S. L. Maguer, C. Gong, H. Guo, L. Chen, and V. Singh, “Asvspoof 5: Design, collection and validation ...

arXiv 2025
[28]

Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection,

J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans, and H. Delgado, “Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection,” 2021. [Online]. Available: https://arxiv.org/abs/2109.00537

arXiv 2021
[29]

Add 2023: the second audio deepfake detection challenge,

J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y. Zhang, X. Zhang, Y. Zhao, Y. Ren, L. Xu, J. Zhou, H. Gu, Z. Wen, S. Liang, Z. Lian, S. Nie, and H. Li, “Add 2023: the second audio deepfake detection challenge,” 2023. [Online]. Available: https://arxiv.org/abs/2305.13774

arXiv 2023
[30]

Does audio deepfake detection generalize?

N. M. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger, “Does audio deepfake detection generalize?” Interspeech, 2022

2022

[1] [1]

Introduction Speech spoofing and deepfake generation methods continue to improve in quality and diversity, increasing the de- mand for robust, deployable spoofing countermeasures [1, 2, 3]. Modern detection pipelines typically combine a pre- trained self-supervised learning front-end (e.g.,XLS-R[4], Wav2Vec2.0[5] orWavLM[6]), with a learnable pooling back...

[2] [2]

Related Work Anti-spoofing countermeasures:Early countermeasures re- lied on handcrafted acoustic features and shallow classi- fiers [17]. Deep learning approaches later improved separation by learning robust representations directly from waveforms or spectrograms, often aided by data augmentation [18] and chal- lenge protocols [19, 20, 21]. Deep learning...

Pith/arXiv arXiv 2026

[3] [3]

For an input utterance, the processing can be summarized as: 1.Front-end feature extraction:a pretrained SSL encoder produces frame-level representations

Methodology AASIST[8] follows a graph-based pooling backend on top of an SSL front-end. For an input utterance, the processing can be summarized as: 1.Front-end feature extraction:a pretrained SSL encoder produces frame-level representations. 2.Graph construction:the utterance is mapped to graph nodes (frames or regions), typically forming parallel spectr...

[4] [4]

lower is better

Experiment Setup 4.1. Architectural Backbone and Feature Pipeline Our experiments utilize a unified Self-Supervised Learn- ing (SSL) frontend. We employ theWav2Vec2.0 XLS-R (300M)front-end to generate frame-level represen- tations. These features are subsequently integrated by an AASIST/SpAArSISTpooling layer, resulting in a fixed-size utterance embedding...

[5] [5]

The top-ranked systems pair magnitude-based scoring with more aggressive pruning (ktr = 0.3,k inf = 0.1), remaining competitive on ASVspoof 5 while improving robustness on ITW

Results Table 1 reports representativeAASIST/SpAArSISTconfigu- rations, metrics, and composite scores. The top-ranked systems pair magnitude-based scoring with more aggressive pruning (ktr = 0.3,k inf = 0.1), remaining competitive on ASVspoof 5 while improving robustness on ITW. Table 2 compares pooling backends under the sameXLS-R front end and training ...

arXiv 2094

[6] [6]

Conclusion We proposedSpAArSIST, a deployment-oriented simplifica- tion of theAASISTgraph backend in anXLS-Rpipeline. Our results indicate that several commonly used graph components are not essential for robust spoofing detection: the stack-node attention behaves close to mean aggregation, and temperature tuning does not yield consistent gains, so the at...

[7] [7]

Acknowledgments This work was partially supported by the Brno University of Technology (internal project FIT-S-23-8151) and the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254)

[8] [8]

The authors reviewed and edited the output as needed and take full respon- sibility for the publication’s content

Generative AI Use Disclosure During the preparation of this work, the authors used Generative AI Models (specifically Google Gemini, ChatGPT, and Gram- marly) for language editing and text refinement. The authors reviewed and edited the output as needed and take full respon- sibility for the publication’s content

[9] [9]

The dawn of a text-dependent society: deepfakes as a threat to speech verification systems,

A. Firc and K. Malinka, “The dawn of a text-dependent society: deepfakes as a threat to speech verification systems,” ser. SAC ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 1646–1655. [Online]. Available: https://doi.org/10.1145/3477314.3507013

work page doi:10.1145/3477314.3507013 2022

[10] [10]

Resilience of voice assistants to synthetic speech,

K. Malinka, A. Firc, P. Kaˇska, T. Lapˇsansk´y, O. ˇSandor, and I. Ho- moliak, “Resilience of voice assistants to synthetic speech,” in Computer Security – ESORICS 2024, J. Garcia-Alfaro, R. Kozik, M. Chora ´s, and S. Katsikas, Eds. Cham: Springer Nature Switzerland, 2024, pp. 66–84

2024

[11] [11]

Assessing the human ability to recognize synthetic speech in ordinary conversation,

D. Prudk ´y, A. Firc, and K. Malinka, “Assessing the human ability to recognize synthetic speech in ordinary conversation,” in2023 International Conference of the Biometrics Special Interest Group (BIOSIG), 2023, pp. 1–5

2023

[12] [12]

Xls-r: Self-supervised cross- lingual speech representation learning at scale,

A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y . Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, “Xls-r: Self-supervised cross- lingual speech representation learning at scale,” 2021. [Online]. Available: https://arxiv.org/abs/2111.09296

arXiv 2021

[13] [13]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” 2020. [Online]. Available: https://arxiv.org/abs/ 2006.11477

arXiv 2020

[14] [14]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y . Qian, Y . Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, p. 1505–1518, July 2022. [Online]....

work page doi:10.1109/jstsp.2022.3188113 2022

[15] [15]

Aasist: Audio anti-spoofing using in- tegrated spectro-temporal graph attention networks,

J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using in- tegrated spectro-temporal graph attention networks,” inICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6367–6371

2022

[16] [16]

Automatic Speaker Verification Spoofing and Deep- fake Detection Using Wav2vec 2.0 and Data Augmentation,

H. Tak, M. Todisco, X. Wang, J. weon Jung, J. Yamagishi, and N. Evans, “Automatic Speaker Verification Spoofing and Deep- fake Detection Using Wav2vec 2.0 and Data Augmentation,” inThe Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 112–119

2022

[17] [17]

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification,

J. Peng, O. Plchot, T. Stafylakis, L. Mosner, L. Burget, and J. Cernocky, “An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification,” 2022. [Online]. Available: https://arxiv.org/abs/2210.01273

arXiv 2022

[18] [18]

Ca-mhfa: A context-aware multi-head fac- torized attentive pooling for ssl-based speaker verification,

J. Peng, L. Mo ˇsner, L. Zhang, O. Plchot, T. Stafylakis, L. Bur- get, and J. ˇCernock´y, “Ca-mhfa: A context-aware multi-head fac- torized attentive pooling for ssl-based speaker verification,” in ICASSP 2025 - 2025 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

2025

[19] [19]

Audio deepfake detection with self-supervised xls-r and sls classifier,

Q. Zhang, S. Wen, and T. Hu, “Audio deepfake detection with self-supervised xls-r and sls classifier,” ser. MM ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 6765–6773. [Online]. Available: https://doi.org/10.1145/ 3664647.3681345

arXiv 2024

[20] [20]

BUT systems and analyses for the ASVspoof 5 Challenge,

J. Rohdin, L. Zhang, P. Oldřich, V. Staněk, D. Mihola, J. Peng, T. Stafylakis, D. Beveraki, A. Silnova, J. Brukner, and L. Burget, “BUT systems and analyses for the ASVspoof 5 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 24–31

2024

[21] [21]

Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,

A. Firc, K. Malinka, and P. Hanáček, “Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,” Cybersecurity, vol. 8, no. 1, p. 50, Aug 2025. [Online]. Available: https://doi.org/10.1186/s42400-024-00346-1

2025

[22] [22]

Asvspoof 5: crowdsourced speech data, deepfakes, and adversarial attacks at scale,

X. Wang, H. Delgado, H. Tak, J. weon Jung, H. jin Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. H. Kinnunen, N. Evans, K. A. Lee, and J. Yamagishi, “Asvspoof 5: crowdsourced speech data, deepfakes, and adversarial attacks at scale,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 1–8

2024

[23] [23]

AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge,

K. Borodin, V. Kudryavtsev, D. Korzh, A. Efimenko, G. Mkrtchian, M. Gorodnichev, and O. Y. Rogov, “AASIST3: KAN-enhanced AASIST speech deepfake detection using SSL features and additional regularization for the ASVspoof 2024 Challenge,” in The Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 48–55

2024

[24] [24]

Towards scalable aasist: Refining graph attention for speech deepfake detection,

I. Viakhirev, D. Sirota, A. Smirnov, and K. Borodin, “Towards scalable aasist: Refining graph attention for speech deepfake detection,” 2025. [Online]. Available: https://arxiv.org/abs/2507.11777

2025

[25] [25]

Deepfakes as a threat to a speaker and facial recognition: an overview of tools and attack vectors,

A. Firc, K. Malinka, and P. Hanáček, “Deepfakes as a threat to a speaker and facial recognition: an overview of tools and attack vectors,” Heliyon, vol. 9, no. 4, pp. 1–33, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2405844023022971

2023

[26] [26]

Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,

H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6382–6386

2022

[27] [27]

Asvspoof 5: Design, collection and validation of resources for spoofing, deepfake, and adversarial attack detection using crowdsourced speech,

X. Wang, H. Delgado, H. Tak, J. weon Jung, H. jin Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen, N. Evans, K. A. Lee, J. Yamagishi, M. Jeong, G. Zhu, Y. Zang, Y. Zhang, S. Maiti, F. Lux, N. Müller, W. Zhang, C. Sun, S. Hou, S. Lyu, S. L. Maguer, C. Gong, H. Guo, L. Chen, and V. Singh, “Asvspoof 5: Design, collection and validation ...

arXiv 2025

[28] [28]

Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection,

J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans, and H. Delgado, “Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection,” 2021. [Online]. Available: https://arxiv.org/abs/2109.00537

arXiv 2021

[29] [29]

Add 2023: the second audio deepfake detection challenge,

J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y. Zhang, X. Zhang, Y. Zhao, Y. Ren, L. Xu, J. Zhou, H. Gu, Z. Wen, S. Liang, Z. Lian, S. Nie, and H. Li, “Add 2023: the second audio deepfake detection challenge,” 2023. [Online]. Available: https://arxiv.org/abs/2305.13774

arXiv 2023

[30] [30]

Does audio deepfake detection generalize?

N. M. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger, “Does audio deepfake detection generalize?” Interspeech, 2022

2022