Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

Remya K. Sasi; S. Sutharya

arxiv: 2605.29531 · v2 · pith:U5YQ4DVHnew · submitted 2026-05-28 · 💻 cs.SD · cs.CV· cs.LG

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

S. Sutharya , Remya K. Sasi This is my paper

Pith reviewed 2026-06-29 06:00 UTC · model grok-4.3

classification 💻 cs.SD cs.CVcs.LG

keywords audio deepfake detectionhalf-truth localisationcross-attentive fusionternary classificationboundary regressionMFCCLFCCChroma-STFT

0 comments

The pith

A 576k-parameter model jointly classifies real, fully fake, and half-truth audio while regressing the boundaries of spliced segments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CAFNet to tackle partially manipulated speech, a realistic threat where only a short synthesised portion is inserted into genuine audio. It performs ternary classification and boundary regression in one pass by fusing MFCC, LFCC, and Chroma-STFT features through cross-attention. The model is evaluated on the MLADDC T2+T3 test set, reporting 92.71 percent ternary accuracy, 0.9910 macro AUC, and 0.075 second boundary MAE. On binary detection it reaches 96.76 percent accuracy and 3.20 percent EER while using over 500 times fewer parameters than fine-tuned XLS-R or AST models. A cross-dataset study shows that standard fine-tuning of large backbones collapses performance across domains.

Core claim

CAFNet fuses MFCC, LFCC, and Chroma-STFT features through parallel depthwise-separable convolution branches with cross-attention, followed by a BiLSTM regression head for boundary prediction. On the combined MLADDC T2+T3 test set it achieves 92.71 percent accuracy and macro AUC of 0.9910 for ternary classification, boundary localisation MAE of 0.075 seconds, and on binary detection 96.76 percent accuracy with 3.20 percent EER, outperforming fine-tuned XLS-R 300M and AST 87M at over 500 times fewer parameters.

What carries the argument

Cross-attentive feature fusion: parallel depthwise-separable convolution branches on MFCC, LFCC, and Chroma-STFT inputs with cross-attention between branches, plus BiLSTM regression head for joint classification and boundary regression.

If this is right

Joint ternary classification and boundary regression can be performed in a single forward pass without separate models for detection and localisation.
The small parameter count enables deployment in resource-constrained settings while still outperforming much larger pre-trained models on both tasks.
Cross-dataset evaluation reveals that standard fine-tuning of large backbones leads to collapsed representations even with reduced learning rates.
The reported median boundary error of 0.052 seconds provides a concrete target for localisation precision in half-truth scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The cross-attention fusion of multiple cepstral features could be adapted to localise manipulations in other sequential signals such as video or sensor data.
Lightweight models of this form may support real-time monitoring in applications like voice authentication or social media audio moderation.
The ternary output and boundary regression together could support downstream tasks such as automated editing or forensic timeline reconstruction.
Testing the architecture on datasets with multiple spliced segments or varying splice lengths would clarify the limits of the current boundary regression head.

Load-bearing premise

The MLADDC T2+T3 test partitions contain held-out half-truth examples whose generation process and acoustic conditions are representative of real-world partial manipulations and were not seen during any hyperparameter search or model selection.

What would settle it

Evaluating CAFNet on a new collection of real-world spliced audio recordings generated by different synthesis methods and recording conditions than those in MLADDC would test whether the reported accuracy and localisation error hold.

Figures

Figures reproduced from arXiv: 2605.29531 by Remya K. Sasi, S. Sutharya.

**Figure 2.** Figure 2: CAFNet: parallel EnhancedPath branches, CrossAttnFusion module, classification [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: CAFNet confusion matrix on the unified MLADDC T2+T3 test set. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Representative boundary predictions. Top: best case – predicted boundaries match ground truth within one analysis frame. Bottom: hard case – low pHT (0.194) coincides with a 2.54 s localisation error [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CAFNet is a compact cross-attentive model that jointly handles ternary classification and boundary regression for half-truth audio deepfakes, with competitive numbers on MLADDC but limited evidence on test-set independence.

read the letter

The core contribution is a 576k-parameter CAFNet that fuses MFCC, LFCC, and Chroma-STFT features via parallel depthwise-separable convolutions and cross-attention, then adds a BiLSTM head for joint ternary classification (real, fully fake, half-truth) and temporal boundary regression. On the combined MLADDC T2+T3 test set it reports 92.71% accuracy, 0.991 macro AUC, 0.075 s boundary MAE, and 3.20% EER on binary detection, beating fine-tuned XLS-R 300M and AST 87M at far lower compute. A cross-dataset experiment also shows that standard fine-tuning erodes cross-domain performance even with reduced learning rates.

The architecture choice and joint-task framing are the clearest advances; they directly target the partial-manipulation case that binary detectors ignore, and the small size is a practical plus. The reported metrics are internally consistent with the described pipeline.

The main uncertainty is whether the T2+T3 partitions are fully held-out. The abstract gives no explicit description of how the spliced segments were generated, whether any overlap exists with training data or hyperparameter search, or how acoustic conditions compare to real-world cases. If any leakage or distribution shift is present, the gains over the large baselines become harder to interpret. The cross-dataset results are helpful but do not resolve the in-domain split question.

This paper is aimed at audio-forensics and deepfake-detection researchers who need lightweight models for partial manipulations. It is coherent on its own terms and supplies concrete empirical results, so it merits a serious referee even if methods details and split verification require expansion.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CAFNet, a 576k-parameter model for joint ternary classification (real, fully-fake, half-truth) and temporal boundary regression on partially manipulated ('half-truth') audio. It fuses MFCC, LFCC, and Chroma-STFT features via parallel depthwise-separable convolution branches with cross-attention, followed by a BiLSTM regression head. On the combined MLADDC T2+T3 test set the model reports 92.71% ternary accuracy and 0.9910 macro AUC, 0.075 s boundary MAE (median 0.052 s), and for binary detection 96.76% accuracy with 3.20% EER, outperforming fine-tuned XLS-R 300M and AST 87M at >500× fewer parameters. A cross-dataset study on representation collapse under fine-tuning is also presented.

Significance. If the performance numbers are obtained on genuinely held-out half-truth examples whose synthesis pipelines and acoustic conditions were never seen during training or model selection, the result would be significant: it shows that a compact, multi-task architecture can simultaneously solve detection and localisation for a more realistic threat model while remaining deployable on edge devices. The explicit parameter count and joint-training formulation are concrete strengths.

major comments (2)

[§4] §4 (Experiments), data-partition paragraph: the construction of the MLADDC T2 and T3 test partitions is not described (no statement on speaker overlap, vocoder overlap, recording conditions, or whether any half-truth examples were used in hyper-parameter search). Because the headline metrics (92.71% ternary accuracy, 0.9910 macro AUC, 3.20% EER) are only interpretable under a strict held-out regime, this omission is load-bearing for the central claim.
[§3, §4.2] §3 (Model) and §4.2 (Training protocol): it is not stated whether boundary regression is trained jointly with the ternary classifier or post-hoc, nor which loss combination (cross-entropy + L1/L2 on boundaries) is used. The reported 0.075 s MAE cannot be assessed without this information, which directly affects the joint-task claim.

minor comments (2)

[§4.3] The abstract and §4.3 state that CAFNet 'outperforms' XLS-R and AST but do not report the exact fine-tuning protocol (learning-rate schedule, number of epochs, data augmentation) used for the baselines; adding a short table would improve reproducibility.
[Figure 2] Figure 2 (architecture diagram) uses an inconsistent arrow style for the cross-attention blocks; a single legend would aid clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The two major comments highlight important omissions in the experimental description that affect interpretability of the results. We address each point below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses

Referee: [§4] §4 (Experiments), data-partition paragraph: the construction of the MLADDC T2 and T3 test partitions is not described (no statement on speaker overlap, vocoder overlap, recording conditions, or whether any half-truth examples were used in hyper-parameter search). Because the headline metrics (92.71% ternary accuracy, 0.9910 macro AUC, 3.20% EER) are only interpretable under a strict held-out regime, this omission is load-bearing for the central claim.

Authors: We agree that the data-partition details are insufficiently described and that this information is necessary to support the held-out evaluation claims. The MLADDC T2 and T3 partitions were constructed with no speaker overlap with the training set, distinct vocoders, and different recording conditions from the training data; no half-truth examples were included in hyper-parameter search or validation. We will expand the data-partition paragraph in §4 to explicitly document these properties, citing the original MLADDC dataset splits. revision: yes
Referee: [§3, §4.2] §3 (Model) and §4.2 (Training protocol): it is not stated whether boundary regression is trained jointly with the ternary classifier or post-hoc, nor which loss combination (cross-entropy + L1/L2 on boundaries) is used. The reported 0.075 s MAE cannot be assessed without this information, which directly affects the joint-task claim.

Authors: The boundary regression is trained jointly with the ternary classifier in a single multi-task objective. The combined loss is cross-entropy for the three-class classification head plus L1 loss on the boundary regression head, with equal weighting between the two terms. We will revise §3 (Model) and §4.2 (Training protocol) to state this joint training procedure and loss formulation explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity; results are empirical test-set metrics with no derivation chain

full rationale

The paper reports empirical accuracies, AUC, MAE, and EER on the MLADDC T2+T3 test partitions for a 576k-parameter model CAFNet, with direct comparisons to fine-tuned XLS-R and AST. No equations, first-principles derivations, or predictions are claimed that reduce to fitted inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear. The work is self-contained against external benchmarks and receives the default non-finding for an empirical ML paper.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The performance claims rest on the assumption that the MLADDC T2+T3 test set is a valid, unseen benchmark and that standard audio features plus the chosen fusion mechanism generalize; the 576k parameters are fitted quantities whose selection process is not detailed.

free parameters (1)

CAFNet weights and training hyperparameters
The 576k parameters and all training choices are fitted to the training portion of MLADDC.

axioms (1)

domain assumption MLADDC T2+T3 test examples are representative of real-world half-truth manipulations and were not used in model selection
All reported accuracies and MAE values depend on this partition being truly held-out and distributionally relevant.

pith-pipeline@v0.9.1-grok · 5851 in / 1401 out tokens · 41186 ms · 2026-06-29T06:00:34.875876+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 13 canonical work pages

[1]

J. Kong, J. Kim, J. Bae, HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst.33, 17022–17033 (2020). https://doi.org/10.48550/arXiv.2010.05646

work page doi:10.48550/arxiv.2010.05646 2020
[2]

S. Lee, W. Ping, B. Ginsburg, B. Catanzaro, S. Yoon, BigVGAN: a universal neural vocoder with large-scale training, inProc. ICLR(2023).https://doi.org/10.48550/ arXiv.2206.04658

work page arXiv 2023
[3]

K.T. Mai, S. Bray, T. Davies, L.D. Griffin, Warning: humans cannot reliably detect speech deepfakes. PLoS ONE18(8), e0285333 (2023).https://doi.org/10.1371/journal. pone.0285333

work page doi:10.1371/journal 2023
[4]

Shah, R.M

A.J. Shah, R.M. Purohit, D.H. Vaghera, H.A. Patil, MLADDC: multi-lingual audio deep- fake detection corpus, inAudio Imagination: NeurIPS 2024 Workshop(2024).https: //openreview.net/forum?id=ic3HvoOTeU

2024
[5]

Krishnan, Koushik S

Karthik S. Krishnan, Koushik S. Krishnan, MFAAN: unveiling audio deepfakes with a multi-feature authenticity network, inProc. 9th Int. Conf. Signal Process. Commun. (ICSC), pp. 585–590 (2023).https://doi.org/10.1109/ICSC60394.2023.10441405

work page doi:10.1109/icsc60394.2023.10441405 2023
[6]

Khochare, C

J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, F. Kazi, A deep learning framework for audio deepfake detection. Arab. J. Sci. Eng.47(3), 3447–3458 (2022).https://doi.org/ 10.1007/s13369-021-06297-w

work page doi:10.1007/s13369-021-06297-w 2022
[7]

Hamza, A.R

A. Hamza, A.R. Javed, F. Iqbal, N. Kryvinska, A.S. Almadhor, Z. Jalil, R. Borghol, Deepfake audio detection via MFCC features using machine learning. IEEE Access10, 134018–134028 (2022).https://doi.org/10.1109/ACCESS.2022.3231480

work page doi:10.1109/access.2022.3231480 2022
[8]

Reimao, V

R. Reimao, V. Tzerpos, FoR: a dataset for synthetic speech detection, inProc. 2019 Int. Conf. Speech Technol. Human-Comput. Dialogue (SpeD), pp. 1–10. IEEE (2019)

2019
[9]

Frank, L

J. Frank, L. Sch¨ onherr, WaveFake: a data set to facilitate audio deepfake detection, in Proc. NeurIPS 2021 Track Datasets Benchmarks(2021).https://doi.org/10.48550/ arXiv.2111.02813 12

work page arXiv 2021
[10]

Yamagishi, X

J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K.A. Lee, T. Kinnunen, N. Evans, H. Delgado, ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection, inProc. ASVspoof 2021 Workshop, pp. 47–58 (2021).https://doi.org/10.21437/ASVSPOOF.2021-8

work page doi:10.21437/asvspoof.2021-8 2021
[11]

M¨ uller, P

N.M. M¨ uller, P. Czempin, F. Diekmann, A. Froghyar, K. B¨ ottinger, Does audio deepfake detection generalize? inProc. Interspeech 2022, pp. 2783–2787 (2022)

2022
[12]

J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, R. Fu, Half-truth: a partially fake audio detection dataset, inProc. Interspeech 2021, pp. 1654–1658 (2021).https: //doi.org/10.21437/Interspeech.2021-930

work page doi:10.21437/interspeech.2021-930 2021
[13]

A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, M. Auli, XLS-R: self-supervised cross-lingual speech representation learning at scale, inProc. Interspeech 2022, pp. 2278–2282 (2022)

2022
[14]

Gong, Y.-A

Y. Gong, Y.-A. Chung, J. Glass, AST: audio spectrogram transformer, inProc. Interspeech 2021, pp. 571–575 (2021).https://doi.org/10.21437/Interspeech.2021-698

work page doi:10.21437/interspeech.2021-698 2021
[15]

Baevski, Y

A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst.33, 12449–12460 (2020). https://doi.org/10.48550/arXiv.2006.11477

work page doi:10.48550/arxiv.2006.11477 2020
[16]

Parisi, R

G.I. Parisi, R. Kemker, J.L. Part, C. Kanan, S. Wermter, Continual lifelong learning with neural networks: a review. Neural Netw.113, 54–71 (2019).https://doi.org/10.1016/ j.neunet.2019.01.012

2019
[17]

Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A.A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, R. Hadsell, Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114(13), 3521–3526 (2017).https://doi.org/10.1073/pnas.1611835114

work page doi:10.1073/pnas.1611835114 2017
[18]

Z. Cai, M. Li, Integrating frame-level boundary detection and deepfake detection for locat- ing manipulated regions in partially spoofed audio forgery attacks. Comput. Speech Lang. 85, 101597 (2024).https://doi.org/10.1016/j.csl.2023.101597 13

work page doi:10.1016/j.csl.2023.101597 2024

[1] [1]

J. Kong, J. Kim, J. Bae, HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst.33, 17022–17033 (2020). https://doi.org/10.48550/arXiv.2010.05646

work page doi:10.48550/arxiv.2010.05646 2020

[2] [2]

S. Lee, W. Ping, B. Ginsburg, B. Catanzaro, S. Yoon, BigVGAN: a universal neural vocoder with large-scale training, inProc. ICLR(2023).https://doi.org/10.48550/ arXiv.2206.04658

work page arXiv 2023

[3] [3]

K.T. Mai, S. Bray, T. Davies, L.D. Griffin, Warning: humans cannot reliably detect speech deepfakes. PLoS ONE18(8), e0285333 (2023).https://doi.org/10.1371/journal. pone.0285333

work page doi:10.1371/journal 2023

[4] [4]

Shah, R.M

A.J. Shah, R.M. Purohit, D.H. Vaghera, H.A. Patil, MLADDC: multi-lingual audio deep- fake detection corpus, inAudio Imagination: NeurIPS 2024 Workshop(2024).https: //openreview.net/forum?id=ic3HvoOTeU

2024

[5] [5]

Krishnan, Koushik S

Karthik S. Krishnan, Koushik S. Krishnan, MFAAN: unveiling audio deepfakes with a multi-feature authenticity network, inProc. 9th Int. Conf. Signal Process. Commun. (ICSC), pp. 585–590 (2023).https://doi.org/10.1109/ICSC60394.2023.10441405

work page doi:10.1109/icsc60394.2023.10441405 2023

[6] [6]

Khochare, C

J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, F. Kazi, A deep learning framework for audio deepfake detection. Arab. J. Sci. Eng.47(3), 3447–3458 (2022).https://doi.org/ 10.1007/s13369-021-06297-w

work page doi:10.1007/s13369-021-06297-w 2022

[7] [7]

Hamza, A.R

A. Hamza, A.R. Javed, F. Iqbal, N. Kryvinska, A.S. Almadhor, Z. Jalil, R. Borghol, Deepfake audio detection via MFCC features using machine learning. IEEE Access10, 134018–134028 (2022).https://doi.org/10.1109/ACCESS.2022.3231480

work page doi:10.1109/access.2022.3231480 2022

[8] [8]

Reimao, V

R. Reimao, V. Tzerpos, FoR: a dataset for synthetic speech detection, inProc. 2019 Int. Conf. Speech Technol. Human-Comput. Dialogue (SpeD), pp. 1–10. IEEE (2019)

2019

[9] [9]

Frank, L

J. Frank, L. Sch¨ onherr, WaveFake: a data set to facilitate audio deepfake detection, in Proc. NeurIPS 2021 Track Datasets Benchmarks(2021).https://doi.org/10.48550/ arXiv.2111.02813 12

work page arXiv 2021

[10] [10]

Yamagishi, X

J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K.A. Lee, T. Kinnunen, N. Evans, H. Delgado, ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection, inProc. ASVspoof 2021 Workshop, pp. 47–58 (2021).https://doi.org/10.21437/ASVSPOOF.2021-8

work page doi:10.21437/asvspoof.2021-8 2021

[11] [11]

M¨ uller, P

N.M. M¨ uller, P. Czempin, F. Diekmann, A. Froghyar, K. B¨ ottinger, Does audio deepfake detection generalize? inProc. Interspeech 2022, pp. 2783–2787 (2022)

2022

[12] [12]

J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, R. Fu, Half-truth: a partially fake audio detection dataset, inProc. Interspeech 2021, pp. 1654–1658 (2021).https: //doi.org/10.21437/Interspeech.2021-930

work page doi:10.21437/interspeech.2021-930 2021

[13] [13]

A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, M. Auli, XLS-R: self-supervised cross-lingual speech representation learning at scale, inProc. Interspeech 2022, pp. 2278–2282 (2022)

2022

[14] [14]

Gong, Y.-A

Y. Gong, Y.-A. Chung, J. Glass, AST: audio spectrogram transformer, inProc. Interspeech 2021, pp. 571–575 (2021).https://doi.org/10.21437/Interspeech.2021-698

work page doi:10.21437/interspeech.2021-698 2021

[15] [15]

Baevski, Y

A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst.33, 12449–12460 (2020). https://doi.org/10.48550/arXiv.2006.11477

work page doi:10.48550/arxiv.2006.11477 2020

[16] [16]

Parisi, R

G.I. Parisi, R. Kemker, J.L. Part, C. Kanan, S. Wermter, Continual lifelong learning with neural networks: a review. Neural Netw.113, 54–71 (2019).https://doi.org/10.1016/ j.neunet.2019.01.012

2019

[17] [17]

Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A.A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, R. Hadsell, Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114(13), 3521–3526 (2017).https://doi.org/10.1073/pnas.1611835114

work page doi:10.1073/pnas.1611835114 2017

[18] [18]

Z. Cai, M. Li, Integrating frame-level boundary detection and deepfake detection for locat- ing manipulated regions in partially spoofed audio forgery attacks. Comput. Speech Lang. 85, 101597 (2024).https://doi.org/10.1016/j.csl.2023.101597 13

work page doi:10.1016/j.csl.2023.101597 2024