pith. sign in

arxiv: 2605.29531 · v2 · pith:U5YQ4DVHnew · submitted 2026-05-28 · 💻 cs.SD · cs.CV· cs.LG

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

Pith reviewed 2026-06-29 06:00 UTC · model grok-4.3

classification 💻 cs.SD cs.CVcs.LG
keywords audio deepfake detectionhalf-truth localisationcross-attentive fusionternary classificationboundary regressionMFCCLFCCChroma-STFT
0
0 comments X

The pith

A 576k-parameter model jointly classifies real, fully fake, and half-truth audio while regressing the boundaries of spliced segments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CAFNet to tackle partially manipulated speech, a realistic threat where only a short synthesised portion is inserted into genuine audio. It performs ternary classification and boundary regression in one pass by fusing MFCC, LFCC, and Chroma-STFT features through cross-attention. The model is evaluated on the MLADDC T2+T3 test set, reporting 92.71 percent ternary accuracy, 0.9910 macro AUC, and 0.075 second boundary MAE. On binary detection it reaches 96.76 percent accuracy and 3.20 percent EER while using over 500 times fewer parameters than fine-tuned XLS-R or AST models. A cross-dataset study shows that standard fine-tuning of large backbones collapses performance across domains.

Core claim

CAFNet fuses MFCC, LFCC, and Chroma-STFT features through parallel depthwise-separable convolution branches with cross-attention, followed by a BiLSTM regression head for boundary prediction. On the combined MLADDC T2+T3 test set it achieves 92.71 percent accuracy and macro AUC of 0.9910 for ternary classification, boundary localisation MAE of 0.075 seconds, and on binary detection 96.76 percent accuracy with 3.20 percent EER, outperforming fine-tuned XLS-R 300M and AST 87M at over 500 times fewer parameters.

What carries the argument

Cross-attentive feature fusion: parallel depthwise-separable convolution branches on MFCC, LFCC, and Chroma-STFT inputs with cross-attention between branches, plus BiLSTM regression head for joint classification and boundary regression.

If this is right

  • Joint ternary classification and boundary regression can be performed in a single forward pass without separate models for detection and localisation.
  • The small parameter count enables deployment in resource-constrained settings while still outperforming much larger pre-trained models on both tasks.
  • Cross-dataset evaluation reveals that standard fine-tuning of large backbones leads to collapsed representations even with reduced learning rates.
  • The reported median boundary error of 0.052 seconds provides a concrete target for localisation precision in half-truth scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The cross-attention fusion of multiple cepstral features could be adapted to localise manipulations in other sequential signals such as video or sensor data.
  • Lightweight models of this form may support real-time monitoring in applications like voice authentication or social media audio moderation.
  • The ternary output and boundary regression together could support downstream tasks such as automated editing or forensic timeline reconstruction.
  • Testing the architecture on datasets with multiple spliced segments or varying splice lengths would clarify the limits of the current boundary regression head.

Load-bearing premise

The MLADDC T2+T3 test partitions contain held-out half-truth examples whose generation process and acoustic conditions are representative of real-world partial manipulations and were not seen during any hyperparameter search or model selection.

What would settle it

Evaluating CAFNet on a new collection of real-world spliced audio recordings generated by different synthesis methods and recording conditions than those in MLADDC would test whether the reported accuracy and localisation error hold.

Figures

Figures reproduced from arXiv: 2605.29531 by Remya K. Sasi, S. Sutharya.

Figure 1
Figure 1. Figure 1: MFAAN: three parallel 2D-CNN paths for MFCC, LFCC, and Chroma-STFT, fol [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: CAFNet: parallel EnhancedPath branches, CrossAttnFusion module, classification [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: CAFNet confusion matrix on the unified MLADDC T2+T3 test set. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Representative boundary predictions. Top: best case – predicted boundaries match ground truth within one analysis frame. Bottom: hard case – low pHT (0.194) coincides with a 2.54 s localisation error [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CAFNet, a 576k-parameter model for joint ternary classification (real, fully-fake, half-truth) and temporal boundary regression on partially manipulated ('half-truth') audio. It fuses MFCC, LFCC, and Chroma-STFT features via parallel depthwise-separable convolution branches with cross-attention, followed by a BiLSTM regression head. On the combined MLADDC T2+T3 test set the model reports 92.71% ternary accuracy and 0.9910 macro AUC, 0.075 s boundary MAE (median 0.052 s), and for binary detection 96.76% accuracy with 3.20% EER, outperforming fine-tuned XLS-R 300M and AST 87M at >500× fewer parameters. A cross-dataset study on representation collapse under fine-tuning is also presented.

Significance. If the performance numbers are obtained on genuinely held-out half-truth examples whose synthesis pipelines and acoustic conditions were never seen during training or model selection, the result would be significant: it shows that a compact, multi-task architecture can simultaneously solve detection and localisation for a more realistic threat model while remaining deployable on edge devices. The explicit parameter count and joint-training formulation are concrete strengths.

major comments (2)
  1. [§4] §4 (Experiments), data-partition paragraph: the construction of the MLADDC T2 and T3 test partitions is not described (no statement on speaker overlap, vocoder overlap, recording conditions, or whether any half-truth examples were used in hyper-parameter search). Because the headline metrics (92.71% ternary accuracy, 0.9910 macro AUC, 3.20% EER) are only interpretable under a strict held-out regime, this omission is load-bearing for the central claim.
  2. [§3, §4.2] §3 (Model) and §4.2 (Training protocol): it is not stated whether boundary regression is trained jointly with the ternary classifier or post-hoc, nor which loss combination (cross-entropy + L1/L2 on boundaries) is used. The reported 0.075 s MAE cannot be assessed without this information, which directly affects the joint-task claim.
minor comments (2)
  1. [§4.3] The abstract and §4.3 state that CAFNet 'outperforms' XLS-R and AST but do not report the exact fine-tuning protocol (learning-rate schedule, number of epochs, data augmentation) used for the baselines; adding a short table would improve reproducibility.
  2. [Figure 2] Figure 2 (architecture diagram) uses an inconsistent arrow style for the cross-attention blocks; a single legend would aid clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The two major comments highlight important omissions in the experimental description that affect interpretability of the results. We address each point below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments), data-partition paragraph: the construction of the MLADDC T2 and T3 test partitions is not described (no statement on speaker overlap, vocoder overlap, recording conditions, or whether any half-truth examples were used in hyper-parameter search). Because the headline metrics (92.71% ternary accuracy, 0.9910 macro AUC, 3.20% EER) are only interpretable under a strict held-out regime, this omission is load-bearing for the central claim.

    Authors: We agree that the data-partition details are insufficiently described and that this information is necessary to support the held-out evaluation claims. The MLADDC T2 and T3 partitions were constructed with no speaker overlap with the training set, distinct vocoders, and different recording conditions from the training data; no half-truth examples were included in hyper-parameter search or validation. We will expand the data-partition paragraph in §4 to explicitly document these properties, citing the original MLADDC dataset splits. revision: yes

  2. Referee: [§3, §4.2] §3 (Model) and §4.2 (Training protocol): it is not stated whether boundary regression is trained jointly with the ternary classifier or post-hoc, nor which loss combination (cross-entropy + L1/L2 on boundaries) is used. The reported 0.075 s MAE cannot be assessed without this information, which directly affects the joint-task claim.

    Authors: The boundary regression is trained jointly with the ternary classifier in a single multi-task objective. The combined loss is cross-entropy for the three-class classification head plus L1 loss on the boundary regression head, with equal weighting between the two terms. We will revise §3 (Model) and §4.2 (Training protocol) to state this joint training procedure and loss formulation explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity; results are empirical test-set metrics with no derivation chain

full rationale

The paper reports empirical accuracies, AUC, MAE, and EER on the MLADDC T2+T3 test partitions for a 576k-parameter model CAFNet, with direct comparisons to fine-tuned XLS-R and AST. No equations, first-principles derivations, or predictions are claimed that reduce to fitted inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear. The work is self-contained against external benchmarks and receives the default non-finding for an empirical ML paper.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The performance claims rest on the assumption that the MLADDC T2+T3 test set is a valid, unseen benchmark and that standard audio features plus the chosen fusion mechanism generalize; the 576k parameters are fitted quantities whose selection process is not detailed.

free parameters (1)
  • CAFNet weights and training hyperparameters
    The 576k parameters and all training choices are fitted to the training portion of MLADDC.
axioms (1)
  • domain assumption MLADDC T2+T3 test examples are representative of real-world half-truth manipulations and were not used in model selection
    All reported accuracies and MAE values depend on this partition being truly held-out and distributionally relevant.

pith-pipeline@v0.9.1-grok · 5851 in / 1401 out tokens · 41186 ms · 2026-06-29T06:00:34.875876+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 13 canonical work pages

  1. [1]

    J. Kong, J. Kim, J. Bae, HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst.33, 17022–17033 (2020). https://doi.org/10.48550/arXiv.2010.05646

  2. [2]

    S. Lee, W. Ping, B. Ginsburg, B. Catanzaro, S. Yoon, BigVGAN: a universal neural vocoder with large-scale training, inProc. ICLR(2023).https://doi.org/10.48550/ arXiv.2206.04658

  3. [3]

    K.T. Mai, S. Bray, T. Davies, L.D. Griffin, Warning: humans cannot reliably detect speech deepfakes. PLoS ONE18(8), e0285333 (2023).https://doi.org/10.1371/journal. pone.0285333

  4. [4]

    Shah, R.M

    A.J. Shah, R.M. Purohit, D.H. Vaghera, H.A. Patil, MLADDC: multi-lingual audio deep- fake detection corpus, inAudio Imagination: NeurIPS 2024 Workshop(2024).https: //openreview.net/forum?id=ic3HvoOTeU

  5. [5]

    Krishnan, Koushik S

    Karthik S. Krishnan, Koushik S. Krishnan, MFAAN: unveiling audio deepfakes with a multi-feature authenticity network, inProc. 9th Int. Conf. Signal Process. Commun. (ICSC), pp. 585–590 (2023).https://doi.org/10.1109/ICSC60394.2023.10441405

  6. [6]

    Khochare, C

    J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, F. Kazi, A deep learning framework for audio deepfake detection. Arab. J. Sci. Eng.47(3), 3447–3458 (2022).https://doi.org/ 10.1007/s13369-021-06297-w

  7. [7]

    Hamza, A.R

    A. Hamza, A.R. Javed, F. Iqbal, N. Kryvinska, A.S. Almadhor, Z. Jalil, R. Borghol, Deepfake audio detection via MFCC features using machine learning. IEEE Access10, 134018–134028 (2022).https://doi.org/10.1109/ACCESS.2022.3231480

  8. [8]

    Reimao, V

    R. Reimao, V. Tzerpos, FoR: a dataset for synthetic speech detection, inProc. 2019 Int. Conf. Speech Technol. Human-Comput. Dialogue (SpeD), pp. 1–10. IEEE (2019)

  9. [9]

    Frank, L

    J. Frank, L. Sch¨ onherr, WaveFake: a data set to facilitate audio deepfake detection, in Proc. NeurIPS 2021 Track Datasets Benchmarks(2021).https://doi.org/10.48550/ arXiv.2111.02813 12

  10. [10]

    Yamagishi, X

    J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K.A. Lee, T. Kinnunen, N. Evans, H. Delgado, ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection, inProc. ASVspoof 2021 Workshop, pp. 47–58 (2021).https://doi.org/10.21437/ASVSPOOF.2021-8

  11. [11]

    M¨ uller, P

    N.M. M¨ uller, P. Czempin, F. Diekmann, A. Froghyar, K. B¨ ottinger, Does audio deepfake detection generalize? inProc. Interspeech 2022, pp. 2783–2787 (2022)

  12. [12]

    J. Yi, Y. Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, R. Fu, Half-truth: a partially fake audio detection dataset, inProc. Interspeech 2021, pp. 1654–1658 (2021).https: //doi.org/10.21437/Interspeech.2021-930

  13. [13]

    A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, M. Auli, XLS-R: self-supervised cross-lingual speech representation learning at scale, inProc. Interspeech 2022, pp. 2278–2282 (2022)

  14. [14]

    Gong, Y.-A

    Y. Gong, Y.-A. Chung, J. Glass, AST: audio spectrogram transformer, inProc. Interspeech 2021, pp. 571–575 (2021).https://doi.org/10.21437/Interspeech.2021-698

  15. [15]

    Baevski, Y

    A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst.33, 12449–12460 (2020). https://doi.org/10.48550/arXiv.2006.11477

  16. [16]

    Parisi, R

    G.I. Parisi, R. Kemker, J.L. Part, C. Kanan, S. Wermter, Continual lifelong learning with neural networks: a review. Neural Netw.113, 54–71 (2019).https://doi.org/10.1016/ j.neunet.2019.01.012

  17. [17]

    Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al

    J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A.A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, R. Hadsell, Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 114(13), 3521–3526 (2017).https://doi.org/10.1073/pnas.1611835114

  18. [18]

    Z. Cai, M. Li, Integrating frame-level boundary detection and deepfake detection for locat- ing manipulated regions in partially spoofed audio forgery attacks. Comput. Speech Lang. 85, 101597 (2024).https://doi.org/10.1016/j.csl.2023.101597 13