arxiv: 2606.06666 · v1 · pith:OP3X7L7Mnew · submitted 2026-06-04 · 💻 cs.CV

Architecture-Adaptive Uncertainty Fusion for Deepfake Detection

Pith reviewed 2026-06-28 01:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords deepfake detection · uncertainty quantification · distribution shift · correlation optimization · fusion methods · computer vision · forensic deployment · prediction reliability

The pith

Fusing five uncertainty sources by maximizing their correlation with prediction errors yields architecture-specific weights that retain more signal under distribution shift than random forest or nonlinear alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Correlation-Optimized Fusion as a way to combine epistemic, aleatoric, calibration, conformal, and distributional uncertainty estimates without altering the underlying deepfake detector. Weights are found by solving a constrained optimization on the probability simplex that maximizes Pearson correlation between the fused score and observed errors. The weight search is reported to take 42 seconds, against 20 to 45 hours for a 5-model Deep Ensemble. In matched train-test conditions nonlinear methods reach higher correlation, yet under distribution shift the linear correlation-tuned weights outperform random forest on nine of eleven tested models and degrade less severely. All approaches, including COF, lose nearly all correlation when evaluated on entirely new datasets.
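The optimization described above can be written compactly. The notation here is an assumption made for illustration and is not taken from the paper: U is the n × 5 matrix whose columns hold the five uncertainty scores for n samples, e is the vector of observed prediction errors, and w is the weight vector.

```latex
\[
w^{\star} \;=\; \arg\max_{w \in \Delta^{4}} \; \rho\bigl(Uw,\; e\bigr),
\qquad
\Delta^{4} \;=\; \Bigl\{\, w \in \mathbb{R}^{5} \;:\; w_k \ge 0,\ \ \sum_{k=1}^{5} w_k = 1 \,\Bigr\},
\]
\[
\rho(s, e) \;=\;
\frac{\sum_{i=1}^{n} (s_i - \bar{s})(e_i - \bar{e})}
     {\sqrt{\sum_{i=1}^{n} (s_i - \bar{s})^{2}}\;\sqrt{\sum_{i=1}^{n} (e_i - \bar{e})^{2}}} .
\]
```

Because ρ is invariant to positive rescaling of the fused score s = Uw, the sum-to-one constraint fixes the scale of w rather than restricting which directions are reachable; the non-negativity constraint is what actually limits expressiveness.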

Core claim

COF solves a constrained optimization on the probability simplex to find linear weights for five complementary uncertainty sources that maximize Pearson correlation with prediction errors. On FaceForensics++ this produces slightly lower in-domain correlation than nonlinear fusion, but on CelebDF the same weights outperform random forest in nine of eleven architectures and retain substantially more signal after distribution shift. Cross-dataset evaluation on CelebDF and DFDC shows that every method suffers roughly 90 percent degradation, with seven architectures exhibiting uncertainty inversion.

What carries the argument

Correlation-Optimized Fusion (COF): linear combination of five uncertainty sources whose weights are chosen by simplex-constrained maximization of Pearson correlation with observed prediction errors.
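A minimal sketch of what such a weight search could look like, assuming the uncertainty scores are already available as an array with one column per source. The function name `fit_cof_weights`, the array layout, the choice of SciPy's SLSQP solver, and the synthetic data are all assumptions for illustration; the paper's actual solver is not described on this page.

```python
import numpy as np
from scipy.optimize import minimize


def fit_cof_weights(U, errors):
    """Fit non-negative, sum-to-one weights that maximize Pearson r.

    U      : (n_samples, n_sources) uncertainty scores (hypothetical layout)
    errors : (n_samples,) observed prediction errors
    """
    U = np.asarray(U, dtype=float)
    errors = np.asarray(errors, dtype=float)
    k = U.shape[1]

    def neg_corr(w):
        fused = U @ w
        if fused.std() < 1e-12:  # constant fused score: correlation undefined
            return 0.0
        return -np.corrcoef(fused, errors)[0, 1]

    result = minimize(
        neg_corr,
        x0=np.full(k, 1.0 / k),                # start from equal weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,               # w_k >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    w = np.clip(result.x, 0.0, None)           # remove tiny negative round-off
    return w / w.sum()


# Synthetic demonstration: only the first source carries signal.
rng = np.random.default_rng(0)
errors = rng.random(500)
U = rng.random((500, 5))
U[:, 0] = errors + 0.1 * rng.standard_normal(500)

weights = fit_cof_weights(U, errors)
r_fit = np.corrcoef(U @ weights, errors)[0, 1]
r_equal = np.corrcoef(U @ np.full(5, 0.2), errors)[0, 1]
```

The fitted weights can be read directly as the relative contribution of each source, which is the interpretability argument made for a linear fusion.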

If this is right

  • COF applies to any existing detector with no model changes and only 42 seconds of weight search.
  • Under distribution shift the correlation-optimized linear fusion retains more predictive power than random forest in most architectures.
  • Nonlinear fusion methods lose more performance than linear COF when the data distribution changes.
  • Cross-dataset uncertainty estimates collapse to near-zero correlation for all tested methods.
  • COF identifies domain-adaptive uncertainty quantification as the remaining barrier to forensic use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The consistent outperformance of correlation-tuned linear weights under shift implies that architecture-specific linear fusion may be preferable to black-box alternatives when deployment stays within controlled distribution ranges.
  • The near-total loss of correlation on new datasets indicates that current uncertainty sources largely capture dataset-specific artifacts rather than intrinsic model reliability.
  • Monitoring whether the fused uncertainty remains positively correlated with errors on incoming batches could serve as a practical out-of-distribution detector.
  • The observed trade-off between matched-condition accuracy and shift robustness may appear in uncertainty quantification tasks outside deepfake detection.
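The batch-monitoring idea in the third bullet could be sketched as follows. This assumes that a labeled audit sample is available for each incoming batch, since prediction errors cannot be observed without ground truth; the function name `correlation_alarm` and the zero threshold are hypothetical choices, not something the paper specifies.

```python
import numpy as np


def correlation_alarm(fused_uncertainty, errors, threshold=0.0):
    """Return (r, alarm) for one labeled batch.

    alarm is True when Pearson r between fused uncertainty and errors is at
    or below `threshold`, i.e. uncertainty no longer tracks errors or has
    inverted. The threshold value is an illustrative assumption.
    """
    u = np.asarray(fused_uncertainty, dtype=float)
    e = np.asarray(errors, dtype=float)
    if u.std() == 0.0 or e.std() == 0.0:
        return float("nan"), True  # correlation undefined: treat as an alarm
    r = float(np.corrcoef(u, e)[0, 1])
    return r, r <= threshold


# High uncertainty on the wrong predictions: healthy.
r_ok, alarm_ok = correlation_alarm([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
# High uncertainty on the correct predictions: inverted.
r_inv, alarm_inv = correlation_alarm([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])
```

On small batches the sample correlation is noisy, so in practice a rolling window or a confidence interval would likely be needed before acting on a single alarm.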

Load-bearing premise

That weights chosen to maximize correlation on one evaluation distribution will still produce informative fused uncertainties when the input images come from a shifted distribution.

What would settle it

Compute COF weights on FaceForensics++ and test whether they still produce higher error correlation than random forest or equal-weight fusion on a fresh shifted dataset such as CelebDF; reversal of the reported advantage would falsify the transfer claim.
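The test described above amounts to freezing the source-domain weights and scoring them against baselines on shifted data. A rough sketch, assuming the weights and the baseline's uncertainty scores have already been produced elsewhere; `transfer_check`, the example weight vector, and the synthetic arrays are hypothetical stand-ins, not the paper's data or code.

```python
import numpy as np


def fused_corr(U, errors, weights):
    """Pearson r between a linear fusion of the columns of U and the errors."""
    fused = np.asarray(U, dtype=float) @ np.asarray(weights, dtype=float)
    return float(np.corrcoef(fused, errors)[0, 1])


def transfer_check(U_shift, err_shift, frozen_weights, baseline_scores):
    """Score frozen weights against equal weights and a baseline on shifted data.

    No fitting happens here: the weights must come from the source domain.
    """
    k = len(frozen_weights)
    r_cof = fused_corr(U_shift, err_shift, frozen_weights)
    r_equal = fused_corr(U_shift, err_shift, np.full(k, 1.0 / k))
    r_base = float(np.corrcoef(baseline_scores, err_shift)[0, 1])
    return {
        "cof": r_cof,
        "equal": r_equal,
        "baseline": r_base,
        "advantage_holds": r_cof > max(r_equal, r_base),
    }


# Synthetic stand-in for a shifted evaluation set.
rng = np.random.default_rng(1)
err_shift = rng.random(400)
U_shift = rng.random((400, 5))
U_shift[:, 0] = err_shift + 0.1 * rng.standard_normal(400)
frozen_w = np.array([0.6, 0.1, 0.1, 0.1, 0.1])  # hypothetical source-domain weights
rf_scores = rng.random(400)                     # placeholder for a random forest's output

report = transfer_check(U_shift, err_shift, frozen_w, rf_scores)
```

If `advantage_holds` came back False on real shifted data, that would be the reversal the paragraph above describes.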

Figures

Figures reproduced from arXiv: 2606.06666 by Mohammad Ghasemigol, Ritesh Sharma, Yuichi Motai.

Figure 1. Individual uncertainty source correlations ( …

Figure 2. COF outperforms MC Dropout in all eleven architectures. (a) Absolute correlation: COF (blue) consistently exceeds MC Dropout (gray) across all …

Figure 3. COF-5 learned fusion weights across eleven architectures. Three …

Figure 4. Cross-domain robustness reversal: COF vs. Random Forest under matched protocol. (a) In-domain (FF++): RF achieves marginally higher correlation than COF (mean ∆ρ = +0.025; 0.463 vs. 0.438, a 5.7% gap). (b) Cross-domain (CelebDF): COF outperforms RF in 9/11 architectures, with up to 7.3× higher correlation (MaxViT-B: ρ = 0.249 vs. 0.034). The simplex constraint that limits COF’s in-domain expressiveness act…

Figure 5. Uncertainty inversion analysis: in-domain vs. cross-domain correlation for each architecture. Points below the horizontal dashed line ( …
Original abstract

Deepfake detection systems achieve near-perfect accuracy on benchmarks, yet forensic deployment demands reliable prediction uncertainty. Existing uncertainty quantification (UQ) methods rely on single sources and ignore that optimal uncertainty composition varies across architectures. We propose Correlation-Optimized Fusion (COF), an architecture-adaptive framework that fuses five complementary uncertainty sources (epistemic, aleatoric, calibration, conformal, and distributional) by maximizing Pearson correlation between fused uncertainty scores and prediction errors via constrained optimization on the probability simplex. COF requires no model modifications and only 42 s of weight optimization, compared to 20–45 h for a 5-model Deep Ensemble. Evaluation across eleven architectures on FaceForensics++ reveals a fundamental trade-off: under matched train/evaluation protocol, non-linear methods achieve approximately 5–6% higher in-domain correlation than COF (mean r = 0.438), but this reverses under distribution shift. On CelebDF, COF outperforms Random Forest in 9/11 architectures with up to 7.3x higher correlation (MaxViT-B: r = 0.249 vs. 0.034); RF degrades 85% cross-domain to r = 0.071, whereas COF retains substantially more signal (74% drop to r = 0.116). Cross-dataset evaluation on CelebDF and DFDC reveals catastrophic generalization failure across all methods: in-domain correlations of 0.41–0.47 collapse to near-zero externally (mean degradation 90.7%), with seven of eleven architectures exhibiting uncertainty inversion. These results establish COF as a practical, interpretable framework for controlled-distribution deployment and identify domain-adaptive UQ as the central open challenge for forensic deployment.
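The headline percentages in the abstract can be cross-checked against the mean correlations quoted in the Figure 4 caption (0.463 for RF and 0.438 for COF in-domain). The check below assumes each "drop" is measured relative to the in-domain mean; that definition is an inference, since the formula is not given on this page.

```python
def relative_drop(in_domain_r, cross_domain_r):
    """Fraction of in-domain correlation lost after the shift."""
    return (in_domain_r - cross_domain_r) / in_domain_r


rf_drop = relative_drop(0.463, 0.071)      # RF: FF++ mean -> CelebDF mean
cof_drop = relative_drop(0.438, 0.116)     # COF: FF++ mean -> CelebDF mean
maxvit_ratio = 0.249 / 0.034               # COF vs. RF on MaxViT-B (CelebDF)
in_domain_gap = (0.463 - 0.438) / 0.438    # RF advantage relative to COF

print(f"RF drop:       {rf_drop:.1%}")
print(f"COF drop:      {cof_drop:.1%}")
print(f"MaxViT ratio:  {maxvit_ratio:.1f}x")
print(f"In-domain gap: {in_domain_gap:.1%}")
```

Under that assumption the numbers are mutually consistent: the drops round to 85% and 74%, the ratio to 7.3x, and the in-domain gap to 5.7%, which sits inside the "approximately 5–6%" range stated in the abstract.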

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Correlation-Optimized Fusion (COF), an architecture-adaptive method that fuses five uncertainty sources (epistemic, aleatoric, calibration, conformal, distributional) for deepfake detectors by solving a constrained optimization on the probability simplex to maximize Pearson correlation between the fused score and observed prediction errors. It reports that non-linear baselines outperform COF in-domain on FaceForensics++ (mean r = 0.463 for Random Forest vs. 0.438 for COF), but COF outperforms Random Forest on CelebDF in 9/11 architectures (e.g., MaxViT-B: r = 0.249 vs. 0.034) and degrades less under shift (74% drop to r = 0.116 vs. an 85% drop to r = 0.071 for RF); all methods suffer near-total collapse (mean 90.7% degradation) in the cross-dataset evaluation on CelebDF and DFDC.

Significance. If the optimization split is out-of-sample, the work usefully demonstrates that linear fusion weights can be architecture-specific and more robust than non-linear alternatives under controlled distribution shift, while quantifying the severe generalization failure of current UQ methods; the 42-second optimization cost versus ensemble training time is a practical advantage. The identification of domain-adaptive UQ as the central open problem is well-supported by the reported cross-dataset collapse.

major comments (2)
  1. [Abstract and method description] The central empirical claims on CelebDF (outperformance vs. RF and reduced degradation under shift) rest on the COF weight optimization procedure. The abstract and description state that weights are obtained by maximizing Pearson correlation with prediction errors, but provide no indication of whether this optimization uses a held-out validation partition, source-domain (FF++) data only, or the CelebDF evaluation predictions themselves. If performed on the same CelebDF errors used to report the r values, the superiority is expected by construction and does not establish that the weights remain informative for new inputs.
  2. [Evaluation on CelebDF and cross-dataset results] Table or figure reporting the CelebDF and cross-domain results (e.g., the 9/11 outperformance and degradation percentages) must include the exact data partition used for weight optimization, the number of optimization runs, and any statistical tests or error bars on the reported Pearson r values; without this, the cross-domain retention claim (74% vs. 85% drop) cannot be evaluated for robustness.
minor comments (2)
  1. [Method] Clarify whether the five uncertainty sources are computed from a single forward pass or require multiple inferences; the 42 s optimization time suggests the former, but this should be stated explicitly.
  2. [Cross-dataset evaluation] The claim of 'catastrophic generalization failure' across all methods would be strengthened by reporting the raw in-domain and out-of-domain r values for each of the eleven architectures rather than only means and selected examples.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for clarity on the optimization procedure and for suggesting improvements to our evaluation reporting. We address each point below and will incorporate the necessary revisions.

  1. Referee: [Abstract and method description] The central empirical claims on CelebDF (outperformance vs. RF and reduced degradation under shift) rest on the COF weight optimization procedure. The abstract and description state that weights are obtained by maximizing Pearson correlation with prediction errors, but provide no indication of whether this optimization uses a held-out validation partition, source-domain (FF++) data only, or the CelebDF evaluation predictions themselves. If performed on the same CelebDF errors used to report the r values, the superiority is expected by construction and does not establish that the weights remain informative for new inputs.

    Authors: The weight optimization for each architecture is performed exclusively on the source-domain FaceForensics++ dataset using a held-out validation partition within FF++. The optimized weights are then transferred to the CelebDF evaluation set without any further fitting. This design choice is what allows us to demonstrate the robustness of the linear fusion under distribution shift. We will revise the abstract and method sections to explicitly describe this source-domain optimization procedure. revision: yes

  2. Referee: [Evaluation on CelebDF and cross-dataset results] Table or figure reporting the CelebDF and cross-domain results (e.g., the 9/11 outperformance and degradation percentages) must include the exact data partition used for weight optimization, the number of optimization runs, and any statistical tests or error bars on the reported Pearson r values; without this, the cross-domain retention claim (74% vs. 85% drop) cannot be evaluated for robustness.

    Authors: We agree that additional details on the experimental protocol are necessary for full reproducibility and assessment of robustness. In the revised manuscript, we will augment the relevant tables and figures with: (i) the exact data partitions used for optimization (source-domain validation split), (ii) the number of optimization runs performed, and (iii) error bars representing standard deviation across runs on the Pearson r values. We will also include a note on the absence of formal statistical tests, as the comparisons are primarily descriptive. revision: yes
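One way to attach an interval to a reported Pearson r is a percentile bootstrap over samples. This is a common generic technique and not the procedure the authors describe (they mention standard deviation across optimization runs); the function name, defaults, and synthetic data below are assumptions for illustration.

```python
import numpy as np


def bootstrap_pearson_ci(scores, errors, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for Pearson r."""
    scores = np.asarray(scores, dtype=float)
    errors = np.asarray(errors, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(scores)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        rs[b] = np.corrcoef(scores[idx], errors[idx])[0, 1]
    # nanquantile ignores degenerate resamples where r is undefined
    lo, hi = np.nanquantile(rs, [alpha / 2, 1 - alpha / 2])
    point = float(np.corrcoef(scores, errors)[0, 1])
    return point, float(lo), float(hi)


# Synthetic example: a fused score that is moderately correlated with errors.
rng = np.random.default_rng(42)
errors = rng.random(300)
scores = errors + 0.5 * rng.standard_normal(300)

point, lo, hi = bootstrap_pearson_ci(scores, errors)
```

An interval of this kind would show whether a gap such as 0.116 vs. 0.071 is distinguishable from sampling noise for a given architecture, which the descriptive comparison alone does not establish.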

Circularity Check

1 step flagged

COF weights fitted by maximizing correlation to evaluation-set errors; reported superiority over RF is by construction on CelebDF

specific steps
  1. fitted input called prediction [Abstract (COF definition)]
    "fuses five complementary uncertainty sources -- epistemic, aleatoric, calibration, conformal, and distributional -- by maximizing Pearson correlation between fused uncertainty scores and prediction errors via constrained optimization on the probability simplex"

    The optimization solves for weights that maximize the exact quantity (Pearson r to prediction errors) later reported as the method's performance on CelebDF. When performed on the same distribution used for the 9/11 architecture comparisons and shift results, the reported correlations are the fitted values by construction rather than out-of-sample predictions.

full rationale

The paper defines COF via constrained optimization that directly maximizes Pearson r between the linear fusion and observed prediction errors. The abstract and skeptic description give no evidence of a held-out optimization split separate from the CelebDF evaluation distribution on which the r values (0.249 vs 0.034, 74% drop) are reported. This reduces the central empirical claim to a fitted linear combination evaluated on its own training errors rather than an independent prediction. No other circular steps identified; the rest of the architecture comparison and cross-dataset degradation results stand on their own measurements.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on fitted fusion weights and the assumption that the five uncertainty sources can be usefully combined; no independent evidence for the sources' complementarity is supplied beyond the optimization itself.

free parameters (1)
  • fusion weights on probability simplex
    Five non-negative weights summing to one, chosen by constrained optimization to maximize Pearson correlation with prediction errors.
axioms (1)
  • domain assumption: The five uncertainty sources (epistemic, aleatoric, calibration, conformal, distributional) are complementary and admit a linear combination that improves correlation with errors.
    Invoked by the definition of COF as a fusion of these specific sources.


