pith. machine review for the scientific record.

arxiv: 2605.03390 · v1 · submitted 2026-05-05 · 💻 cs.CV · cs.AI · cs.MM

Recognition: unknown

Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 01:19 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords self-supervised learning · talking head forgery detection · dual-system framework · anomaly detection · training-free · forgery detection · video forensics · deepfake detection

The pith

Existing self-supervised talking head forgery detectors can be improved without training by using a dual-system framework that refines anomaly scores on uncertain samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that self-supervised methods for detecting forged talking heads generalize better across generators than supervised ones, but still struggle with hard cases where anomaly scores yield unreliable orderings. Drawing on the dual-process theory of cognition, it introduces a training-free framework that uses the raw scores for quick decisions on clear cases and applies detailed evidence-based reasoning only to ambiguous ones. This selectively corrects the relative rankings within the uncertain group, improving overall detection across datasets and perturbation settings. The approach suggests that existing detectors encode more discriminative information than standard score readouts extract, and that this capacity can be unlocked through lightweight routing and refinement.

Core claim

By modeling anomaly-like scores as the fast System-1 and restricting fine-grained evidence-guided reasoning to the uncertain subset identified by threshold routing, the Training-Free Dual-System framework refines the relative ordering of ambiguous samples, yielding consistent gains in forgery detection metrics that stem primarily from improved discrimination within the hard cases.

What carries the argument

The TFDS framework's threshold-based routing, which partitions inputs into confident and uncertain subsets based on anomaly scores, combined with evidence-guided reasoning applied only to the uncertain subset to adjust its internal ordering.
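As a concrete, purely illustrative rendering of this mechanism (not the paper's code), the routing-plus-refinement step might look like the sketch below. The quantile band, the `refine` callable, and the slot-preserving write-back are assumptions standing in for details this review does not specify:

```python
import numpy as np

def tfds_rescore(scores, low_q=0.25, high_q=0.75, refine=None):
    """Illustrative TFDS-style rescoring (hypothetical, not the paper's code).

    scores : anomaly-like scores from a frozen self-supervised detector.
    low_q, high_q : hypothetical routing quantiles; the paper's actual
        threshold choice is not specified in this review.
    refine : callable mapping a sample index to a finer evidence score,
        standing in for System-2's evidence-guided reasoning.
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = np.quantile(scores, [low_q, high_q])
    # System-1: extreme scores are trusted; mid-band scores are routed on.
    uncertain = np.where((scores >= lo) & (scores <= hi))[0]

    out = scores.copy()
    if refine is not None and len(uncertain) > 0:
        # System-2: re-rank only the uncertain subset, then write the new
        # ordering back into the original score slots, so the global score
        # distribution is preserved ("slot-preserving" refinement).
        slots = np.sort(scores[uncertain])
        order = np.argsort([refine(i) for i in uncertain])
        out[uncertain[order]] = slots
    return out
```

Note that a refiner agreeing with the raw ordering inside the band leaves the scores unchanged; only disagreement on ambiguous samples changes anything, which matches the claim that the gains live entirely in the uncertain subset.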

If this is right

  • Consistent improvements in detection performance across multiple datasets and perturbation settings.
  • The performance gains come mainly from corrected ordering of samples in the uncertain subset.
  • Existing self-supervised detectors contain underexploited discriminative cues that training-free methods can access.
  • Self-supervised detectors can reduce reliance on generator-specific patterns while gaining discriminative capacity on ambiguous cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be tested on other self-supervised anomaly detection tasks in video or image forensics to see if similar gains occur.
  • Future work might explore combining the dual-system with other post-hoc techniques to further exploit model capacities.
  • The routing thresholds might need tuning per detector, suggesting a direction for adaptive variants without full retraining.

Load-bearing premise

The anomaly scores from existing self-supervised detectors are reliable enough to accurately partition samples into confident and uncertain subsets, allowing the evidence-guided reasoning to improve ordering without adding errors.

What would settle it

Applying the TFDS framework to several self-supervised talking head forgery detectors and finding no measurable increase in detection accuracy or AUC, particularly no improvement in the ranking of uncertain samples.
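That prediction is checkable with a subset-restricted metric: compare the AUC gain overall against the gain computed on the uncertain subset alone. The sketch below is my own bookkeeping, not the paper's evaluation code; the `auc` helper is a standard rank-based estimator:

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: probability a random fake outranks a random real."""
    s, y = np.asarray(scores, float), np.asarray(labels, int)
    ranks = s.argsort().argsort() + 1  # 1-based ranks, ties broken arbitrarily
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auc_gains(before, after, labels, uncertain_idx):
    """Return (overall AUC gain, AUC gain restricted to the uncertain subset)."""
    before, after = np.asarray(before, float), np.asarray(after, float)
    y, u = np.asarray(labels, int), np.asarray(uncertain_idx, int)
    overall = auc(after, y) - auc(before, y)
    subset = auc(after[u], y[u]) - auc(before[u], y[u])
    return overall, subset
```

A null result would be both gains indistinguishable from zero across detectors; the paper's claim predicts the subset gain dominates the overall gain.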

Figures

Figures reproduced from arXiv: 2605.03390 by Jiwei Wei, Ke Liu, Ruikun Chai, Shuchang Zhou, Yang Yang, Yitong Qin, Yutong Xiao, Yuyang Zhou.

Figure 1. The main detection difficulty is concentrated on the un… (caption truncated at source)
Figure 2. System-1 for uncertainty routing. Labels are used only on… (caption truncated at source)
Figure 3. Overview of System-2 for fine-grained evidence-guided reasoning and slot-preserving refinement. (1) For an uncertain video, frozen… (caption truncated at source)
Figure 4. Paired percentile ranks assigned by the official AVH-Align… (caption truncated at source)
Figure 5. Rank displacement of uncertain samples on AVLips (left)… (caption truncated at source)
Original abstract

Supervised talking head forgery detection faces severe generalization challenges due to the continuous evolution of generators. By reducing reliance on generator-specific forgery patterns, self-supervised detectors offer stronger cross-generator robustness. However, existing research has mainly focused on building stronger detectors, while the discriminative capacity of trained detectors remains insufficiently exploited. In particular, for score-based self-supervised detectors, the limited discriminative ability on hard cases is often reflected in unreliable anomaly ordering, leaving room for further refinement. Motivated by this observation, we draw inspiration from the dual-system theory of human cognition and propose a Training-Free Dual-System (TFDS) framework to further exploit the latent discriminative capacity of existing score-based self-supervised detectors. TFDS treats anomaly-like scores as the basis of System-1, using lightweight threshold-based routing to partition samples into confident and uncertain subsets. System-2 then revisits only the uncertain subset, performing fine-grained evidence-guided reasoning to refine the relative ordering of ambiguous samples within the original score distribution. Extensive experiments demonstrate consistent improvements across datasets and perturbation settings, with the gains arising mainly from corrected ordering within the uncertain subset. These findings show that existing self-supervised talking head forgery detectors still contain underexploited discriminative cues that can be effectively unlocked through training-free dual-system reasoning.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Training-Free Dual-System (TFDS) framework for self-supervised talking head forgery detection. It treats anomaly-like scores from existing detectors as System-1, applies lightweight threshold-based routing to split samples into confident and uncertain subsets, and uses System-2 for evidence-guided reasoning only on the uncertain subset to refine relative ordering within the original score distribution. The central claim is that this unlocks latent discriminative capacity in pre-trained detectors, yielding consistent improvements across datasets and perturbations without any training or fine-tuning, with gains attributed primarily to corrected ordering on ambiguous samples.

Significance. If validated with detailed evidence, the result would show that existing score-based self-supervised detectors retain underexploited cues that can be accessed via a simple, training-free dual-process mechanism. This is a practical strength for cross-generator robustness in forgery detection, as it avoids the need for new labeled data or retraining while targeting the known weakness on hard cases. The training-free nature and focus on refining uncertain samples are positive features that could be adopted as a post-processing step for other anomaly-based detectors.

major comments (2)
  1. [Abstract] The abstract asserts 'consistent improvements across datasets and perturbation settings' and states that gains arise 'mainly from corrected ordering within the uncertain subset,' yet supplies no quantitative metrics, ablation results, or implementation details for the evidence-guided reasoning step in System-2. This omission prevents verification that the improvements stem from the proposed routing and refinement mechanism rather than other factors.
  2. [Abstract; framework description, §3 implied] The routing step assumes anomaly-like scores from self-supervised detectors are reliable enough to partition samples via a lightweight threshold into confident and uncertain subsets. However, the abstract itself notes that 'limited discriminative ability on hard cases is often reflected in unreliable anomaly ordering,' so hard samples risk being misrouted, either bypassing refinement or receiving it inappropriately; this threatens the attribution of any observed gains to the dual-system design.
minor comments (2)
  1. The single free parameter (routing threshold) is noted but its selection procedure and sensitivity analysis are not described, which would aid reproducibility.
  2. Clarify the precise form of 'evidence-guided reasoning' in System-2 (e.g., what constitutes evidence and how it adjusts the original scores) to make the method fully reproducible from the text alone.
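On the first minor point: one standard, label-light way to pick a single score cutoff on a validation split is the Youden index. Whether TFDS selects its routing thresholds this way is not stated in the material above, so the sketch below is a plausible baseline procedure rather than the paper's method:

```python
import numpy as np

def youden_threshold(scores, labels):
    """Pick the score cutoff maximizing J = TPR - FPR (the Youden index).

    A standard validation-set procedure for choosing one threshold; treat
    this as one plausible instantiation, not the paper's stated method.
    """
    s, y = np.asarray(scores, float), np.asarray(labels, int)
    best_t, best_j = None, -1.0
    for t in np.unique(s):
        pred = s >= t                       # predict "fake" above the cutoff
        tpr = float(np.mean(pred[y == 1]))  # true-positive rate
        fpr = float(np.mean(pred[y == 0]))  # false-positive rate
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t, best_j
```

A sensitivity analysis would then sweep a band of widths around this cutoff and report detection metrics at each width, which is the reproducibility material the minor comment requests.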

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, offering clarifications based on the content of the full paper and indicating revisions where appropriate to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts 'consistent improvements across datasets and perturbation settings' and states that gains arise 'mainly from corrected ordering within the uncertain subset,' yet supplies no quantitative metrics, ablation results, or implementation details for the evidence-guided reasoning step in System-2. This omission prevents verification that the improvements stem from the proposed routing and refinement mechanism rather than other factors.

    Authors: We acknowledge that the abstract's brevity precludes inclusion of specific metrics or implementation details. The full manuscript (Sections 4 and 5) provides these elements, including quantitative AUC improvements across multiple datasets and perturbation settings, along with ablations isolating the contribution of the evidence-guided reasoning in System-2. To improve verifiability directly from the abstract, we will revise it to incorporate concise references to the key performance gains and a high-level description of the System-2 reasoning step. revision: yes

  2. Referee: [Abstract; framework description, §3 implied] The routing step assumes anomaly-like scores from self-supervised detectors are reliable enough to partition samples via a lightweight threshold into confident and uncertain subsets. However, the abstract itself notes that 'limited discriminative ability on hard cases is often reflected in unreliable anomaly ordering,' so hard samples risk being misrouted, either bypassing refinement or receiving it inappropriately; this threatens the attribution of any observed gains to the dual-system design.

    Authors: We appreciate this observation on the potential for routing errors. The framework intentionally routes based on score extremity to flag ambiguous cases for refinement, recognizing that extreme scores tend to be more reliable while mid-range scores indicate uncertainty. The paper's experiments (Section 4) show that observed gains are localized to the uncertain subset after refinement, supporting attribution to the dual-system approach rather than artifacts of routing. We will add a dedicated discussion paragraph in Section 3 to explicitly analyze routing robustness and the impact of potential misrouting on overall performance. revision: partial
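The misrouting concern can also be quantified directly. The toy diagnostic below is entirely my construction (the 'oracle' ambiguity scores do not exist in practice): it measures how often genuinely ambiguous samples fall outside the routed band, i.e. bypass System-2:

```python
import numpy as np

def misrouting_rate(true_ambiguity, noisy_scores, low_q=0.25, high_q=0.75):
    """Fraction of genuinely ambiguous samples that escape the routing band.

    true_ambiguity : hypothetical oracle ambiguity (unavailable in practice).
    noisy_scores   : the detector's anomaly scores actually used for routing.
    Purely illustrative of the referee's misrouting worry.
    """
    t = np.asarray(true_ambiguity, float)
    n = np.asarray(noisy_scores, float)

    def band(x):
        lo, hi = np.quantile(x, [low_q, high_q])
        return (x >= lo) & (x <= hi)

    truly_uncertain, routed_uncertain = band(t), band(n)
    # Hard cases that System-1 wrongly treats as confident:
    missed = truly_uncertain & ~routed_uncertain
    return float(missed.sum() / max(truly_uncertain.sum(), 1))
```

If routing by score extremity is as reliable as the rebuttal argues, this rate should stay low under the perturbation settings the paper evaluates.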

Circularity Check

0 steps flagged

No circularity: training-free post-processing on external detector scores

Full rationale

The TFDS framework applies threshold routing and evidence-guided reordering to anomaly scores produced by pre-existing self-supervised detectors. No parameters are fitted to the evaluation data, no equation reduces the output ordering to the input scores by algebraic identity, and no load-bearing premise rests on self-citation of the authors' prior results. The derivation chain consists of an external cognitive analogy plus lightweight heuristics whose correctness is tested empirically rather than assumed by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that existing detector scores can be meaningfully partitioned and refined without training, plus the ad-hoc choice of routing thresholds.

free parameters (1)
  • routing threshold
    Lightweight threshold-based routing requires at least one threshold parameter to separate confident from uncertain samples.
axioms (1)
  • domain assumption Anomaly-like scores from self-supervised detectors provide a usable basis for initial confident/uncertain partitioning.
    The entire routing step depends on this property holding for the base detectors.

pith-pipeline@v0.9.0 · 5550 in / 1315 out tokens · 100200 ms · 2026-05-08T01:19:06.891214+00:00 · methodology

discussion (0)

