Enhancing Self-Supervised Talking Head Forgery Detection via a Training-Free Dual-System Framework
Pith reviewed 2026-05-08 01:19 UTC · model grok-4.3
The pith
Existing self-supervised talking head forgery detectors can be improved without training by using a dual-system framework that refines anomaly scores on uncertain samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling anomaly-like scores as the fast System-1 and restricting fine-grained evidence-guided reasoning to the uncertain subset identified by threshold routing, the Training-Free Dual-System framework refines the relative ordering of ambiguous samples, yielding consistent gains in forgery detection metrics that stem primarily from improved discrimination within the hard cases.
What carries the argument
The TFDS framework's threshold-based routing that partitions inputs into confident and uncertain subsets based on anomaly scores, combined with evidence-guided reasoning applied only to the uncertain subset to adjust their ordering.
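The routing step can be sketched minimally as below, under the assumption that "uncertain" means scores falling inside a band around the decision threshold; the band half-width is a hypothetical parameter, since the paper only specifies lightweight threshold-based routing:

```python
def route(scores, threshold=0.5, band=0.1):
    """Partition sample indices into confident and uncertain subsets.

    Samples whose anomaly score lies within `band` of the routing
    threshold are treated as uncertain; the rest are confident.
    (Hypothetical rule -- the paper specifies only threshold-based routing.)
    """
    confident, uncertain = [], []
    for i, s in enumerate(scores):
        (uncertain if abs(s - threshold) <= band else confident).append(i)
    return confident, uncertain

scores = [0.05, 0.48, 0.55, 0.92, 0.60]
conf, unc = route(scores)
# System-2 would re-rank only `unc`; `conf` keeps its System-1 ordering.
```

The design choice this illustrates: System-2 cost scales with the size of the uncertain band, not with the dataset, which is what makes the dual-system split cheap as post-processing.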
If this is right
- Consistent improvements in detection performance across multiple datasets and perturbation settings.
- The performance gains come mainly from corrected ordering of samples in the uncertain subset.
- Existing self-supervised detectors contain underexploited discriminative cues that training-free methods can access.
- Reliance on generator-specific patterns is reduced while discriminative capacity on ambiguous cases is enhanced.
Where Pith is reading between the lines
- This method could be tested on other self-supervised anomaly detection tasks in video or image forensics to see if similar gains occur.
- Future work might explore combining the dual-system with other post-hoc techniques to further exploit model capacities.
- The routing thresholds might need tuning per detector, suggesting a direction for adaptive variants without full retraining.
Load-bearing premise
The anomaly scores from existing self-supervised detectors are reliable enough to accurately partition samples into confident and uncertain subsets, allowing the evidence-guided reasoning to improve ordering without adding errors.
What would settle it
Applying the TFDS framework to several self-supervised talking head forgery detectors and finding no measurable increase in detection accuracy or AUC, particularly no improvement in the ranking of uncertain samples.
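Such a settling test could be operationalized as an AUC comparison before and after System-2 reordering, with the claim refuted if no gain appears. A toy sketch (pairwise-comparison AUC, hypothetical scores; the actual System-2 step is unspecified in the abstract):

```python
def auc(scores, labels):
    """Probability a random fake outscores a random real (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy check: correcting two misordered uncertain samples raises AUC.
labels = [0, 0, 1, 1]          # 0 = real, 1 = fake
before = [0.1, 0.55, 0.45, 0.9]  # middle two samples misordered by System-1
after  = [0.1, 0.45, 0.55, 0.9]  # System-2 corrects their relative order
assert auc(after, labels) > auc(before, labels)
```

The same comparison restricted to the uncertain subset would test the paper's stronger claim that gains are localized there.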
Original abstract
Supervised talking head forgery detection faces severe generalization challenges due to the continuous evolution of generators. By reducing reliance on generator-specific forgery patterns, self-supervised detectors offer stronger cross-generator robustness. However, existing research has mainly focused on building stronger detectors, while the discriminative capacity of trained detectors remains insufficiently exploited. In particular, for score-based self-supervised detectors, the limited discriminative ability on hard cases is often reflected in unreliable anomaly ordering, leaving room for further refinement. Motivated by this observation, we draw inspiration from the dual-system theory of human cognition and propose a Training-Free Dual-System (TFDS) framework to further exploit the latent discriminative capacity of existing score-based self-supervised detectors. TFDS treats anomaly-like scores as the basis of System-1, using lightweight threshold-based routing to partition samples into confident and uncertain subsets. System-2 then revisits only the uncertain subset, performing fine-grained evidence-guided reasoning to refine the relative ordering of ambiguous samples within the original score distribution. Extensive experiments demonstrate consistent improvements across datasets and perturbation settings, with the gains arising mainly from corrected ordering within the uncertain subset. These findings show that existing self-supervised talking head forgery detectors still contain underexploited discriminative cues that can be effectively unlocked through training-free dual-system reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Training-Free Dual-System (TFDS) framework for self-supervised talking head forgery detection. It treats anomaly-like scores from existing detectors as System-1, applies lightweight threshold-based routing to split samples into confident and uncertain subsets, and uses System-2 for evidence-guided reasoning only on the uncertain subset to refine relative ordering within the original score distribution. The central claim is that this unlocks latent discriminative capacity in pre-trained detectors, yielding consistent improvements across datasets and perturbations without any training or fine-tuning, with gains attributed primarily to corrected ordering on ambiguous samples.
Significance. If validated with detailed evidence, the result would show that existing score-based self-supervised detectors retain underexploited cues that can be accessed via a simple, training-free dual-process mechanism. This is a practical strength for cross-generator robustness in forgery detection, as it avoids the need for new labeled data or retraining while targeting the known weakness on hard cases. The training-free nature and focus on refining uncertain samples are positive features that could be adopted as a post-processing step for other anomaly-based detectors.
major comments (2)
- [Abstract] The abstract asserts 'consistent improvements across datasets and perturbation settings' and states that gains arise 'mainly from corrected ordering within the uncertain subset,' yet supplies no quantitative metrics, ablation results, or implementation details for the evidence-guided reasoning step in System-2. This omission prevents verification that the improvements stem from the proposed routing and refinement mechanism rather than other factors.
- [Abstract] The framework description (abstract and §3 implied): The routing step assumes anomaly-like scores from self-supervised detectors are sufficiently reliable to partition samples via a lightweight threshold into confident vs. uncertain subsets. However, the abstract itself notes that 'limited discriminative ability on hard cases is often reflected in unreliable anomaly ordering,' which creates a direct risk that hard samples are misrouted, either bypassing refinement or receiving it inappropriately; this threatens attribution of any observed gains to the dual-system design.
minor comments (2)
- The single free parameter (the routing threshold) is noted, but its selection procedure and a sensitivity analysis are not described; documenting both would aid reproducibility.
- Clarify the precise form of 'evidence-guided reasoning' in System-2 (e.g., what constitutes evidence and how it adjusts the original scores) to make the method fully reproducible from the text alone.
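One shape the requested sensitivity analysis could take, on synthetic scores (the score distributions, threshold, and band values here are all hypothetical, chosen only to illustrate the sweep):

```python
import random

random.seed(0)
# Synthetic anomaly scores: reals cluster low, fakes cluster high, with overlap.
reals = [random.gauss(0.35, 0.12) for _ in range(200)]
fakes = [random.gauss(0.65, 0.12) for _ in range(200)]

def uncertain_fraction(band, threshold=0.5):
    """Fraction of samples routed to System-2 for a given band half-width."""
    scores = reals + fakes
    return sum(abs(s - threshold) <= band for s in scores) / len(scores)

for band in (0.05, 0.10, 0.20):
    print(f"band={band:.2f} -> uncertain fraction={uncertain_fraction(band):.2f}")
```

Reporting detection metrics alongside this fraction across band widths would show how robust the gains are to the one free parameter.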
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, offering clarifications based on the content of the full paper and indicating revisions where appropriate to strengthen the presentation.
Point-by-point responses
- Referee: [Abstract] The abstract asserts 'consistent improvements across datasets and perturbation settings' and states that gains arise 'mainly from corrected ordering within the uncertain subset,' yet supplies no quantitative metrics, ablation results, or implementation details for the evidence-guided reasoning step in System-2. This omission prevents verification that the improvements stem from the proposed routing and refinement mechanism rather than other factors.
  Authors: We acknowledge that the abstract's brevity precludes inclusion of specific metrics or implementation details. The full manuscript (Sections 4 and 5) provides these elements, including quantitative AUC improvements across multiple datasets and perturbation settings, along with ablations isolating the contribution of the evidence-guided reasoning in System-2. To improve verifiability directly from the abstract, we will revise it to incorporate concise references to the key performance gains and a high-level description of the System-2 reasoning step. revision: yes
- Referee: [Abstract] The framework description (abstract and §3 implied): The routing step assumes anomaly-like scores from self-supervised detectors are sufficiently reliable to partition samples via a lightweight threshold into confident vs. uncertain subsets. However, the abstract itself notes that 'limited discriminative ability on hard cases is often reflected in unreliable anomaly ordering,' creating a direct risk that hard samples are misrouted, either bypassing refinement or receiving it inappropriately; this threatens attribution of any observed gains to the dual-system design.
  Authors: We appreciate this observation on the potential for routing errors. The framework intentionally routes based on score extremity, recognizing that extreme scores tend to be more reliable while mid-range scores indicate uncertainty, so it is precisely the ambiguous cases that are flagged for refinement. The paper's experiments (Section 4) show that observed gains are localized to the uncertain subset after refinement, supporting attribution to the dual-system approach rather than artifacts of routing. We will add a dedicated discussion paragraph in Section 3 that explicitly analyzes routing robustness and the impact of potential misrouting on overall performance. revision: partial
Circularity Check
No circularity: training-free post-processing on external detector scores
Full rationale
The TFDS framework applies threshold routing and evidence-guided reordering to anomaly scores produced by pre-existing self-supervised detectors. No parameters are fitted to the evaluation data, no equations reduce the output ordering to the input scores by algebraic identity, and no load-bearing premise rests on self-citation of the authors' prior uniqueness results. The derivation chain consists of an external cognitive analogy plus lightweight heuristics whose correctness is tested empirically rather than assumed by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- routing threshold
axioms (1)
- domain assumption: Anomaly-like scores from self-supervised detectors provide a usable basis for the initial confident/uncertain partitioning.