Pith · machine review for the scientific record

arxiv: 2605.10334 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 Lean theorem links

The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords deepfake detection · alpha blending · compositing artifacts · self-blended images · cross-dataset generalization · shortcut learning · face manipulation

The pith

Deepfake detectors primarily search for alpha blending compositing artifacts rather than semantic or generative cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper advances the Alpha Blending Hypothesis that current frame-based deepfake detectors work chiefly by locating low-level compositing traces left when manipulated faces are inserted into real frames. If this holds, then detectors succeed through a simple shortcut rather than by recognizing what makes an image forged. Experiments show these models react strongly to self-blended real images and to ordinary non-neural edits. The authors therefore train a detector using only real facial photos augmented with artificial blends and report competitive results across many held-out deepfake test collections. Combining the blending-focused model with one that ignores such cues further raises performance.

Core claim

The central claim is that state-of-the-art frame-based detectors primarily function as alpha blending searchers. They localize low-level compositing artifacts introduced during the integration of manipulated faces into target frames, rather than learning semantic anomalies or specific generative neural fingerprints.
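The compositing operation at the center of the hypothesis is ordinary alpha blending: each output pixel is a convex combination of a source face and a target frame. A minimal NumPy sketch of the idea (illustrative only; the paper's actual mask shapes and alpha schedules are not reproduced here):

```python
import numpy as np

def alpha_blend(source, target, alpha):
    """Per-pixel alpha compositing: out = alpha*source + (1-alpha)*target.

    `alpha` is a mask in [0, 1] with the same spatial shape as the images.
    Intermediate alpha values create the soft transition band whose
    low-level traces the hypothesis says detectors learn to find.
    """
    return alpha * source + (1.0 - alpha) * target

# Toy single-channel 4x4 example: a hard mask leaves a sharp seam,
# a feathered (linearly ramped) mask leaves a gradual one.
source = np.ones((4, 4))
target = np.zeros((4, 4))

hard_mask = np.zeros((4, 4))
hard_mask[:, :2] = 1.0                                   # abrupt 1 -> 0 step
soft_mask = np.tile(np.linspace(1.0, 0.0, 4), (4, 1))    # gradual ramp

hard = alpha_blend(source, target, hard_mask)
soft = alpha_blend(source, target, soft_mask)
```

A hard step in the mask leaves a sharp discontinuity at the seam; a feathered ramp leaves a soft one. Both are the kind of low-level compositing trace the hypothesis says detectors key on.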

What carries the argument

The Alpha Blending Hypothesis, which frames detector success as the detection of alpha blending artifacts created when fake faces are composited into target frames.

If this is right

  • Detectors exhibit high sensitivity to self-blended images and to simple non-generative manipulations.
  • Training on real images augmented only with synthetic blends produces strong average cross-dataset generalization on compositional deepfake collections.
  • Explicit blending searchers and models that avoid blending shortcuts produce complementary predictions.
  • An ensemble of the two types of models yields improved detection results.
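The complementarity claim can be made concrete with a toy score-averaging ensemble and a rank-based AUROC. The detector names, scores, and size of the gain below are invented for illustration; only the mechanism (complementary errors cancelling under averaging) mirrors the paper's ensemble result.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: Mann-Whitney U statistic divided by n_pos * n_neg."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical scores from a blending searcher and a blending-resilient
# model: each misranks a different fake, so averaging fixes both.
labels   = np.array([0, 0, 1, 1, 1, 0])
blender  = np.array([0.10, 0.20, 0.90, 0.30, 0.80, 0.40])  # misranks the fake at index 3
semantic = np.array([0.20, 0.50, 0.40, 0.90, 0.70, 0.30])  # misranks the fake at index 2
ensemble = (blender + semantic) / 2
```

Here each individual model scores 8/9 AUROC while the averaged ensemble separates all fakes from all reals, the toy analogue of the reported ensemble gain.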

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Many existing deepfake benchmarks may be solved largely by exploiting compositing cues instead of learning forgery semantics.
  • Future detectors could be hardened by training on data that deliberately varies or removes blending signals.
  • The method suggests it is possible to build effective detectors without ever training on explicitly generated deepfake examples.

Load-bearing premise

High sensitivity to self-blended real images and non-generative manipulations shows that detectors rely primarily on compositing artifacts rather than other cues.
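This premise can be probed without any generator: composite a photometrically perturbed copy of a real image back onto itself through a mask. The brightness gain, mask geometry, and image below are placeholders, not the paper's SBI parameters; the point is only that such a sample carries compositing traces and no neural fingerprints.

```python
import numpy as np

def self_blend(image, mask, gain=1.5):
    """Build a 'fake' from a single real image: brighten a copy and
    composite it back onto the original through `mask`. The output
    contains compositing artifacts but no generative fingerprints,
    which is what makes it a probe of the blending-shortcut premise.
    """
    perturbed = np.clip(image * gain, 0.0, 1.0)  # placeholder perturbation
    return mask * perturbed + (1.0 - mask) * image

# A flat gray image with a central square mask: inside the mask the
# pixel values change; outside, the image is untouched by construction.
image = np.full((8, 8), 0.4)
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
blended = self_blend(image, mask)
```

Scoring such Real-on-Real samples with a pre-trained detector, as in the sensitivity experiments, tests whether the detector responds to the seam itself rather than to any property of a generator.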

What would settle it

A detector that continues to generalize across deepfake datasets after all compositing artifacts have been removed from both training and test images would falsify the hypothesis.

Figures

Figures reproduced from arXiv: 2605.10334 by Andrii Yermakov, Jan Cech, Jiri Matas, Mario Fritz.

Figure 1: The immunization effect. Validation (left) and training (right) curves for PE.
Figure 2: Sensitivity of GenD [53] and FS-VFM [44] to alpha blending (hard/soft discontinuities). Right: samples from the "Real-on-Real" dataset, where a +100% brightness shift or other augmentation is used during "fake" sample creation (classes in brackets). Crucially, such samples contain no neural fingerprints; if a pre-trained detector is invariant to this manipulation, it will have a low AUROC.
Figure 3: DINO, CLIP and PE backbones: average cross-dataset test AUROC as a function of training.
Original abstract

Recent deepfake detection methods demonstrate improved cross-dataset generalization, yet the underlying mechanisms remain underexplored. We introduce the Alpha Blending Hypothesis, positing that state-of-the-art frame-based detectors primarily function as alpha blending searchers; rather than learning semantic anomalies or specific generative neural fingerprints, they localize low-level compositing artifacts introduced during the integration of manipulated faces into target frames. We experimentally validate the hypothesis, demonstrating that deepfake detectors exhibit high sensitivity to the so-called self-blended images (SBI) and non-generative manipulations. We propose the method BlenD that leverages a large-scale, diverse dataset of real-only facial images augmented with SBI. This approach achieves the best average cross-dataset generalization on 15 compositional deepfake datasets released between 2019 and 2025 without utilizing explicitly generated deepfakes during training. Furthermore, we show that predictions from explicit blending searchers and models resilient to blending shortcuts are highly complementary, yielding a state-of-the-art AUROC of 94.0% in an ensemble configuration. The code with experiments and the trained model will be publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces the Alpha Blending Hypothesis, which posits that state-of-the-art frame-based deepfake detectors primarily function as alpha blending searchers that localize low-level compositing artifacts from face integration rather than semantic anomalies or generative neural fingerprints. It validates the hypothesis empirically by demonstrating high detector sensitivity to self-blended images (SBI) and non-generative manipulations, proposes the BlenD training method that uses only real facial images augmented with SBI to achieve the best average cross-dataset generalization across 15 compositional deepfake datasets (2019-2025) without any explicitly generated deepfakes, and shows that an ensemble of explicit blending searchers with blending-resilient models reaches 94.0% AUROC. The code, experiments, and trained model are to be released publicly.

Significance. If the central results hold, the work is significant for offering a mechanistic account of why recent deepfake detectors generalize well across datasets and for introducing a practical, real-image-only training regime that sidesteps the need to generate or curate synthetic deepfakes. The scale of the evaluation (15 datasets spanning multiple years), the demonstration of complementarity via the ensemble, and the commitment to public code and model release are clear strengths that support reproducibility and further investigation in the field.

major comments (2)
  1. [Hypothesis validation experiments] Abstract and hypothesis validation: The claim that detectors 'primarily' function as alpha blending searchers is inferred from observed high sensitivity to SBI and non-generative manipulations plus the success of the SBI-only BlenD regime. However, this remains correlational evidence of sufficiency; the manuscript does not include a controlled ablation that removes or masks alpha-blending artifacts while preserving other potential cues (semantic inconsistencies or generative fingerprints) and then quantifies the resulting performance degradation. Such an isolation test is load-bearing for the 'primarily' qualifier and would be required to rule out mixed reliance on multiple cue types.
  2. [BlenD training and evaluation] BlenD method and cross-dataset results: While the average generalization across 15 datasets is presented as state-of-the-art, the manuscript should provide per-dataset AUROC breakdowns (ideally in a table) together with explicit details on SBI generation parameters, the real-to-blended sample ratio, and any post-processing steps. Without these, it is difficult to assess whether the reported gains are robust or could be influenced by unstated dataset-construction choices.
minor comments (3)
  1. [Abstract] The abstract states that BlenD 'achieves the best average cross-dataset generalization' but does not report the numerical average AUROC value or list the specific baseline methods and scores being compared; adding these figures would make the summary self-contained.
  2. [Figures illustrating SBI and detector responses] Figure captions and SBI examples should explicitly state the blending parameters (alpha range, mask type, source/target image selection) used to generate the illustrated self-blended images so that readers can reproduce the sensitivity tests.
  3. [Discussion or conclusion] The manuscript would benefit from a short discussion of potential limitations, such as whether the alpha-blending shortcut remains dominant for video-level or temporal deepfake detectors that may exploit additional motion cues.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: The claim that detectors 'primarily' function as alpha blending searchers is inferred from high sensitivity to SBI and non-generative manipulations plus BlenD success. This is correlational evidence; a controlled ablation removing alpha-blending artifacts while preserving other cues (semantic inconsistencies or generative fingerprints) is needed to quantify degradation and support the 'primarily' qualifier.

    Authors: We agree that the current evidence is primarily correlational and that a direct ablation isolating blending artifacts from other potential cues would offer stronger causal support for the 'primarily' claim. Designing such an experiment without introducing new confounds is technically challenging. In the revised manuscript we will add a dedicated limitations subsection that explicitly discusses this gap, refine the abstract and hypothesis statements to avoid overclaiming, and include additional supporting analyses (e.g., further sensitivity tests on non-blending manipulations). We believe the combination of SBI sensitivity, BlenD's cross-dataset performance without synthetic deepfakes, and complementarity with blending-resilient models still provides meaningful mechanistic insight. revision: partial

  2. Referee: The manuscript should provide per-dataset AUROC breakdowns (ideally in a table) together with explicit details on SBI generation parameters, the real-to-blended sample ratio, and any post-processing steps.

    Authors: We agree that these details will improve transparency and allow readers to better evaluate robustness. The revised manuscript will include a new table reporting AUROC for each of the 15 individual datasets. We will also expand the experimental setup section to specify all SBI generation parameters, the exact real-to-blended training ratio, and any post-processing steps applied. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical hypothesis tested on external data without self-referential reduction

Full rationale

The paper advances the Alpha Blending Hypothesis through experimental validation on external deepfake datasets and non-generative manipulations, then introduces BlenD trained exclusively on real images augmented with self-blended images. No derivation chain, equation, or central claim reduces by construction to fitted parameters, self-citations, or renamed inputs; the sufficiency of blending artifacts is demonstrated via cross-dataset AUROC on 15 independent compositional sets, and complementarity with other models is shown empirically. The approach remains self-contained against external benchmarks with no load-bearing self-citation or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that SBI and non-generative manipulations sufficiently isolate alpha blending effects without introducing confounding factors; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Self-blended images (SBI) and non-generative manipulations isolate low-level compositing artifacts that match those in real deepfakes.
    Invoked to validate that detector sensitivity to SBI indicates primary reliance on blending shortcuts rather than semantic features.

pith-pipeline@v0.9.0 · 5499 in / 1286 out tokens · 36200 ms · 2026-05-12T05:05:53.568763+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 4 internal anchors

  1. [1]

    The DeepSpeak Dataset

    Sarah Barrington, Matyas Bohacek, and Hany Farid. DeepSpeak dataset v1.0. arXiv preprint arXiv:2408.05366, 2024

  2. [2]

    Perception encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Abdul Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Shang-Wen Li, Piotr Dollar, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. In The...

  3. [3]

    A multiresolution spline with application to image mosaics

    Peter J. Burt and Edward H. Adelson. A multiresolution spline with application to image mosaics. ACM Trans. Graph., 2(4):217–236, October 1983. ISSN 0730-0301. doi: 10.1145/245.247. URL https://doi.org/10.1145/245.247

  4. [4]

    Forensics adapter: Adapting CLIP for generalizable face forgery detection

    Xinjie Cui, Yuezun Li, Ao Luo, Jiaran Zhou, and Junyu Dong. Forensics adapter: Adapting CLIP for generalizable face forgery detection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19207–19217, 2025

  5. [5]

    Detecting lip-syncing deepfakes: Vision temporal transformer for analyzing mouth inconsistencies

    Soumyya Kanti Datta, Shan Jia, and Siwei Lyu. Detecting lip-syncing deepfakes: Vision temporal transformer for analyzing mouth inconsistencies. arXiv preprint arXiv:2504.01470, 2025

  6. [6]

    Retinaface: Single-shot multi-level face localisation in the wild

    Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5203–5212, 2020

  7. [7]

    The DeepFake Detection Challenge (DFDC) Dataset

    Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge (DFDC) dataset.arXiv preprint arXiv:2006.07397, 2020

  8. [8]

    Deepfakes Detection Dataset by Google & Jigsaw

    Nicholas Dufour, Andrew Gully, Per Karlsson, Alexey Victor Vorbyov, Thomas Leung, Jeremiah Childs, and Christoph Bregler. Deepfakes Detection Dataset by Google & Jigsaw. https://research.google/blog/contributing-data-to-deepfake-detection-research/, 2019

  9. [9]

    Liveportrait: Efficient portrait animation with stitching and retargeting control

    Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024

  10. [10]

    Lips don’t lie: A generalisable and robust approach to face forgery detection

    Alexandros Haliassos, Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. Lips don’t lie: A generalisable and robust approach to face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5039–5049, 2021

  11. [11]

    Leveraging real talking faces via self-supervision for robust forgery detection

    Alexandros Haliassos, Rodrigo Mira, Stavros Petridis, and Maja Pantic. Leveraging real talking faces via self-supervision for robust forgery detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14950–14962, 2022

  12. [12]

    FSBI: Deepfake detection with frequency enhanced self-blended images

    Ahmed Abul Hasanaath, Hamzah Luqman, Raed Katib, and Saeed Anwar. FSBI: Deepfake detection with frequency enhanced self-blended images. Image and Vision Computing, 154:105418, 2025

  13. [13]

    PolyGlotFake: A novel multilingual and multimodal deepfake dataset

    Yang Hou, Haitao Fu, Chunkai Chen, Zida Li, Haoyu Zhang, and Jianjun Zhao. PolyGlotFake: A novel multilingual and multimodal deepfake dataset. InInternational Conference on Pattern Recognition, pages 180–193. Springer, 2024

  14. [14]

    Exposing gan-generated faces using inconsistent corneal specular highlights

    Shu Hu, Yuezun Li, and Siwei Lyu. Exposing gan-generated faces using inconsistent corneal specular highlights. InICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2500–2504. IEEE, 2021

  15. [15]

    Model attribution of face-swap deepfake videos

    Shan Jia, Xin Li, and Siwei Lyu. Model attribution of face-swap deepfake videos. In 2022 IEEE International Conference on Image Processing (ICIP), pages 2356–2360. IEEE, 2022

  16. [16]

    FakeAVCeleb: A novel audio-video multimodal deepfake dataset

    Hasam Khalid, Shahroz Tariq, Minha Kim, and Simon S Woo. FakeAVCeleb: A novel audio-video multimodal deepfake dataset. arXiv preprint arXiv:2108.05080, 2021

  17. [17]

    Beyond spatial frequency: Pixel-wise temporal frequency-based deepfake video detection

    Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo, Seungryul Baek, and Jongwon Choi. Beyond spatial frequency: Pixel-wise temporal frequency-based deepfake video detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11198–11207, 2025

  18. [18]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  19. [19]

    KoDF: A large-scale korean deepfake detection dataset

    Patrick Kwon, Jaeseong You, Gyuhyeon Nam, Sungwoo Park, and Gyeongsu Chae. KoDF: A large-scale korean deepfake detection dataset. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10744–10753, 2021

  20. [20]

    Seeable: Soft discrepancies and bounded contrastive learning for exposing deepfakes

    Nicolas Larue, Ngoc-Son Vu, Vitomir Struc, Peter Peer, and Vassilis Christophides. Seeable: Soft discrepancies and bounded contrastive learning for exposing deepfakes. InProceedings of the IEEE/CVF international conference on computer vision, pages 21011–21021, 2023

  21. [21]

    Sok: Systematization and benchmarking of deepfake detectors in a unified framework

    Binh M Le, Jiwon Kim, Simon S Woo, Kristen Moore, Alsharif Abuadbba, and Shahroz Tariq. Sok: Systematization and benchmarking of deepfake detectors in a unified framework. In2025 IEEE 10th European Symposium on Security and Privacy (EuroS&P), pages 883–902. IEEE, 2025

  22. [22]

    Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision

    Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, and Weiwei Xing. Latentsync: Taming audio-conditioned latent diffusion models for lip sync with syncnet supervision. arXiv preprint arXiv:2412.09262, 2024

  23. [23]

    Advancing high fidelity identity swapping for forgery detection

    Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. Advancing high fidelity identity swapping for forgery detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5074–5083, 2020

  24. [24]

    In ictu oculi: Exposing ai generated fake face videos by detecting eye blinking

    Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In ictu oculi: Exposing ai generated fake face videos by detecting eye blinking. InIEEE International Workshop on Information Forensics and Security (WIFS), 2018

  25. [25]

    Celeb-DF: A large-scale challenging dataset for deepfake forensics

    Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3207–3216, 2020

  26. [26]

    Celeb-DF++: A large-scale challenging video deepfake benchmark for generalizable forensics

    Yuezun Li, Delong Zhu, Xinjie Cui, and Siwei Lyu. Celeb-DF++: A large-scale challenging video deepfake benchmark for generalizable forensics. arXiv preprint arXiv:2507.18015, 2025

  27. [27]

    Fake it till you make it: Curricular dynamic forgery augmentations towards general deepfake detection

    Yuzhen Lin, Wentang Song, Bin Li, Yuezun Li, Jiangqun Ni, Han Chen, and Qiushi Li. Fake it till you make it: Curricular dynamic forgery augmentations towards general deepfake detection. InEuropean conference on computer vision, pages 104–122. Springer, 2024

  28. [28]

    Lips are lying: Spotting the temporal inconsistency between audio and visual in lip-syncing deepfakes

    Weifeng Liu, Tianyi She, Jiawei Liu, Boheng Li, Dongyu Yao, and Run Wang. Lips are lying: Spotting the temporal inconsistency between audio and visual in lip-syncing deepfakes. Advances in Neural Information Processing Systems, 37:91131–91155, 2024

  29. [29]

    Generalizing face forgery detection with high-frequency features

    Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. Generalizing face forgery detection with high-frequency features. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16326, 2021

  30. [30]

    Exploiting visual artifacts to expose deepfakes and face manipulations

    Falko Matern, Christian Riess, and Marc Stamminger. Exploiting visual artifacts to expose deepfakes and face manipulations. In2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), pages 83–92, 2019

  31. [31]

    Diff2lip: Audio conditioned diffusion models for lip-synchronization

    Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, and Abhinav Shrivastava. Diff2lip: Audio conditioned diffusion models for lip-synchronization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5292–5302, January 2024

  32. [32]

    Deconvolution and checkerboard artifacts

    Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016. URLhttp://distill.pub/2016/deconv-checkerboard/

  33. [33]

    Towards universal fake image detectors that generalize across generative models

    Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24480–24489, 2023

  34. [34]

    Poisson image editing

    Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Trans. Graph., 22(3):313–318, July 2003. ISSN 0730-0301. doi: 10.1145/882262.882269. URL https://doi.org/10.1145/882262.882269

  35. [35]

    Parameter-efficient tuning on layer normalization for pre-trained language models

    Wang Qi, Yu-Ping Ruan, Yuan Zuo, and Taihao Li. Parameter-efficient tuning on layer normalization for pre-trained language models. arXiv preprint arXiv:2211.08682, 2022

  36. [36]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PmLR, 2021

  37. [37]

    Faceforensics++: Learning to detect manipulated facial images

    Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019

  38. [38]

    Towards real-world deepfake detection: A diverse in-the-wild dataset of forgery faces

    Junyu Shi, Minghui Li, Junguo Zuo, Zhifei Yu, Yipeng Lin, Shengshan Hu, Ziqi Zhou, Yechao Zhang, Wei Wan, Yinzhe Xu, et al. Towards real-world deepfake detection: A diverse in-the-wild dataset of forgery faces. arXiv preprint arXiv:2510.08067, 2025

  39. [39]

    Detecting deepfakes with self-blended images

    Kaede Shiohara and Toshihiko Yamasaki. Detecting deepfakes with self-blended images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18720–18729, 2022

  40. [40]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025

  41. [41]

    Cyclical learning rates for training neural networks

    Leslie N Smith. Cyclical learning rates for training neural networks. InProceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017

  42. [42]

    Illumination enlightened spatial-temporal inconsistency for deepfake video detection

    Kaiyue Tian, Chen Chen, Yichao Zhou, and Xiyuan Hu. Illumination enlightened spatial-temporal inconsistency for deepfake video detection. In 2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2024

  43. [43]

    FSFM: A generalizable face security foundation model via self-supervised facial representation learning

    Gaojian Wang, Feng Lin, Tong Wu, Zhenguang Liu, Zhongjie Ba, and Kui Ren. FSFM: A generalizable face security foundation model via self-supervised facial representation learning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24364– 24376, 2025

  44. [44]

    Scalable face security vision foundation model for deepfake, diffusion, and spoofing detection

    Gaojian Wang, Feng Lin, Tong Wu, Zhisheng Yan, and Kui Ren. Scalable face security vision foundation model for deepfake, diffusion, and spoofing detection. arXiv preprint arXiv:2510.10663, 2025

  45. [45]

    Scaling laws for deepfake detection

    Wenhao Wang, Longqi Cai, Taihong Xiao, Yuxiao Wang, and Ming-Hsuan Yang. Scaling laws for deepfake detection. arXiv preprint arXiv:2510.16320, 2025

  46. [46]

    Identity-driven multimedia forgery detection via reference assistance

    Junhao Xu, Jingjing Chen, Xue Song, Feng Han, Haijun Shan, and Yu-Gang Jiang. Identity-driven multimedia forgery detection via reference assistance. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3887–3896, 2024

  47. [47]

    Learning spatiotemporal inconsistency via thumbnail layout for face deepfake detection

    Yuting Xu, Jian Liang, Lijun Sheng, and Xiao-Yu Zhang. Learning spatiotemporal inconsistency via thumbnail layout for face deepfake detection. International Journal of Computer Vision, 132(12):5663–5680, 2024

  48. [48]

    UCF: Uncovering common features for generalizable deepfake detection

    Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. UCF: Uncovering common features for generalizable deepfake detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 22412–22423, 2023

  49. [49]

    DeepfakeBench: A comprehensive benchmark of deepfake detection

    Zhiyuan Yan, Yong Zhang, Xinhang Yuan, Siwei Lyu, and Baoyuan Wu. DeepfakeBench: A comprehensive benchmark of deepfake detection. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 4534–4565. Curran Associates, Inc., 2023. URL https://proceedings.neurips...

  50. [50]

    Transcending forgery specificity with latent space augmentation for generalizable deepfake detection

    Zhiyuan Yan, Yuhao Luo, Siwei Lyu, Qingshan Liu, and Baoyuan Wu. Transcending forgery specificity with latent space augmentation for generalizable deepfake detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8984–8994, 2024

  51. [51]

    Orthogonal subspace decomposition for generalizable AI-generated image detection

    Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable AI-generated image detection. InProceedings of the International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=GFpjO8S8Po

  52. [52]

    Exposing deep fakes using inconsistent head poses

    Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8261–8265. IEEE, 2019

  53. [53]

    Deepfake detection that generalizes across benchmarks

    Andrii Yermakov, Jan Cech, Jiri Matas, and Mario Fritz. Deepfake detection that generalizes across benchmarks. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2026

  54. [54]

    HelloMeme: Integrating spatial knitting attentions to embed high-level and fidelity-rich conditions in diffusion models

    Shengkai Zhang, Nianhong Jiao, Tian Li, Chaojie Yang, Chenhui Xue, Boya Niu, and Jun Gao. HelloMeme: Integrating spatial knitting attentions to embed high-level and fidelity-rich conditions in diffusion models. arXiv preprint arXiv:2410.22901, 2024

  55. [55]

    Detecting and simulating artifacts in GAN fake images

    Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and simulating artifacts in GAN fake images. In2019 IEEE international workshop on information forensics and security (WIFS), pages 1–6. IEEE, 2019

  56. [56]

    Memo: Memory-guided diffusion for expressive talking video generation

    Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, and Shuicheng Yan. Memo: Memory-guided diffusion for expressive talking video generation. arXiv preprint arXiv:2412.04448, 2024

  57. [57]

    FreqBlender: Enhancing deepfake detection by blending frequency knowledge

    Jiaran Zhou, Yuezun Li, Baoyuan Wu, Bin Li, Junyu Dong, et al. FreqBlender: Enhancing deepfake detection by blending frequency knowledge. Advances in Neural Information Processing Systems, 37:44965–44988, 2024

  58. [58]

    Face forensics in the wild

    Tianfei Zhou, Wenguan Wang, Zhiyuan Liang, and Jianbing Shen. Face forensics in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5778–5788, 2021

  59. [59]

    Face forgery detection by 3D decomposition

    Xiangyu Zhu, Hao Wang, Hongyan Fei, Zhen Lei, and Stan Z Li. Face forgery detection by 3D decomposition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2929–2939, 2021