FiSeR: Fine-Grained Source Representations for Cross-Domain AI Image Detection
Pith reviewed 2026-06-28 19:12 UTC · model grok-4.3
The pith
Hierarchical contrastive learning preserves generator identities to stabilize cross-domain synthetic image detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A hierarchical contrastive framework jointly optimizes a coarse objective that pushes natural images away from all synthetic images and a fine objective that pulls synthetic images together when they share the same generator; the resulting representations keep the natural-synthetic margin intact even after the test distribution shifts to new generators.
What carries the argument
Hierarchical contrastive learning with a coarse natural-versus-synthetic term and a fine generator-identity term among synthetic images.
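The two-level objective can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function names, the multi-positive (supervised-contrastive) form of InfoNCE, and the temperature of 0.1 are assumptions. The weight `lam=0.5` follows the value the simulated rebuttal gives for λ.

```python
import numpy as np

def multi_positive_info_nce(z, labels, temperature=0.1):
    """Multi-positive InfoNCE: samples sharing a label are positives.

    Hypothetical helper; temperature=0.1 is an assumed default.
    """
    n = z.shape[0]
    if n < 2:
        return 0.0
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    logits = (z @ z.T) / temperature
    not_self = ~np.eye(n, dtype=bool)
    masked = np.where(not_self, logits, -np.inf)  # exclude self-pairs
    row_max = masked.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(
        np.exp(masked - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = masked - log_denominator
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0  # anchors without any positive are skipped
    if not has_pos.any():
        return 0.0
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_sum[has_pos] / n_pos[has_pos]).mean())

def hierarchical_loss(z, is_synthetic, generator_id, lam=0.5, temperature=0.1):
    """Coarse natural-vs-synthetic term plus lam times a fine generator term."""
    z = np.asarray(z, dtype=float)
    syn = np.asarray(is_synthetic, dtype=bool)
    gen = np.asarray(generator_id)
    # Coarse level: all samples, labels are natural (False) or synthetic (True).
    coarse = multi_positive_info_nce(z, syn, temperature)
    # Fine level: synthetic samples only, labels are generator identities.
    fine = multi_positive_info_nce(z[syn], gen[syn], temperature)
    return coarse + lam * fine, coarse, fine
```

Calling `hierarchical_loss(z, is_synthetic, generator_id, lam=0.0)` leaves only the coarse term, which is one way to run the coarse-only comparison described under "What would settle it".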
If this is right
- Average cross-domain AUROC rises by 10.22 points relative to DIRE on Chameleon, AIGIBench, Community Forensics, and GenImage.
- Freezing the learned backbone and fitting an SVM on ten labeled samples per class raises AUROC by 10.64 points on AIGIBench and 17.41 points on Chameleon, averaged over twelve detectors.
- The learned features keep natural and synthetic clusters separable while also encoding generator identity.
- The decision criterion becomes less dependent on training-domain artifacts.
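The few-shot protocol in the second bullet (frozen features, an SVM head, ten labeled samples per class) might look roughly like the sketch below. It assumes features have already been extracted by a frozen backbone and uses scikit-learn; the linear kernel, `C=1.0`, the feature standardization, and both function names are assumptions, since the summary only says "SVM".

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def sample_support(labels, shots, rng):
    """Pick `shots` random indices per class (hypothetical helper)."""
    picks = [
        rng.choice(np.flatnonzero(labels == c), size=shots, replace=False)
        for c in np.unique(labels)
    ]
    return np.concatenate(picks)

def few_shot_svm_auroc(support_x, support_y, query_x, query_y):
    """Fit an SVM head on frozen features and score the query set by AUROC."""
    # Kernel and C are assumed; the source does not specify them.
    head = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
    head.fit(support_x, support_y)
    scores = head.decision_function(query_x)  # larger means more "synthetic"
    return roc_auc_score(query_y, scores)

if __name__ == "__main__":
    # Gaussian stand-ins for frozen-backbone embeddings; not real data.
    rng = np.random.default_rng(0)
    x = np.vstack([rng.normal(0.0, 1.0, (200, 16)), rng.normal(2.0, 1.0, (200, 16))])
    y = np.array([0] * 200 + [1] * 200)  # 0 = natural, 1 = synthetic
    support = sample_support(y, shots=10, rng=rng)
    query = np.setdiff1d(np.arange(len(y)), support)
    print(few_shot_svm_auroc(x[support], y[support], x[query], y[query]))
```

Only the head is fitted on the target-domain samples; the backbone that produced the features is never updated.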
Where Pith is reading between the lines
- The same two-level contrastive pattern could be applied to other source-diverse detection problems such as video deepfakes or audio synthesis.
- Explicitly modeling generator identity may reduce the need for explicit domain-adaptation modules in future detectors.
- If generator labels are unavailable, clustering synthetic images by learned features might serve as a proxy for the fine objective.
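The last speculation, using clusters as a stand-in for missing generator labels, could be prototyped as follows. This is a hypothetical sketch, not something the paper reports: the choice of k-means, the number of clusters, and the function name are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_generator_labels(synthetic_features, n_clusters, seed=0):
    """Assign each synthetic image a cluster id to use as a pseudo generator label."""
    feats = np.asarray(synthetic_features, dtype=float)
    # L2-normalize so Euclidean k-means roughly follows cosine similarity.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(feats)
```

The returned ids could then replace true generator identities in the fine contrastive term. Whether such pseudo-labels carry enough generator signal is an open question the summary does not address.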
Load-bearing premise
That the diversity of generators used to create training synthetic images supplies stable identity signals that remain useful when the test generators are different.
What would settle it
Training the same backbone on WildFake with only the coarse contrastive term and measuring whether cross-domain AUROC on the four held-out benchmarks falls to the level of the DIRE baseline.
Original abstract
Real-world synthetic image detectors often generalize poorly under domain shift despite strong in-domain performance. Using unsupervised UMAP projections, we find that natural and synthetic features remain partially separable on unseen datasets, yet performance still drops, suggesting that the classification head overfits to training-domain artifacts. Therefore, the key is to learn more transferable representations so that the decision criterion is more stable and robust to domain shifts. Based on the structural fact that synthetic images are produced by diverse generators, we propose a hierarchical contrastive learning framework that improves the separability between natural and synthetic images while preserving generator identity information. It jointly optimizes (i) a coarse contrastive objective between natural and synthetic images and (ii) a fine contrastive objective among synthetic images using generator identities. Trained on WildFake, our method achieves an average AUROC gain of +10.22 on cross-domain evaluation over Chameleon, AIGIBench, Community Forensics, and GenImage under the same settings as the strong baseline DIRE. For few-shot adaptation, we freeze the backbone and fit an SVM head on 10 labeled samples per class, improving AUROC by +10.64 on AIGIBench and +17.41 on Chameleon, averaged over 12 widely used detectors. Our code is publicly available at: https://github.com/heyongxin233/FiSeR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FiSeR, a hierarchical contrastive learning framework for cross-domain AI-generated image detection. It jointly optimizes a coarse contrastive objective separating natural from synthetic images and a fine contrastive objective among synthetic images that preserves generator identities, motivated by UMAP observations of partial separability. Trained on WildFake, the method reports an average +10.22 AUROC gain over the DIRE baseline on cross-domain tests across Chameleon, AIGIBench, Community Forensics, and GenImage, plus +10.64 and +17.41 AUROC gains in few-shot SVM adaptation on two datasets when freezing the backbone.
Significance. If the reported cross-domain gains are reproducible and attributable to the hierarchical contrastive terms, the work would meaningfully advance synthetic image detection by demonstrating that generator-identity preservation can stabilize representations against domain shift. The public code release supports direct verification of the empirical claims.
major comments (3)
- [Abstract and §3] Abstract and §3 (Method): the central claim that the fine contrastive objective preserves generator identity to improve transferability lacks the explicit loss equations and weighting hyperparameters; without these, it is impossible to verify whether the +10.22 AUROC gain follows from the stated hierarchical structure or from other unstated implementation choices.
- [§4] §4 (Experiments): the reported average AUROC gain of +10.22 is presented without per-dataset breakdowns, standard deviations across runs, or ablation removing the fine contrastive term, which is load-bearing for attributing the improvement to generator-identity preservation rather than the coarse term or training data alone.
- [§4.2] §4.2 (Few-shot adaptation): the +10.64 / +17.41 AUROC figures for SVM heads on 10 samples per class are given without controls for backbone choice or comparison to other representation-learning baselines, undermining the claim that the learned representations are the decisive factor.
minor comments (2)
- [Abstract] The abstract lists four evaluation datasets but does not name the exact train/test splits or generator coverage within WildFake; adding this would improve reproducibility.
- Figure captions for UMAP projections should include the exact feature extractor and projection parameters used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight areas where additional detail will improve clarity and verifiability. We will revise the manuscript to incorporate the requested information on loss formulations, experimental breakdowns, and controls. Point-by-point responses follow.
- Referee: [Abstract and §3] Abstract and §3 (Method): the central claim that the fine contrastive objective preserves generator identity to improve transferability lacks the explicit loss equations and weighting hyperparameters; without these, it is impossible to verify whether the +10.22 AUROC gain follows from the stated hierarchical structure or from other unstated implementation choices.
  Authors: We agree that explicit equations are needed for reproducibility. The manuscript describes the objectives at a high level but omits the precise formulations. In revision we will insert the full InfoNCE equations for both the coarse (natural vs. synthetic) and fine (generator-identity) terms, together with the balancing hyperparameter λ (set to 0.5). This will make the hierarchical structure and its contribution to the reported gains fully verifiable.
  Revision: yes
- Referee: [§4] §4 (Experiments): the reported average AUROC gain of +10.22 is presented without per-dataset breakdowns, standard deviations across runs, or ablation removing the fine contrastive term, which is load-bearing for attributing the improvement to generator-identity preservation rather than the coarse term or training data alone.
  Authors: We accept that the current aggregate reporting limits attribution. The revised §4 will include a per-dataset table for all four test sets, standard deviations over three random seeds, and an ablation that removes the fine contrastive term while keeping the coarse term and training data identical. These additions will isolate the contribution of generator-identity preservation.
  Revision: yes
- Referee: [§4.2] §4.2 (Few-shot adaptation): the +10.64 / +17.41 AUROC figures for SVM heads on 10 samples per class are given without controls for backbone choice or comparison to other representation-learning baselines, undermining the claim that the learned representations are the decisive factor.
  Authors: We will clarify that the backbone is the same ResNet-50 used by DIRE and will add comparisons against standard contrastive pretraining (e.g., SimCLR) and supervised contrastive learning on the identical WildFake data. Due to compute limits we cannot test every possible backbone, but the added controls will better substantiate that the hierarchical objectives drive the few-shot gains.
  Revision: partial
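For orientation, one plausible shape for the equations promised in the first response is written out below. It is a sketch under stated assumptions, not the authors' formulation: the multi-positive InfoNCE form, the temperature τ, and the set notation are all assumed, and only the weight λ = 0.5 comes from the rebuttal. Here z_i is an L2-normalized embedding, y_i ∈ {0, 1} marks natural versus synthetic, g_i is the generator identity of a synthetic sample, and S is the set of synthetic samples in a batch of size N.

```latex
% Assumed notation: P_c(i) = { p \neq i : y_p = y_i },
%                   P_f(i) = { p \in S \setminus \{i\} : g_p = g_i }.
\mathcal{L}_{\text{coarse}}
  = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{|P_c(i)|} \sum_{p \in P_c(i)}
    \log \frac{\exp(z_i^{\top} z_p / \tau)}
              {\sum_{a \neq i} \exp(z_i^{\top} z_a / \tau)}

\mathcal{L}_{\text{fine}}
  = -\frac{1}{|S|} \sum_{i \in S} \frac{1}{|P_f(i)|} \sum_{p \in P_f(i)}
    \log \frac{\exp(z_i^{\top} z_p / \tau)}
              {\sum_{a \in S \setminus \{i\}} \exp(z_i^{\top} z_a / \tau)}

\mathcal{L} = \mathcal{L}_{\text{coarse}} + \lambda \, \mathcal{L}_{\text{fine}},
\qquad \lambda = 0.5
```

Setting λ = 0 in this form gives the coarse-only variant that the ablation promised in the second response would compare against.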
Circularity Check
No significant circularity; empirical results on held-out data
The paper proposes a hierarchical contrastive framework (coarse natural-vs-synthetic plus fine generator-identity terms) motivated by the structural diversity of generators, then reports measured AUROC gains on cross-domain test sets (WildFake training, evaluation on Chameleon/AIGIBench/etc.) and few-shot SVM adaptation. These are falsifiable empirical outcomes under stated experimental protocols, not quantities obtained by fitting a parameter to a subset and relabeling it a prediction, nor by self-definitional equations, nor by load-bearing self-citations whose content reduces to the present claim. The derivation chain consists of standard contrastive losses plus dataset splits; no step collapses to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic images are produced by diverse generators, enabling preservation of generator identity information to improve robustness.