Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation
Pith reviewed 2026-06-30 21:15 UTC · model grok-4.3
The pith
Vision Mamba models exhibit competitive efficiency yet lower accuracy and weaker generalization than CNNs, ViTs, and VLMs when detecting AI-generated images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision Mamba architectures, when adapted for binary real-versus-synthetic classification, achieve inference speeds that surpass most transformer baselines while delivering accuracy that remains below the best CNN and VLM detectors; the gap widens on out-of-distribution generators, showing that state-space visual models can contribute to detection pipelines but require additional adaptation to match established methods in reliability.
What carries the argument
Vision Mamba, a selective state-space model backbone for image classification, evaluated here as a drop-in feature extractor for distinguishing authentic from AI-generated images.
If this is right
- Mamba-based detectors can reduce computational cost in large-scale screening systems that must process millions of images daily.
- The observed accuracy shortfall implies that pure Mamba pipelines may need supplementary modules such as frequency-domain filters or ensemble heads to reach deployment thresholds.
- Cross-generator evaluation shows that training on a narrow set of synthetic sources produces brittle detectors, regardless of backbone architecture.
- Efficiency gains position Vision Mamba as a candidate for on-device or edge-based detection where latency matters more than marginal accuracy.
Where Pith is reading between the lines
- Hybrid architectures that replace only the attention layers of a ViT with Mamba blocks could combine the strengths of both without full retraining.
- The speed advantage may prove decisive in video or live-stream settings where frame-by-frame detection is required.
- Transfer from Mamba models pretrained on medical or satellite imagery could supply better initial features for the detection task than ImageNet weights alone.
Load-bearing premise
The chosen datasets, generative models, and evaluation metrics sufficiently represent real-world conditions and capture generalizability for AI-generated image detection.
What would settle it
Retraining and testing the same Mamba variants on a fresh dataset of images from a diffusion model released after the paper's experiments, using the identical train-test split protocol, would show whether the reported accuracy gap persists.
Figures
read the original abstract
In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a benchmark study evaluating several Vision Mamba variants for the task of distinguishing authentic images from AI-generated ones. It compares these models against representative CNNs, Vision Transformers, and VLM-based detectors across multiple datasets and generative sources, reporting on accuracy, efficiency, and cross-model generalizability, and concludes that Mamba architectures show both promise and current limitations for this application.
Significance. If the empirical comparisons prove robust and reproducible, the work would supply a useful reference point for selecting efficient sequence-modeling backbones in synthetic-media detection pipelines, particularly where computational cost is a concern relative to transformer-based alternatives.
major comments (2)
- [Abstract] The abstract states that the study benchmarks 'multiple Vision Mamba variants' and reports 'key metrics such as accuracy, efficiency, and generalizability,' yet no quantitative results, tables, or statistical details (e.g., means, standard deviations, or significance tests) appear in the provided text; without these, the central claim of 'promise and current limitations' cannot be evaluated.
- [Abstract / Experimental Setup] The weakest assumption identified—that the chosen datasets and generative models capture real-world generalizability—is load-bearing for the paper's conclusions, but the manuscript supplies no information on data splits, number of runs, or out-of-distribution test sets that would allow readers to assess this assumption.
minor comments (1)
- [Introduction] The introduction lists recent architectures but does not cite the original Mamba or Vision Mamba papers; adding these references would improve context.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our benchmark study of Vision Mamba for AI-generated image detection. We address each major comment below and will revise the manuscript to improve clarity and completeness.
read point-by-point responses
-
Referee: [Abstract] The abstract states that the study benchmarks 'multiple Vision Mamba variants' and reports 'key metrics such as accuracy, efficiency, and generalizability,' yet no quantitative results, tables, or statistical details (e.g., means, standard deviations, or significance tests) appear in the provided text; without these, the central claim of 'promise and current limitations' cannot be evaluated.
Authors: We agree that the abstract would be strengthened by including key quantitative highlights. In the revised version, we will add specific results such as average accuracies (with standard deviations where multiple runs were performed), efficiency comparisons (e.g., FLOPs or inference time), and a brief note on generalizability trends to better support the claims of promise and limitations. revision: yes
-
Referee: [Abstract / Experimental Setup] The weakest assumption identified—that the chosen datasets and generative models capture real-world generalizability—is load-bearing for the paper's conclusions, but the manuscript supplies no information on data splits, number of runs, or out-of-distribution test sets that would allow readers to assess this assumption.
Authors: We acknowledge that explicit details on experimental reproducibility are essential. While the Experimental Setup section describes the datasets and generative sources, we will expand it in revision to include precise train/validation/test splits, the number of independent runs with reported means and standard deviations, and any out-of-distribution evaluations to allow readers to better evaluate the generalizability claims. revision: yes
Circularity Check
No significant circularity; purely empirical benchmark
full rationale
The paper performs a systematic empirical comparison of Vision Mamba variants against CNNs, ViTs, and VLMs on multiple datasets for AI-generated image detection. It reports accuracy, efficiency, and generalizability metrics from direct experiments with no equations, derivations, fitted parameters relabeled as predictions, or load-bearing self-citations. The abstract and described scope frame the work as an external benchmark study whose results are falsifiable against held-out data and independent implementations. No self-definitional, ansatz-smuggling, or renaming patterns appear. This matches the default expectation for non-circular empirical papers.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Create with firefly generative ai.https://www.adobe.com/products/firefly.html
Adobe, 2023. Create with firefly generative ai.https://www.adobe.com/products/firefly.html. Accessed: 2024-10-10
2023
-
[2]
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Brock, A., 2018. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[3]
Chai, L., Bau, D., Lim, S.N., Isola, P., 2020. What makes fake images detectable? understanding properties that generalize, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16, Springer. pp. 103–120
2020
-
[4]
Antifakeprompt: Prompt-tuned vision-language models are fake image detectors
Chang, Y.M., Yeh, C., Chiu, W.C., Yu, N., 2023. Antifakeprompt: Prompt-tuned vision-language models are fake image detectors. arXiv preprint arXiv:2310.17419
-
[5]
Learning to see in the dark, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp
Chen, C., Chen, Q., Xu, J., Koltun, V., 2018. Learning to see in the dark, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3291–3300
2018
-
[6]
Photographic image synthesis with cascaded refinement networks, in: Proceedings of the IEEE international conference on computer vision, pp
Chen, Q., Koltun, V., 2017. Photographic image synthesis with cascaded refinement networks, in: Proceedings of the IEEE international conference on computer vision, pp. 1511–1520
2017
-
[7]
Chen,Y.,Zhang,L.,Niu,Y.,Chen,P.,Tan,L.,Zhou,J.,2024. Guidedandfused:Efficientfrozenclip-vitwithfeatureguidanceandmulti-stage feature fusion for generalizable deepfake detection. arXiv preprint arXiv:2408.13697
-
[8]
Stargan: Unified generative adversarial networks for multi-domain image-to- image translation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp
Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J., 2018. Stargan: Unified generative adversarial networks for multi-domain image-to- image translation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797
2018
-
[9]
Xception:Deeplearningwithdepthwiseseparableconvolutions,in:ProceedingsoftheIEEEconferenceoncomputervision and pattern recognition, pp
Chollet,F.,2017. Xception:Deeplearningwithdepthwiseseparableconvolutions,in:ProceedingsoftheIEEEconferenceoncomputervision and pattern recognition, pp. 1251–1258
2017
-
[10]
Cozzolino,D.,Poggi,G.,Corvi,R.,Nießner,M.,Verdoliva,L.,2023. Raisingthebarofai-generatedimagedetectionwithclip. arXivpreprint arXiv:2312.00195
-
[11]
Second-order attention network for single image super-resolution, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp
Dai, T., Cai, J., Zhang, Y., Xia, S.T., Zhang, L., 2019. Second-order attention network for single image super-resolution, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11065–11074
2019
-
[12]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Dai,W.,Li,J.,Li,D.,Tiong,A.M.H.,Zhao,J.,Wang,W.,Li,B.,Fung,P.,Hoi,S.,2023.Instructblip:Towardsgeneral-purposevision-language models with instruction tuning.arXiv:2305.06500. : Preprint submitted to Elsevier Page 21 of 24
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Dao, T., Gu, A., 2024. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Vision Transformers Need Registers
Darcet, T., Oquab, M., Mairal, J., Bojanowski, P., 2023. Vision transformers need registers. arXiv preprint arXiv:2309.16588
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Diffusion models beat gans on image synthesis
Dhariwal, P., Nichol, A., 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794
2021
-
[16]
Cogview2:Fasterandbettertext-to-imagegenerationviahierarchicaltransformers
Ding,M.,Zheng,W.,Hong,W.,Tang,J.,2022. Cogview2:Fasterandbettertext-to-imagegenerationviahierarchicaltransformers. Advances in Neural Information Processing Systems 35, 16890–16902
2022
-
[17]
Fusion-mambaforcross-modalityobjectdetection
Dong,W.,Zhu,H.,Lin,S.,Luo,X.,Shen,Y.,Liu,X.,Zhang,J.,Guo,G.,Zhang,B.,2024. Fusion-mambaforcross-modalityobjectdetection. arXiv preprint arXiv:2404.09146
-
[18]
A synthetic data generation system based on the variational-autoencoder technique and the linked data paradigm
Dos Santos, R., Aguilar, J., 2024. A synthetic data generation system based on the variational-autoencoder technique and the linked data paradigm. Progress in Artificial Intelligence , 1–15
2024
-
[19]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, A., 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[20]
Generativeadversarialnets
Goodfellow,I.,Pouget-Abadie,J.,Mirza,M.,Xu,B.,Warde-Farley,D.,Ozair,S.,Courville,A.,Bengio,Y.,2014. Generativeadversarialnets. Advances in neural information processing systems 27
2014
-
[21]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Gu, A., Dao, T., 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Mambavision: Ahybridmamba-transformer visionbackbone,in: Proceedingsofthe ComputerVisionand Pattern Recognition Conference, pp
Hatamizadeh,A., Kautz,J.,2025. Mambavision: Ahybridmamba-transformer visionbackbone,in: Proceedingsofthe ComputerVisionand Pattern Recognition Conference, pp. 25261–25270
2025
-
[23]
Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778
2016
-
[24]
Ho,J.,Jain,A.,Abbeel,P.,2020.Denoisingdiffusionprobabilisticmodels.Advancesinneuralinformationprocessingsystems33,6840–6851
2020
-
[25]
Localmamba: Visual state space model with windowed selective scan
Huang, T., Pei, X., You, S., Wang, F., Qian, C., Xu, C., 2024a. Localmamba: Visual state space model with windowed selective scan. arXiv preprint arXiv:2403.09338
-
[26]
Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant
Huang, Z., Xia, B., Lin, Z., Mou, Z., Yang, W., 2024b. Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant. arXiv preprint arXiv:2408.10072
-
[27]
Synthetic face discrimination via learned image compression
Iliopoulou, S., Tsinganos, P., Ampeliotis, D., Skodras, A., 2024. Synthetic face discrimination via learned image compression. Algorithms 17, 375
2024
-
[28]
Evolutionofdetectionperformancethroughout the online lifespan of synthetic images, in: European Conference on Computer Vision, Springer
Karageogiou,D.,Bammey,Q.,Porcellini,V.,Goupil,B.,Teyssou,D.,Papadopoulos,S.,2024. Evolutionofdetectionperformancethroughout the online lifespan of synthetic images, in: European Conference on Computer Vision, Springer. pp. 400–417
2024
-
[29]
Progressive Growing of GANs for Improved Quality, Stability, and Variation
Karras, T., 2017. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[30]
Alias-freegenerativeadversarialnetworks
Karras,T.,Aittala,M.,Laine,S.,Härkönen,E.,Hellsten,J.,Lehtinen,J.,Aila,T.,2021. Alias-freegenerativeadversarialnetworks. Advances in neural information processing systems 34, 852–863
2021
-
[31]
Astyle-basedgeneratorarchitectureforgenerativeadversarialnetworks,in:ProceedingsoftheIEEE/CVF conference on computer vision and pattern recognition, pp
Karras,T.,Laine,S.,Aila,T.,2019. Astyle-basedgeneratorarchitectureforgenerativeadversarialnetworks,in:ProceedingsoftheIEEE/CVF conference on computer vision and pattern recognition, pp. 4401–4410
2019
-
[32]
Harnessing the power of large vision language models for synthetic image detection
Keita, M., Hamidouche, W., Bougueffa, H., Hadid, A., Taleb-Ahmed, A., 2024. Harnessing the power of large vision language models for synthetic image detection. arXiv preprint arXiv:2404.02726
-
[33]
Bi-lora:Avision-languageapproach for synthetic image detection
Keita,M.,Hamidouche,W.,BougueffaEutamene,H.,Taleb-Ahmed,A.,Camacho,D.,Hadid,A.,2025. Bi-lora:Avision-languageapproach for synthetic image detection. Expert Systems 42, e13829
2025
-
[34]
Texturecrop: Enhancing synthetic image detection through texture-based cropping
Konstantinidou, D., Koutlis, C., Papadopoulos, S., 2024. Texturecrop: Enhancing synthetic image detection through texture-based cropping. arXiv preprint arXiv:2407.15500
-
[35]
Diverse image synthesis from semantic layouts via conditional imle, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp
Li, K., Zhang, T., Malik, J., 2019. Diverse image synthesis from semantic layouts via conditional imle, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4220–4229
2019
-
[36]
Mamba-nd:Selectivestatespacemodelingformulti-dimensionaldata
Li,S.,Singh,H.,Grover,A.,2024. Mamba-nd:Selectivestatespacemodelingformulti-dimensionaldata. arXivpreprintarXiv:2402.05892
-
[37]
Forgery-awareadaptivetransformerforgeneralizablesyntheticimagedetection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Liu,H.,Tan,Z.,Tan,C.,Wei,Y.,Wang,J.,Zhao,Y.,2024a. Forgery-awareadaptivetransformerforgeneralizablesyntheticimagedetection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10770–10780
-
[38]
Pseudo numerical methods for diffusion models on manifolds
Liu, L., Ren, Y., Lin, Z., Zhao, Z., 2022. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778
-
[39]
VMamba: Visual State Space Model
Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Liu, Y., 2024b. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166
work page internal anchor Pith review Pith/arXiv arXiv
-
[40]
U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation
Ma, J., Li, F., Wang, B., 2024. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
Midjourney v5.https://www.midjourney.com
MidJourney, 2023. Midjourney v5.https://www.midjourney.com. Accessed: 2024-10-10
2023
-
[42]
Detecting gan generated fake images using co-occurrence matrices
Nataraj, L., Mohammed, T.M., Chandrasekaran, S., Flenner, A., Bappy, J.H., Roy-Chowdhury, A.K., Manjunath, B., 2019. Detecting gan generated fake images using co-occurrence matrices. arXiv preprint arXiv:1903.06836
-
[43]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Nichol,A.,Dhariwal,P.,Ramesh,A.,Shyam,P.,Mishkin,P.,McGrew,B.,Sutskever,I.,Chen,M.,2021. Glide:Towardsphotorealisticimage generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[44]
Improved denoising diffusion probabilistic models, in: International Conference on Machine Learning, PMLR
Nichol, A.Q., Dhariwal, P., 2021. Improved denoising diffusion probabilistic models, in: International Conference on Machine Learning, PMLR. pp. 8162–8171
2021
-
[45]
Towards universal fake image detectors that generalize across generative models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Ojha, U., Li, Y., Lee, Y.J., 2023. Towards universal fake image detectors that generalize across generative models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24480–24489
2023
-
[46]
Dall-e 3.https://openai.com/dall-e-3
OpenAI, 2023. Dall-e 3.https://openai.com/dall-e-3. Accessed: 2024-10-10
2023
-
[47]
Theaffectivenatureofai-generatednewsimages:Impact on visual journalism, in: 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE
Paik,S.,Bonna,S.,Novozhilova,E.,Gao,G.,Kim,J.,Wijaya,D.,Betke,M.,2023. Theaffectivenatureofai-generatednewsimages:Impact on visual journalism, in: 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE. pp. 1–8
2023
-
[48]
Gaugan: semantic image synthesis with spatially adaptive normalization, in: ACM SIGGRAPH 2019 Real-Time Live!, pp
Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y., 2019. Gaugan: semantic image synthesis with spatially adaptive normalization, in: ACM SIGGRAPH 2019 Real-Time Live!, pp. 1–1. : Preprint submitted to Elsevier Page 22 of 24
2019
-
[49]
Simba: Simplified mamba-based architecture for vision and multivariate time series
Patro, B.N., Agneeswaran, V.S., 2024. Simba: Simplified mamba-based architecture for vision and multivariate time series. arXiv preprint arXiv:2403.15360
-
[50]
Efficientvmamba: Atrous selective scan for light weight visual mamba
Pei, X., Huang, T., Xu, C., 2024. Efficientvmamba: Atrous selective scan for light weight visual mamba. arXiv preprint arXiv:2403.09977
-
[51]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R., 2023. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Thinking in frequency: Face forgery detection by mining frequency-aware clues, in: European conference on computer vision, Springer
Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J., 2020. Thinking in frequency: Face forgery detection by mining frequency-aware clues, in: European conference on computer vision, Springer. pp. 86–103
2020
-
[53]
Zero-shot text-to-image generation, in: International conference on machine learning, Pmlr
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I., 2021. Zero-shot text-to-image generation, in: International conference on machine learning, Pmlr. pp. 8821–8831
2021
-
[54]
Autoregressive pretraining with mamba in vision
Ren, S., Li, X., Tu, H., Wang, F., Shu, F., Zhang, L., Mei, J., Yang, L., Wang, P., Wang, H., et al., 2024. Autoregressive pretraining with mamba in vision. arXiv preprint arXiv:2406.07537
-
[55]
Ricker, J., Damm, S., Holz, T., Fischer, A., 2022. Towards the detection of diffusion model deepfakes. arXiv preprint arXiv:2210.14571
-
[56]
High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B., 2022. High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695
2022
-
[57]
Faceforensics++: Learning to detect manipulated facial images, in: Proceedings of the IEEE/CVF international conference on computer vision, pp
Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M., 2019. Faceforensics++: Learning to detect manipulated facial images, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 1–11
2019
-
[58]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K., 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 22500–22510
2023
-
[59]
Saharia,C.,Chan,W.,Saxena,S.,Li,L.,Whang,J.,Denton,E.L.,Ghasemipour,K.,GontijoLopes,R.,KaragolAyan,B.,Salimans,T.,etal.,
-
[60]
Advances in neural information processing systems 35, 36479–36494
Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35, 36479–36494
-
[61]
Stylegan-xl: Scaling stylegan to large diverse datasets, in: ACM SIGGRAPH 2022 conference proceedings, pp
Sauer, A., Schwarz, K., Geiger, A., 2022. Stylegan-xl: Scaling stylegan to large diverse datasets, in: ACM SIGGRAPH 2022 conference proceedings, pp. 1–10
2022
-
[62]
Instantbooth:Personalizedtext-to-imagegenerationwithouttest-timefinetuning,in:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Shi,J.,Xiong,W.,Lin,Z.,Jung,H.J.,2024a. Instantbooth:Personalizedtext-to-imagegenerationwithouttest-timefinetuning,in:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8543–8552
-
[63]
Multi-scale vmamba: Hierarchy in hierarchy visual state space model
Shi, Y., Dong, M., Xu, C., 2024b. Multi-scale vmamba: Hierarchy in hierarchy visual state space model. arXiv preprint arXiv:2405.14174
-
[64]
10819–10829
Shi,Y.,Li,M.,Dong,M.,Xu,C.,2025.Vssd:Visionmambawithnon-causalstatespaceduality,in:ProceedingsoftheIEEE/CVFInternational Conference on Computer Vision, pp. 10819–10829
2025
-
[65]
Deep unsupervised learning using nonequilibrium thermodynamics, in: International conference on machine learning, PMLR
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S., 2015. Deep unsupervised learning using nonequilibrium thermodynamics, in: International conference on machine learning, PMLR. pp. 2256–2265
2015
-
[66]
Mamba: Multi-level aggregation via memory bank for video object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp
Sun, G., Hua, Y., Hu, G., Robertson, N., 2021. Mamba: Multi-level aggregation via memory bank for video object detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2620–2627
2021
-
[67]
7184–7192
Tan,C.,Tao,R.,Liu,H.,Gu,G.,Wu,B.,Zhao,Y.,Wei,Y.,2025.C2p-clip:Injectingcategorycommonpromptincliptoenhancegeneralization in deepfake detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7184–7192
2025
-
[68]
Frequency-awaredeepfakedetection:Improvinggeneralizabilitythroughfrequency space domain learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp
Tan,C.,Zhao,Y.,Wei,S.,Gu,G.,Liu,P.,Wei,Y.,2024a. Frequency-awaredeepfakedetection:Improvinggeneralizabilitythroughfrequency space domain learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5052–5060
-
[69]
Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y., 2024b. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 28130– 28139
-
[70]
Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y., 2023. Learning on gradients: Generalized artifacts representation for gan-generated images detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12105–12114
2023
-
[71]
Scalable visual state space model with fractal scanning
Tang, L., Xiao, H., Jiang, P.T., Zhang, H., Chen, J., Li, B., 2024. Scalable visual state space model with fractal scanning. arXiv preprint arXiv:2405.14480
-
[72]
Training data-efficient image transformers & distillation through attention, in: International conference on machine learning, PMLR
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H., 2021. Training data-efficient image transformers & distillation through attention, in: International conference on machine learning, PMLR. pp. 10347–10357
2021
-
[73]
Powersgd: Practical low-rank gradient compression for distributed optimization
Vogels, T., Karimireddy, S.P., Jaggi, M., 2019. Powersgd: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems 32
2019
-
[74]
Mamba-r: Vision mamba also needs registers
Wang, F., Wang, J., Ren, S., Wei, G., Mei, J., Shao, W., Zhou, Y., Yuille, A., Xie, C., 2024. Mamba-r: Vision mamba also needs registers. arXiv preprint arXiv:2405.14858
-
[75]
Cnn-generatedimagesaresurprisinglyeasytospot...fornow,in:Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp
Wang,S.Y.,Wang,O.,Zhang,R.,Owens,A.,Efros,A.A.,2020. Cnn-generatedimagesaresurprisinglyeasytospot...fornow,in:Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8695–8704
2020
-
[76]
Dire for diffusion-generated image detection
Wang, Z., Bao, J., Zhou, W., Wang, W., Hu, H., Chen, H., Li, H., 2023. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295
-
[77]
Fd-gan: Generalizable and robust forgery detection via generative adversarial networks
Xu, N., Feng, W., Zhang, T., Zhang, Y., 2024. Fd-gan: Generalizable and robust forgery detection via generative adversarial networks. International Journal of Computer Vision , 1–19
2024
-
[78]
Tall: Thumbnail layout for deepfake video detection, in: Proceedings of the IEEE/CVF international conference on computer vision, pp
Xu, Y., Liang, J., Jia, G., Yang, Z., Zhang, Y., He, R., 2023. Tall: Thumbnail layout for deepfake video detection, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 22658–22668
2023
-
[79]
Raphael: Text-to-image generation via large mixture of diffusion paths
Xue, Z., Song, G., Guo, Q., Liu, B., Zong, Z., Liu, Y., Luo, P., 2024. Raphael: Text-to-image generation via large mixture of diffusion paths. Advances in Neural Information Processing Systems 36
2024
-
[80]
Plainmamba: Improving non-hierarchical mamba in visual recognition
Yang, C., Chen, Z., Espinosa, M., Ericsson, L., Wang, Z., Liu, J., Crowley, E.J., 2024. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv preprint arXiv:2403.17695
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.