pith. sign in

arxiv: 2606.18554 · v1 · pith:GRK3BIS4new · submitted 2026-06-17 · 💻 cs.CV

Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

Pith reviewed 2026-06-26 21:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords synthetic image detectiondiffusion modelsdisaster imagerybenchmark datasetcross-domain generalizationAI forensicsdeepfake detectionimage authenticity
0
0 comments X

The pith

Detectors for AI-generated disaster images lose up to 50% accuracy when tested on new diffusion models or disaster types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a benchmark of 30,000 real and synthetic images of floods, fires, and earthquakes produced by four different diffusion models to test current detection methods. It finds that fine-tuned detectors succeed only on the exact models and disaster categories used in training but drop sharply on anything else because they latch onto generator-specific artifacts. Zero-shot detectors fare no better and show unstable results across the board. If true, this means existing tools cannot reliably protect against fake disaster imagery in cybersecurity, forensics, or emergency response, where such images could spread quickly and cause real harm.

Core claim

The Forged Calamity benchmark contains 6,000 real images and 24,000 synthetic ones from four diffusion models across multiple disaster categories. Comprehensive tests show fine-tuned detectors achieve strong in-distribution performance but suffer accuracy losses of up to 50% on unseen generators or disaster types due to overfitting on model-specific cues. Zero-shot generalized detectors maintain only limited and unstable accuracy, with few models showing any resilience. These results establish persistent cross-domain and cross-model generalization gaps in current forensic approaches.

What carries the argument

The Forged Calamity benchmark dataset together with its cross-generator and cross-disaster evaluation protocol, which measures how detection accuracy changes when both the image source and the disaster category are held out.

If this is right

  • Fine-tuned detectors cannot be trusted for real-world use because they overfit to artifacts unique to each diffusion model.
  • Zero-shot detectors lack the stability needed to handle the growing variety of text-to-image generators.
  • Detection methods must move beyond model-specific cues toward features that remain consistent across generators.
  • The benchmark exposes the need for new training strategies that explicitly target cross-domain robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future detectors might improve by training on mixtures of many generators rather than single ones.
  • Similar generalization failures likely appear in other domains such as synthetic medical or news imagery.
  • The benchmark could serve as a standard testbed for any new forensic method claiming broad applicability.

Load-bearing premise

The four selected diffusion models and the chosen disaster categories form a representative sample of all possible synthetic disaster images.

What would settle it

A detector that keeps accuracy above 85% on every held-out combination of generator and disaster type within the 30,000-image benchmark.

Figures

Figures reproduced from arXiv: 2606.18554 by Anh-Tuan Vo, Duc-Manh Phan, Duy-Khang Do, Hai-Dang Nguyen, Isao Echizen, Mai-Khiem Tran, Minh-Triet Tran, Quoc-Duy Tran, Tam V. Nguyen, Trong Le Do, Trung-Nghia Le, Vinh-Tiep Nguyen.

Figure 1
Figure 1. Figure 1: Representative examples of natural disaster categories from the Incident [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the synthetic image generation pipeline. The process consists [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Examples of AI-generated images within our Forged Calamity dataset. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Failure cases, tested on SigLipv2. gesting that these models may not possess a sufficiently holistic understanding of both global and local image information. They often fail to preserve coherent color composition and tend to overlook subtle local anomalies, such as irregular text or unnatural object details, which ultimately prevents them from reliably distinguishing fake images from real ones. 5 Conclusi… view at source ↗
read the original abstract

The rapid advancement of text-to-image diffusion models has enabled the creation of highly photorealistic synthetic images that closely resemble real photographs, making it increasingly difficult to distinguish authentic content from AI-generated fabrications. This poses challenges for cybersecurity, digital forensics, and disaster response, where fake imagery of floods, fires, or earthquakes can spread misinformation or disrupt emergency operations. To address this, we introduce Forged Calamity, a benchmark dataset for synthetic disaster detection containing 30,000 images, including 6,000 real and 24,000 synthetic samples generated by four diffusion models. Comprehensive experiments across fine-tuned and zero-shot settings reveal consistent weaknesses in current forensic approaches. Fine-tuned detectors perform well in-distribution but lose up to 50\% accuracy on unseen generators or disaster types, showing overfitting to model-specific artifacts. Zero-shot generalized detectors also struggle to maintain stable accuracy, with only limited resilience in a few representation-robust models. These findings highlight persistent generalization gaps and the urgent need for domain- and model-agnostic detection methods to ensure visual authenticity in the diffusion era.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces the Forged Calamity benchmark dataset of 30,000 images (6,000 real photographs and 24,000 synthetic disaster images generated by four diffusion models) and reports experiments showing that fine-tuned detectors achieve strong in-distribution performance but suffer accuracy drops of up to 50% on unseen generators or disaster types due to overfitting to model-specific artifacts, while zero-shot detectors also exhibit unstable accuracy across domains.

Significance. If the reported cross-domain and cross-model generalization gaps are confirmed with additional controls, the work would provide a useful domain-specific benchmark for synthetic disaster imagery detection and empirically document concrete limitations of current forensic methods in a high-stakes application area.

major comments (2)
  1. [Abstract] Abstract: the central claim of up to 50% accuracy drops on unseen generators rests on the four diffusion models being representative of the broader space of synthetic disaster imagery; no justification is given for model selection or diversity of training corpora and artifact patterns, which directly affects whether the observed overfitting generalizes beyond the chosen set.
  2. [Abstract] Abstract: the reported accuracy drops and overfitting observations lack accompanying details on data splits, statistical tests, or controls for confounds such as image resolution or disaster-type balance, which are required to verify the generalization claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and the need for stronger support of our generalization claims. We address each point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of up to 50% accuracy drops on unseen generators rests on the four diffusion models being representative of the broader space of synthetic disaster imagery; no justification is given for model selection or diversity of training corpora and artifact patterns, which directly affects whether the observed overfitting generalizes beyond the chosen set.

    Authors: We agree the abstract provides no justification for selecting the four diffusion models or for claiming they represent broader artifact diversity. In revision we will add a concise statement to the abstract noting the models span open-source and closed-source systems with distinct training regimes, and we will expand the methods section with explicit discussion of their coverage of artifact patterns. This directly addresses whether the reported drops generalize. revision: yes

  2. Referee: [Abstract] Abstract: the reported accuracy drops and overfitting observations lack accompanying details on data splits, statistical tests, or controls for confounds such as image resolution or disaster-type balance, which are required to verify the generalization claims.

    Authors: We acknowledge the abstract omits these details. The full manuscript already specifies an 80/20 per-domain split, uniform resizing to 512x512, and balanced disaster-type sampling, but we will revise the abstract to reference these controls explicitly and add statistical significance testing (paired t-tests across runs) to the experiments section to substantiate the accuracy drops. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark study with no derivations or self-referential predictions

full rationale

The paper introduces the Forged Calamity dataset (30k images from 4 diffusion models) and reports direct experimental results on fine-tuned and zero-shot detectors. No equations, fitted parameters, predictions, or uniqueness theorems are present. All claims (e.g., up to 50% accuracy drop) are empirical measurements on the introduced data, with no reduction to inputs by construction or self-citation chains. This matches the default non-circular case for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark paper; no free parameters, mathematical axioms, or invented entities are introduced or required for the central claims.

pith-pipeline@v0.9.1-grok · 5772 in / 1122 out tokens · 30183 ms · 2026-06-26T21:30:43.907497+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 22 canonical work pages · 8 internal anchors

  1. [1]

    AI@Meta: Llama 3 model card.https://github.com/meta-llama/llama3/blob/ main/MODEL_CARD.md(2024)

  2. [2]

    CrisisMMD: Multimodal Twitter Datasets from Natural Disasters

    Alam,F.,Ofli,F.,Imran,M.:Crisismmd:Multimodaltwitterdatasetsfromnatural disasters. arXiv preprint arXiv:1805.00713 (2018),https://arxiv.org/abs/1805. 00713

  3. [3]

    arXiv preprint arXiv:2403.04692 (2024),https://arxiv.org/ abs/2403.04692

    Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: PixArt-Sigma: Weak-to-strong training of diffusion transformer for 4k text-to- image generation. arXiv preprint arXiv:2403.04692 (2024),https://arxiv.org/ abs/2403.04692

  4. [4]

    Cioni, D., Tzelepis, C., Seidenari, L., Patras, I.: Are clip features all you need for universal synthetic image origin attribution? (2024)

  5. [5]

    arXiv preprint arXiv:2211.00680 (2022),https://arxiv.org/abs/2211.00680

    Corvi, R., Cozzolino, D., Zingarini, G., Poggi, G., Nagano, K., Verdoliva, L.: On the detection of synthetic images generated by diffusion models. arXiv preprint arXiv:2211.00680 (2022),https://arxiv.org/abs/2211.00680

  6. [6]

    Diffusion Models Beat GANs on Image Synthesis

    Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. arXiv preprint arXiv:2105.05233 (2021),https://arxiv.org/abs/2105.05233

  7. [7]

    arXiv preprint arXiv:2003.08685 (2020),https://arxiv.org/abs/2003.08685

    Frank, J., Eisenhofer, T., Sch¨ onherr, L., Fischer, A., Kolossa, D., Holz, T.: Leveraging frequency analysis for deep fake image recognition. arXiv preprint arXiv:2003.08685 (2020),https://arxiv.org/abs/2003.08685

  8. [8]

    Generative Adversarial Networks

    Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. arXiv preprint arXiv:1406.2661 (2014),https://arxiv.org/abs/1406.2661

  9. [9]

    Journal of Imaging8(10), 263 (Sep 2022).https://doi.org/10.3390/ jimaging8100263

    Guarnera, L., Giudice, O., Guarnera, F., et al.: The face deepfake detection chal- lenge. Journal of Imaging8(10), 263 (Sep 2022).https://doi.org/10.3390/ jimaging8100263

  10. [10]

    arXiv preprint arXiv:1911.09296 (2019),https://arxiv.org/abs/arXiv: 1911.09296 14 Duc-Manh Phan et al

    Gupta, R., Hosfelt, R., Sajeev, S., Patel, N., Goodman, B., Doshi, J., Heim, E., Choset, H., Gaston, M.: xbd: A dataset for assessing building damage from satellite imagery. arXiv preprint arXiv:1911.09296 (2019),https://arxiv.org/abs/arXiv: 1911.09296 14 Duc-Manh Phan et al

  11. [11]

    He,K.,Zhang,X.,Ren,S.,Sun,J.:Deepresiduallearningforimagerecognition.In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)

  12. [12]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Karageorgiou, D., Papadopoulos, S., Kompatsiaris, I., Gavves, E.: Any-resolution ai-generated image detection by spectral learning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18706–18717 (2025)

  13. [13]

    A Style-Based Generator Architecture for Generative Adversarial Networks

    Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948 (2019),https://arxiv. org/abs/1812.04948

  14. [14]

    In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G

    Koutlis, C., Papadopoulos, S.: Leveraging representations from intermediate encoder-blocks for synthetic image detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 394–411. Springer Nature Switzerland, Cham (2025)

  15. [15]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Liu, H., Tan, Z., Tan, C., Wei, Y., Wang, J., Zhao, Y.: Forgery-aware adaptive transformer for generalizable synthetic image detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  16. [16]

    Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021),https://arxiv.org/abs/2103.14030

  17. [17]

    First Monday30(4) (Apr 2025).https://doi.org/10.5210/fm.v30i4.13799,https: //firstmonday.org/ojs/index.php/fm/article/view/13799

    Monahan, L., Martin, M., Zaitsev, A., Bartelt, V., Noordeen, A.R.: Do I spy AI? The impact of AI-generated images on trust and donation behaviors. First Monday30(4) (Apr 2025).https://doi.org/10.5210/fm.v30i4.13799,https: //firstmonday.org/ojs/index.php/fm/article/view/13799

  18. [18]

    arXiv preprint arXiv:2506.19261 (2025)

    Nguyen, Q.B., Hoang, T.V., Tran, N.D., Nguyen, T.V., Tran, M.T., Le, T.N.: Automated image recognition framework. arXiv preprint arXiv:2506.19261 (2025)

  19. [19]

    In: Buntine, W., Fjeld, M., Tran, T., Tran, M.T., Huynh Thi Thanh, B., Miyoshi, T

    Nguyen, Y.H., Le, T.N.: Decoding deepfakes: Caption guided learning for robust deepfake detection. In: Buntine, W., Fjeld, M., Tran, T., Tran, M.T., Huynh Thi Thanh, B., Miyoshi, T. (eds.) Information and Communication Technol- ogy. pp. 77–87. Springer Nature Singapore (2025).https://doi.org/10.1007/ 978-981-96-4282-3_7

  20. [20]

    Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize across generative models (2023)

  21. [21]

    org/abs/2302.10174

    Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize acrossgenerativemodels.arXivpreprintarXiv:2302.10174(2024),https://arxiv. org/abs/2302.10174

  22. [22]

    Transactions on Ma- chine Learning Research (2024),https://openreview.net/forum?id=a68SUt6zFt

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Syn- naeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual fe...

  23. [23]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., M¨ uller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution im- age synthesis. arXiv preprint arXiv:2307.01952 (2023),https://arxiv.org/abs/ 2307.01952

  24. [24]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021),https://arxiv.org/abs/ 2103.00020 Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection 15

  25. [25]

    arXiv preprint arXiv:2012.02951 (2020),https://arxiv.org/abs/ arXiv:2012.02951

    Rahnemoonfar, M., Chowdhury, T., Sarkar, A., Varshney, D., Yari, M., Murphy, R.: Floodnet: A high resolution aerial imagery dataset for post flood scene un- derstanding. arXiv preprint arXiv:2012.02951 (2020),https://arxiv.org/abs/ arXiv:2012.02951

  26. [26]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10684– 10695 (June 2022)

  27. [27]

    Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Frequency-aware deepfake detection: Improving generalizability through frequency space learning (2024)

  28. [28]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y.: Learning on gradients: Generalized artifacts representation for gan-generated images detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12105– 12114 (2023)

  29. [29]

    IEEE Journal of Se- lected Topics in Signal Processing14(5), 910–932 (Aug 2020).https://doi.org/ 10.1109/jstsp.2020.3002101

    Verdoliva, L.: Media forensics and DeepFakes: An overview. IEEE Journal of Se- lected Topics in Signal Processing14(5), 910–932 (Aug 2020).https://doi.org/ 10.1109/jstsp.2020.3002101

  30. [30]

    arXiv preprint arXiv:2411.06441 (2024)

    Vesnin, D., Levshun, D., Chechulin, A.: Detecting autoencoder is enough to catch ldm generated images. arXiv preprint arXiv:2411.06441 (2024)

  31. [31]

    vikhyatk: moondream2 (Revision 92d3d73) (2024).https://doi.org/10.57967/ hf/3219,https://huggingface.co/vikhyatk/moondream2

  32. [32]

    In: Buntine, W., Fjeld, M., Tran, T., Tran, M.T., Huynh Thi Thanh, B., Miyoshi, T

    Vo, H.D., Le, T.N.: Minimalist preprocessing approach for image synthe- sis detection. In: Buntine, W., Fjeld, M., Tran, T., Tran, M.T., Huynh Thi Thanh, B., Miyoshi, T. (eds.) Information and Communication Technol- ogy. pp. 88–99. Springer Nature Singapore (2025).https://doi.org/10.1007/ 978-981-96-4282-3_8

  33. [33]

    Wang, Z., Bao, J., Zhou, W., Wang, W., Hu, H., Chen, H., Li, H.: Dire for diffusion- generated image detection (2023)

  34. [34]

    arXiv preprint arXiv:2201.04236 (2022),https://arxiv.org/abs/2201

    Weber, E., Papadopoulos, D.P., Lapedriza, A., Ofli, F., Imran, M., Torralba, A.: Incidents1M: a large-scale dataset of images with natural disasters, damage, and incidents. arXiv preprint arXiv:2201.04236 (2022),https://arxiv.org/abs/2201. 04236

  35. [35]

    arXiv preprint arXiv:2301.00808 (2023)

    Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., Xie, S.: ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. arXiv preprint arXiv:2301.00808 (2023),https://arxiv.org/abs/2301.00808

  36. [36]

    arXiv preprint arXiv:2006.03677 (2020),https: //arxiv.org/abs/2006.03677

    Wu, B., Xu, C., Dai, X., Wan, A., Zhang, P., Yan, Z., Tomizuka, M., Gonzalez, J., Keutzer, K., Vajda, P.: Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677 (2020),https: //arxiv.org/abs/2006.03677

  37. [37]

    Sigmoid Loss for Language Image Pre-Training

    Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343 (2023),https://arxiv.org/abs/ 2303.15343

  38. [38]

    arXiv preprint arXiv:2012.09311 (2021),https://arxiv

    Zhao, T., Xu, X., Xu, M., Ding, H., Xiong, Y., Xia, W.: Learning self-consistency for deepfake detection. arXiv preprint arXiv:2012.09311 (2021),https://arxiv. org/abs/2012.09311