pith. machine review for the scientific record.

arxiv: 2604.07101 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.AI · cs.MM · eess.IV

Recognition: no theorem link

SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

Christopher Leckie, Guansong Pang, Qizhou Wang

Pith reviewed 2026-05-10 18:19 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM · eess.IV
keywords surveillance image forgery · forgery detection · tampering localisation · image dataset · multimodal generation · forensic analysis · image editing models

The pith

The SurFITR dataset shows that forgery detectors trained on standard images degrade on surveillance tampering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SurFITR, a collection of over 137,000 tampered surveillance images created to fill a gap in forgery detection research. Current models, trained mostly on full-image changes or large edits in clear, object-focused photos, perform poorly when tampering is small, subjects are distant or blocked, viewpoints vary, and image quality is low. The authors generate the data with a pipeline that combines multimodal language models and image editors to produce realistic, semantically consistent alterations across many surveillance scenes. Tests confirm that existing detectors lose accuracy on SurFITR, while training on the dataset raises performance both within the same domain and across others.
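To make the pipeline's shape concrete, the sketch below shows a minimal MLLM-guided tampering loop: a multimodal model proposes a small, scene-consistent edit, and an image editor applies it inside a restricted region. Everything here (the EditPlan structure, both stub functions, and the file names) is a hypothetical illustration, not the paper's actual pipeline.

    # Minimal sketch of an MLLM-guided tampering loop. propose_edit and
    # apply_edit are hypothetical stand-ins for the MLLM and the image editor.
    from dataclasses import dataclass

    @dataclass
    class EditPlan:
        region: tuple      # (x, y, w, h) bounding box of the edit
        instruction: str   # natural-language edit description

    def propose_edit(scene_description: str) -> EditPlan:
        """Stand-in for the MLLM step: choose a small, semantically plausible edit."""
        # A real pipeline would query a multimodal LLM with the frame and its
        # scene context; a fixed plan is returned here for illustration only.
        return EditPlan(region=(120, 80, 40, 60),
                        instruction="remove the person near the doorway")

    def apply_edit(image_path: str, plan: EditPlan) -> str:
        """Stand-in for the image-editing model (e.g. an inpainting editor)."""
        # A real pipeline would constrain the editor to plan.region so the
        # tampering stays localised and subtle; here we only report the action.
        x, y, w, h = plan.region
        print(f"edit '{plan.instruction}' in a {w}x{h} box at ({x},{y}) of {image_path}")
        return image_path.replace(".png", "_tampered.png")

    plan = propose_edit("low-resolution CCTV frame, pedestrians at a distance")
    print("tampered frame:", apply_edit("frames/cam03_0001.png", plan))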

Core claim

SurFITR supplies a large set of forensically relevant tampered surveillance images with varied resolutions and edit types, produced through a multimodal LLM-powered pipeline for fine-grained editing, and experiments establish that existing detectors degrade significantly on this data while training on SurFITR brings substantial gains in both in-domain and cross-domain detection and localisation.

What carries the argument

The multimodal LLM-powered pipeline that generates semantically aware, fine-grained edits in diverse surveillance scenes with small or occluded subjects and lower visual quality.

If this is right

  • Detectors trained only on conventional forgery datasets will show marked drops in performance when applied to surveillance imagery.
  • Training or fine-tuning on SurFITR will improve both detection accuracy and localisation precision for in-domain and cross-domain surveillance cases.
  • The dataset enables systematic testing of forgery methods across different edit scales, image resolutions, and scene complexities typical of surveillance; a toy sketch of the resulting transfer protocol follows this list.
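A toy sketch of that protocol, assuming hypothetical companion dataset names alongside SurFITR and a placeholder evaluate() stub; the printed numbers are invented and carry no information about the paper's results.

    # Toy in-domain / cross-domain protocol: one detector per training set,
    # scored on every test set. evaluate() is a placeholder, not real training.
    datasets = ["SurFITR", "CASIA", "IMD2020"]

    def evaluate(train_set: str, test_set: str) -> float:
        # Stand-in for "train a detector on train_set, report AUC on test_set".
        return 0.90 if train_set == test_set else 0.70  # invented numbers

    for train in datasets:
        for test in datasets:
            print(f"train={train:8s} test={test:8s} AUC={evaluate(train, test):.2f}")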

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The dataset could support development of specialised localisation techniques that focus on small regions rather than whole-image analysis.
  • Combining SurFITR with real captured tampered surveillance footage would provide a stronger test of whether the generated examples transfer to actual forensic evidence.
  • The generation approach may extend to creating paired video sequences for testing temporal forgery detection in CCTV streams.

Load-bearing premise

The generated tampered images accurately capture the localised, subtle, and forensically relevant characteristics of real-world surveillance tampering.

What would settle it

A demonstration that existing detectors maintain high accuracy on SurFITR without retraining, or that training on SurFITR produces no measurable improvement on independent real surveillance forgery test sets, would undermine the central claim.

Figures

Figures reproduced from arXiv: 2604.07101 by Christopher Leckie, Guansong Pang, Qizhou Wang.

Figure 1: Visualisations from SurFITR showing realistic, fine-grained tampering across diverse surveillance scenes.
Figure 2: Comparison between SurFITR and prior datasets.
Figure 4: Overview of the SurFITR MLLM-powered tampering pipeline.
original abstract

We present the Surveillance Forgery Image Test Range (SurFITR), a dataset for surveillance-style image forgery detection and localisation, in response to recent advances in open-access image generation models that raise concerns about falsifying visual evidence. Existing forgery models, trained on datasets with full-image synthesis or large manipulated regions in object-centric images, struggle to generalise to surveillance scenarios. This is because tampering in surveillance imagery is typically localised and subtle, occurring in scenes with varied viewpoints, small or occluded subjects, and lower visual quality. To address this gap, SurFITR provides a large collection of forensically valuable imagery generated via a multimodal LLM-powered pipeline, enabling semantically aware, fine-grained editing across diverse surveillance scenes. It contains over 137k tampered images with varying resolutions and edit types, generated using multiple image editing models. Extensive experiments show that existing detectors degrade significantly on SurFITR, while training on SurFITR yields substantial improvements in both in-domain and cross-domain performance. SurFITR is publicly available on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SurFITR, a dataset of over 137k tampered surveillance images generated via a multimodal LLM-powered pipeline for semantically aware, fine-grained editing. It claims that existing forgery detectors trained on full-image synthesis or large-region manipulations in object-centric images fail to generalize to surveillance scenarios involving localized, subtle edits in varied-viewpoint, low-quality scenes with small or occluded subjects. Experiments are reported to show significant performance degradation of prior detectors on SurFITR and substantial gains in both in-domain and cross-domain settings when models are trained on the new dataset.

Significance. If the generated edits prove representative of authentic surveillance tampering, SurFITR would address a clear gap in image forensics by supplying a large-scale, domain-specific benchmark. The scale (137k+ images across multiple editing models and resolutions) and the reported cross-domain improvements could support development of more robust detectors for real-world evidence verification, where surveillance footage is common. Public release on GitHub further aids reproducibility.

major comments (2)
  1. [Dataset construction and Experiments sections] The central claims of detector degradation and retraining gains rest on the premise that the LLM pipeline produces tampered images that are forensically realistic and representative of real surveillance tampering (localized, subtle edits). No validation is described—such as quantitative comparison of edit sizes/locations to real tampered surveillance data, perceptual studies, or forensic-expert assessment of artifact realism—that would confirm this assumption holds for the reported generalization results.
  2. [Experiments] The abstract and results summary state that existing detectors 'degrade significantly' and training yields 'substantial improvements,' yet the provided description lacks concrete metrics (e.g., AUC, F1, IoU for localization), baseline tables with error bars, or statistical tests. Without these, the magnitude and reliability of the claimed gains cannot be assessed.
minor comments (2)
  1. [Dataset description] Clarify the exact number of editing models used and the distribution of edit types (e.g., object insertion vs. removal) across the 137k images to allow readers to judge diversity.
  2. [Introduction] Ensure that all prior forgery datasets referenced in the introduction and related work are cited with their original publication details and access information.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on dataset validation and the need for clearer quantitative reporting. We address each major comment below and will revise the manuscript to strengthen these aspects.

point-by-point responses
  1. Referee: [Dataset construction and Experiments sections] The central claims of detector degradation and retraining gains rest on the premise that the LLM pipeline produces tampered images that are forensically realistic and representative of real surveillance tampering (localized, subtle edits). No validation is described—such as quantitative comparison of edit sizes/locations to real tampered surveillance data, perceptual studies, or forensic-expert assessment of artifact realism—that would confirm this assumption holds for the reported generalization results.

    Authors: We agree that explicit validation of forensic realism strengthens the claims. The multimodal LLM pipeline was engineered to generate semantically coherent, localized edits suited to surveillance characteristics (varied viewpoints, small/occluded subjects, lower quality), which differs from prior datasets. Direct comparison to real tampered surveillance data is not feasible at scale because no large public datasets of verified real-world surveillance forgeries with ground-truth masks exist. In revision we will add quantitative statistics on generated edit sizes, locations, and semantic categories in the Dataset Construction section, include a limitations discussion on this point, and report a small-scale perceptual study with non-expert raters to assess visual plausibility (a toy sketch of such mask statistics appears below, after this exchange). revision: partial

  2. Referee: [Experiments] The abstract and results summary state that existing detectors 'degrade significantly' and training yields 'substantial improvements,' yet the provided description lacks concrete metrics (e.g., AUC, F1, IoU for localization), baseline tables with error bars, or statistical tests. Without these, the magnitude and reliability of the claimed gains cannot be assessed.

    Authors: We acknowledge that the summary-level claims require supporting numbers for full assessment. The full Experiments section already reports AUC, F1, and IoU values for detection and localization across baselines, in-domain, and cross-domain settings. To address the concern we will revise the section to present these results in expanded tables that include error bars from multiple random seeds, add statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank), and ensure key quantitative findings are referenced more explicitly in the abstract and results summary. revision: yes
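For concreteness, a minimal sketch of the metric types and significance test named above, on synthetic numbers: image-level AUC via scikit-learn, pixel-level IoU from binary masks, and a Wilcoxon signed-rank test over paired per-seed scores. All values are invented and say nothing about the paper's actual results.

    # Synthetic illustration of the reported metric types and significance test.
    import numpy as np
    from scipy.stats import wilcoxon
    from sklearn.metrics import roc_auc_score

    def pixel_iou(pred, gt):
        """IoU between a binarised localisation map and the ground-truth mask."""
        union = np.logical_or(pred, gt).sum()
        return np.logical_and(pred, gt).sum() / union if union else 1.0

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=200)            # 0 = pristine, 1 = tampered
    scores = labels * 0.6 + rng.random(200) * 0.4    # toy detector confidences
    print("AUC:", roc_auc_score(labels, scores))

    gt = np.zeros((64, 64), dtype=bool); gt[20:30, 20:30] = True
    pred = np.zeros((64, 64), dtype=bool); pred[22:32, 22:32] = True
    print("IoU:", pixel_iou(pred, gt))               # overlapping 10x10 boxes

    # Paired per-seed comparison of two models (e.g. F1 over five seeds).
    model_a = np.array([0.71, 0.73, 0.70, 0.74, 0.72])
    model_b = np.array([0.64, 0.66, 0.63, 0.65, 0.64])
    print("Wilcoxon:", wilcoxon(model_a, model_b))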

standing simulated objections not resolved
  • Direct quantitative comparison of edit sizes/locations to real tampered surveillance data, because no sufficiently large public dataset of verified real-world surveillance forgeries with localization masks is available.
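While that comparison remains open, the edit-size and edit-location statistics promised in the rebuttal can still be computed directly from ground-truth masks. A minimal sketch, assuming binary numpy masks; the mask below is synthetic and only illustrates the localised, subtle regime the dataset targets.

    # Hedged sketch: edit-size and edit-location statistics from a binary
    # ground-truth mask, as the revision proposes to report.
    import numpy as np

    def edit_stats(mask):
        """Area fraction and normalised centroid of the tampered region."""
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            return {"area_frac": 0.0, "centroid": None}
        h, w = mask.shape
        return {"area_frac": xs.size / mask.size,
                "centroid": (float(xs.mean()) / w, float(ys.mean()) / h)}

    mask = np.zeros((480, 640), dtype=bool)
    mask[200:260, 300:340] = True   # small tampered box, 60 rows by 40 columns
    print(edit_stats(mask))         # area_frac ~ 0.0078, centroid ~ (0.50, 0.48)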

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces SurFITR, a new dataset of tampered surveillance images generated via a multimodal LLM pipeline, and evaluates external forgery detectors on it. Claims rest on empirical results showing detector degradation on SurFITR and gains from training on it. No equations, parameter fits, or derivations are present. No load-bearing self-citations or uniqueness theorems reduce any result to the paper's own inputs by construction. The work is a self-contained empirical contribution benchmarked against independent detectors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM-guided editing produces realistic, forensically representative tampering; edit subtlety and scene diversity are not independently verified beyond the generation process itself.

axioms (1)
  • domain assumption: Multimodal LLMs combined with image editing models can generate semantically aware, fine-grained, and forensically valuable edits that mimic real surveillance tampering.
    Invoked in the description of the data generation pipeline as the basis for creating the 137k images.

pith-pipeline@v0.9.0 · 5489 in / 1407 out tokens · 76225 ms · 2026-05-10T18:19:28.123097+00:00 · methodology

