pith. machine review for the scientific record.

arxiv: 2605.12002 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization

Minh-Hoang Le, Minh-Khoa Le-Phan, Minh-Triet Tran, Trong-Le Do

Pith reviewed 2026-05-13 07:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords image forgery localization · edge-guided segmentation · synthetic heatmap · cross-domain generalization · frequency edge detection · patch-based localization · multi-resolution processing

The pith

A dual-branch network localizes image forgeries at native resolution by fusing frequency edge cues with patch-level synthetic priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a patch-based method to find manipulated regions in images that have been edited with realistic text-guided inpainting. One branch detects high-frequency inconsistencies at manipulation boundaries using a frequency edge detector and a fine-tuned segmentation model on RGB plus edge features. The second branch uses a fine-tuned vision transformer to classify entire patches as fully synthetic, supplying coarse location hints. These two signals complement each other so the system can handle images of any size without downsampling and still work across different editing styles or datasets. The result is claimed to be accurate localization that stays resolution-agnostic.
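
Figure 1 states that the two branch outputs are combined by pixel-wise max. A minimal sketch of that fusion step follows, assuming both branches already emit per-pixel probabilities at the image's native resolution; the function and argument names are illustrative, not taken from the authors' code.

```python
import numpy as np

def fuse_masks(egs_prob: np.ndarray, sh_heatmap: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Fuse Edge-Guided Segmentation probabilities with Synthetic Heatmapping
    scores by pixel-wise max, then binarize into a forgery localization mask.

    egs_prob   : (H, W) per-pixel manipulation probability from the EGS branch.
    sh_heatmap : (H, W) per-pixel synthetic score from the SH branch.
    """
    assert egs_prob.shape == sh_heatmap.shape, "branch outputs must share the native resolution"
    fused = np.maximum(egs_prob, sh_heatmap)      # pixel-wise max, as described in Fig. 1
    return (fused >= threshold).astype(np.uint8)  # binary localization mask
```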

Core claim

The central claim is that a dual-branch architecture, consisting of an Edge-Guided Segmentation module that emphasizes manipulation boundaries via frequency analysis and a Synthetic Heatmapping module that flags fully synthetic patches, produces comprehensive forgery masks for arbitrary-resolution images while maintaining strong performance when the test images come from domains different from the training data.

What carries the argument

The dual-branch EDGER framework, in which frequency-based edge features sharpen boundaries inside mixed patches while patch-level synthetic classification supplies coarse location priors.
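
Figure 2 lists the fixed inputs to the Frequency-based Edge Detector: high-pass residuals plus Sobel and Laplacian gradients computed from RGB. The sketch below only assembles those fixed-filter cues; the learned multi-branch fusion head, SRM-like residuals, and multi-scale supervision described in the paper are omitted, and the filter choices (e.g. the Gaussian sigma) are illustrative.

```python
import numpy as np
from scipy import ndimage

def edge_cue_stack(rgb: np.ndarray) -> np.ndarray:
    """Illustrative fixed-filter cues for a frequency-based edge detector:
    Sobel gradient magnitude, Laplacian response, and a high-pass residual.

    rgb: (H, W, 3) float image in [0, 1]. Returns an (H, W, 3) cue stack.
    """
    gray = rgb.mean(axis=-1)
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    sobel_mag = np.hypot(gx, gy)                                 # first-order edge strength
    laplacian = ndimage.laplace(gray)                            # second-order edge response
    highpass = gray - ndimage.gaussian_filter(gray, sigma=2.0)   # high-frequency residual
    return np.stack([sobel_mag, laplacian, highpass], axis=-1)
```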

If this is right

  • The method processes multi-megapixel images at their original size instead of requiring downsampling (a tiling sketch follows this list).
  • Localization accuracy improves when edge cues and synthetic patch scores are combined rather than used separately.
  • The approach supports cross-domain testing without retraining for each new image source or manipulation type.
  • Patch-level decisions allow the system to flag fully synthetic regions even when boundary edges are weak.
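
As referenced in the first bullet, the native-resolution handling described in Figure 1 (non-overlapping patches, padded to sizes divisible by 32, stitched back to full resolution) can be sketched as follows. `predict_patch`, the patch size, and the padding mode are placeholders, not the authors' implementation.

```python
import numpy as np

def localize_full_resolution(image: np.ndarray, predict_patch, patch: int = 512) -> np.ndarray:
    """Tile an arbitrary-resolution image into non-overlapping patches, pad each
    patch so both sides are divisible by 32 (a SegFormer-style stride constraint),
    score each patch, and stitch the crops back at native resolution.

    predict_patch: placeholder for the EGS branch, (h, w, 3) -> (h, w) scores.
    """
    H, W = image.shape[:2]
    out = np.zeros((H, W), dtype=np.float32)
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            crop = image[y:y + patch, x:x + patch]
            h, w = crop.shape[:2]
            ph, pw = (-h) % 32, (-w) % 32                            # pad to next multiple of 32
            padded = np.pad(crop, ((0, ph), (0, pw), (0, 0)), mode="edge")
            out[y:y + h, x:x + w] = predict_patch(padded)[:h, :w]   # drop the padded border
    return out
```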

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same edge-plus-classification split could be tested on video forgery detection by treating frames as patches.
  • If the frequency edge detector is replaced with a learned edge model, the framework might adapt to smoother manipulation styles.
  • The coarse-to-fine refinement pattern might apply to other localization tasks such as medical image anomaly detection.

Load-bearing premise

Edge information is useful mainly in patches that mix real and fake pixels, and fine-tuning the segmentation and classification models will produce reliable results on unseen image domains.

What would settle it

Running the system on a set of images whose manipulations have been smoothed to remove high-frequency boundary signals, or on a new editing domain where both branches fail to flag the altered regions, would show whether the localization remains accurate.
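
One way such a probe could be constructed, assuming ground-truth manipulation masks are available: blur a narrow band around each forgery boundary so that the high-frequency splicing evidence the edge branch relies on is suppressed. The band width and blur strength below are arbitrary illustration values, not settings from the paper.

```python
import numpy as np
from scipy import ndimage

def smooth_manipulation_boundary(image: np.ndarray, mask: np.ndarray,
                                 sigma: float = 3.0, band: int = 5) -> np.ndarray:
    """Blend a blurred copy of the image back in along a narrow band around the
    manipulation boundary, erasing sharp splicing edges.

    image: (H, W, 3) float array; mask: (H, W) binary manipulation mask.
    """
    m = mask.astype(bool)
    dilated = ndimage.binary_dilation(m, iterations=band)
    eroded = ndimage.binary_erosion(m, iterations=band)
    boundary_band = dilated & ~eroded                        # pixels near the forgery edge
    blurred = ndimage.gaussian_filter(image, sigma=(sigma, sigma, 0))
    out = image.copy()
    out[boundary_band] = blurred[boundary_band]              # suppress high-frequency evidence
    return out
```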

Figures

Figures reproduced from arXiv: 2605.12002 by Minh-Hoang Le, Minh-Khoa Le-Phan, Minh-Triet Tran, Trong-Le Do.

Figure 1
Figure 1. Overall localization pipeline. A patch-based dual-branch design: (i) the EGS branch injects a frequency-based edge prior (Section 3.1) into a SegFormer decoder; non-overlapping patches, padded to sizes divisible by 32, are stitched to full resolution. (ii) The SH branch applies a CLIP-ViT classifier over sliding windows with Hann blending to form a dense heatmap. The two outputs are fused by pixel-wise max… view at source ↗
Figure 2
Figure 2. Pipeline of the Frequency-based Edge Detector. The network extracts high-pass and SRM-like residuals, computes fixed Sobel and Laplacian gradients from RGB, and fuses them through a multi-branch convolutional head composed of residual and dilated context blocks with global pooling. The head integrates frequency, gradient, and structural cues to predict soft edge probability maps supervised by multi-scale t… view at source ↗
Figure 3
Figure 3. Predicted edge maps and their augmented counterparts used to simulate challenging test-time conditions. Augmentations lightly degrade or fragment boundaries (local background mixing, temperature softening, Gaussian blur, segment breaks, band-limited noise) to improve robustness. view at source ↗
Figure 4
Figure 4. Each group of five images shows, from left to right, the EGS mask, the SH mask, the fused mask, the ground-truth mask, and the original image. In (a)–(c), showing large images, SH supplies strong region cues that complement EGS. In (d), showing a small image, EGS captures fine details and performs well even without SH. The fused output combines these complementary strengths. view at source ↗
Figure 5
Figure 5. Each triplet shows, from left to right, the SH mask, the ground-truth mask, and the original image. In (a)–(b), showing large images, SH captures coarse regions but misses fine boundaries and small structures. In (c)–(d), showing small images, SH produces blurry responses and overlooks many manipulated pixels. This shows that SH alone is insufficient without complementary edge guidance and fusion. view at source ↗
read the original abstract

Text-guided inpainting has made image forgery increasingly realistic, challenging both synthetic image detection (SID) and image forgery localization (IFL). However, existing methods often struggle to point out suspicious signals across domains. To address this problem, we propose EDGER, a patch-based, dual-branch framework that localizes manipulated regions in arbitrary-resolution images without sacrificing native resolution. The first branch, Edge-Guided Segmentation, introduces a Frequency-based Edge Detector to emphasize high-frequency inconsistencies at manipulation boundaries, and fine-tunes a SegFormer to fuse RGB and edge features for pixel-level masks. Since edge evidence is most informative only when patches contain both authentic and manipulated pixels, we complement Edge-Guided Segmentation with a Synthetic Heatmapping branch, a classification-based localizer that fine-tunes a CLIP-ViT image encoder with LoRA to flag fully synthetic patches. Together, Synthetic Heatmapping provides coarse, patch-level synthetic priors, while Edge-Guided Segmentation sharpens boundaries within partially manipulated patches, yielding comprehensive localization. Evaluated in the setting of the MediaEval 2025 SynthIM challenge's Manipulated Region Localization Task, our approach scales to multi-megapixel imagery and exhibits strong cross-domain generalization. Extensive ablations highlight the complementary roles of frequency-based edge cues and patch-level synthetic priors in driving accurate, resolution-agnostic localization.
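
Figure 1 describes the Synthetic Heatmapping branch as a sliding-window CLIP-ViT classifier whose patch scores are blended with a Hann window into a dense heatmap. The sketch below shows only that blending step; `score_patch` stands in for the LoRA-tuned classifier, and the window size and stride are assumptions rather than the paper's settings.

```python
import numpy as np

def synthetic_heatmap(image: np.ndarray, score_patch, win: int = 224, stride: int = 112) -> np.ndarray:
    """Slide a window over the image, score each crop with a patch-level
    classifier (probability the crop is fully synthetic), and blend the scores
    into a dense heatmap with a 2-D Hann window that down-weights crop borders.
    """
    H, W = image.shape[:2]
    assert H >= win and W >= win, "sketch assumes the image is at least one window in size"
    heat = np.zeros((H, W), dtype=np.float32)
    weight = np.zeros((H, W), dtype=np.float32)
    hann2d = np.outer(np.hanning(win), np.hanning(win)).astype(np.float32)
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            s = float(score_patch(image[y:y + win, x:x + win]))  # scalar synthetic probability
            heat[y:y + win, x:x + win] += s * hann2d
            weight[y:y + win, x:x + win] += hann2d
    return heat / np.clip(weight, 1e-6, None)                    # normalized dense heatmap
```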

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes EDGER, a patch-based dual-branch framework for localizing manipulated regions in arbitrary-resolution images. The Edge-Guided Segmentation branch uses a Frequency-based Edge Detector to highlight high-frequency boundary inconsistencies and fine-tunes SegFormer to fuse RGB and edge features into pixel-level masks. The Synthetic Heatmapping branch fine-tunes a CLIP-ViT encoder with LoRA to classify fully synthetic patches and supply coarse priors. These branches are combined to produce resolution-agnostic localization, with the central claim being strong cross-domain generalization on the MediaEval 2025 SynthIM challenge.

Significance. If the generalization claims hold with supporting metrics, the work could advance forgery localization by addressing text-guided inpainting challenges through complementary edge and synthetic cues, offering a practical solution for multi-megapixel imagery without resolution downsampling.

major comments (2)
  1. [§3.2] §3.2 (Synthetic Heatmapping branch): The premise that the CLIP-ViT LoRA branch reliably flags fully synthetic patches under domain shift, while SegFormer sharpens boundaries only in mixed patches, is asserted without quantitative measures of branch agreement, conflict rates, or failure cases on held-out domains (e.g., new inpainting models or compression); this is load-bearing for the cross-domain generalization claim.
  2. [§4] §4 (Evaluation): The abstract states that the method 'exhibits strong cross-domain generalization' and 'scales to multi-megapixel imagery' with 'extensive ablations,' yet no performance metrics, baseline comparisons, ablation tables, or results on held-out domains are referenced; without these, the central claims cannot be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have addressed the major comments point by point below, making revisions to improve the clarity of our contributions and results.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Synthetic Heatmapping branch): The premise that the CLIP-ViT LoRA branch reliably flags fully synthetic patches under domain shift, while SegFormer sharpens boundaries only in mixed patches, is asserted without quantitative measures of branch agreement, conflict rates, or failure cases on held-out domains (e.g., new inpainting models or compression); this is load-bearing for the cross-domain generalization claim.

    Authors: We concur that quantitative validation of the branch-specific behaviors under domain shift is important for supporting the generalization claims. Accordingly, we have revised §3.2 to include metrics on branch agreement, conflict rates, and failure case analysis using held-out domains from the challenge, including variations in inpainting models and compression. The added results indicate low conflict rates and high reliability of the Synthetic Heatmapping branch in flagging fully synthetic patches, thereby reinforcing the dual-branch design. revision: yes

  2. Referee: [§4] §4 (Evaluation): The abstract states that the method 'exhibits strong cross-domain generalization' and 'scales to multi-megapixel imagery' with 'extensive ablations,' yet no performance metrics, baseline comparisons, ablation tables, or results on held-out domains are referenced; without these, the central claims cannot be assessed.

    Authors: We appreciate this observation. While Section 4 presents the evaluation results on the MediaEval 2025 SynthIM challenge, including metrics, baselines, ablations, and cross-domain performance, we acknowledge that the abstract and main text could reference them more explicitly. In the revised manuscript, we have updated the abstract and expanded §4 with clearer references to the tables and figures, as well as additional details on scalability to multi-megapixel images and held-out domains. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive dual-branch engineering framework with no derivation chain

full rationale

The paper describes a patch-based dual-branch architecture (Edge-Guided Segmentation using a Frequency-based Edge Detector plus fine-tuned SegFormer, complemented by Synthetic Heatmapping via LoRA-tuned CLIP-ViT) for forgery localization. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Claims of cross-domain generalization and branch complementarity rest on empirical design and ablations rather than any reduction to inputs by construction. This is a standard non-circular engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution is a new architectural combination rather than new mathematical axioms or entities; it builds on existing deep learning components.

axioms (1)
  • domain assumption: Fine-tuning pre-trained vision transformers such as SegFormer and CLIP is effective for detecting image manipulations.
    The paper relies on this without providing supporting evidence or comparisons in the abstract.

pith-pipeline@v0.9.0 · 5549 in / 1239 out tokens · 66151 ms · 2026-05-13T07:19:44.643837+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 3 internal anchors

  1. [1]

    MediaEval 2025 synthetic images detection challenge: Advancing detection of generative AI used in real-world online images. https://github.com/mever-team/mediaeval2025-sid

  2. [2]

    Synthetic images: Advancing detection of generative AI used in real-world online images (2025). https://multimediaeval.github.io/editions/2025/task/synthim/

  3. [3]

    Cozzolino, D., Verdoliva, L.: Noiseprint: A CNN-based camera model fingerprint. IEEE Transactions on Information Forensics and Security 15, 144–159 (2018). https://api.semanticscholar.org/CorpusID:52097748

  4. [4]

    Dong, C., Chen, X., Hu, R., Cao, J., Li, X.: MVSS-Net: Multi-view multi-scale supervised networks for image manipulation detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(3), 3539–3553 (2022)

  5. [5]

    Giakoumoglou, P., Karageorgiou, D., Papadopoulos, S., Petrantonakis, P.C.: SAGI: Semantically aligned and uncertainty guided AI image inpainting (2025). https://arxiv.org/abs/2502.06593

  6. [6]

    Guillaro, F., Cozzolino, D., Sud, A., Dufour, N., Verdoliva, L.: TruFor: Leveraging all-round clues for trustworthy image forgery detection and localization. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20606–20615. IEEE Computer Society, Los Alamitos, CA, USA (Jun 2023). https://doi.org/10.1109/CVPR52729.2023....

  7. [7]

    Guo, X., Liu, X., Ren, Z., Grosz, S., Masi, I., Liu, X.: Hierarchical fine-grained image forgery detection and localization. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3155–3165 (2023). https://doi.org/10.1109/CVPR52729.2023.00308

  8. [8]

    Harris, F.J.: On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE 66(1), 51–83 (1978)

  9. [9]

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models (2021). https://arxiv.org/abs/2106.09685

  10. [10]

    Karageorgiou, D., Papadopoulos, S., Kompatsiaris, I., Gavves, E.: Any-resolution AI-generated image detection by spectral learning. pp. 18706–18717 (Jun 2025). https://doi.org/10.1109/CVPR52734.2025.01743

  11. [11]

    Konstantinidou, D., Koutlis, C., Papadopoulos, S.: TextureCrop: Enhancing synthetic image detection through texture-based cropping. pp. 1369–1378 (Feb 2025). https://doi.org/10.1109/WACVW65960.2025.00160

  12. [12]

    Koutlis, C., Papadopoulos, S.: Leveraging representations from intermediate encoder-blocks for synthetic image detection. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXII, pp. 394–411. Springer-Verlag, Berlin, Heidelberg (2024). https://doi.org/10.1007/978-3-031-73220-1_23

  13. [13]

    Kwon, M.J., Yu, I.J., Nam, S.H., Lee, H.K.: CAT-Net: Compression artifact tracing network for detection and localization of image splicing. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 375–384 (2021)

  14. [14]

    Ma, X., Du, B., Jiang, Z., Hammadi, A.Y.A., Zhou, J.: IML-ViT: Benchmarking image manipulation localization by vision transformer. arXiv preprint arXiv:2307.14863 (2023)

  15. [15]

    Ma, X., Zhu, X., Su, L., Du, B., Jiang, Z., Tong, B., Lei, Z., Yang, X., Pun, C.M., Lv, J., et al.: IMDL-BenCo: A comprehensive benchmark and codebase for image manipulation detection & localization. Advances in Neural Information Processing Systems 37, 134591–134613 (2024)

  16. [16]

    Mareen, H., Karageorgiou, D., Wallendael, G.V., Lambert, P., Papadopoulos, S.: TGIF: Text-guided inpainting forgery dataset. In: 2024 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6 (2024). https://doi.org/10.1109/WIFS61860.2024.10810690

  17. [17]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021). https://arxiv.org/abs/2103.00020

  18. [18]

    Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation (2015). https://arxiv.org/abs/1505.04597

  19. [19]

    Wang, Z., Bao, J., Zhou, W., Wang, W., Hu, H., Chen, H., Li, H.: DIRE for diffusion-generated image detection. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22388–22398 (2023). https://doi.org/10.1109/ICCV51070.2023.02051

  20. [20]

    Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers (Dec 2021)

  21. [21]

    Yu, Z., Ni, J., Lin, Y., Deng, H., Li, B.: DiffForensics: Leveraging diffusion prior to image forgery detection and localization. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12765–12774 (2024). https://doi.org/10.1109/CVPR52733.2024.01213