pith. machine review for the scientific record.

arxiv: 2605.12002 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization

Minh-Hoang Le, Minh-Khoa Le-Phan, Minh-Triet Tran, Trong-Le Do

Pith reviewed 2026-05-13 07:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords image forgery localization · edge-guided segmentation · synthetic heatmap · cross-domain generalization · frequency edge detection · patch-based localization · multi-resolution processing

The pith

A dual-branch network localizes image forgeries at native resolution by fusing frequency edge cues with patch-level synthetic priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a patch-based method to find manipulated regions in images that have been edited with realistic text-guided inpainting. One branch detects high-frequency inconsistencies at manipulation boundaries using a frequency edge detector and a fine-tuned segmentation model on RGB plus edge features. The second branch uses a fine-tuned vision transformer to classify entire patches as fully synthetic, supplying coarse location hints. These two signals complement each other so the system can handle images of any size without downsampling and still work across different editing styles or datasets. The result is claimed to be accurate localization that stays resolution-agnostic.
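
Figure 1 states that the two branch outputs are combined by pixel-wise max. A minimal sketch of that fusion step follows, assuming both branches already emit per-pixel probabilities at the image's native resolution; the function and argument names are illustrative, not taken from the authors' code.

```python
import numpy as np

def fuse_masks(egs_prob: np.ndarray, sh_heatmap: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Fuse Edge-Guided Segmentation probabilities with Synthetic Heatmapping
    scores by pixel-wise max, then binarize into a forgery localization mask.

    egs_prob   : (H, W) per-pixel manipulation probability from the EGS branch.
    sh_heatmap : (H, W) per-pixel synthetic score from the SH branch.
    """
    assert egs_prob.shape == sh_heatmap.shape, "branch outputs must share the native resolution"
    fused = np.maximum(egs_prob, sh_heatmap)      # pixel-wise max, as described in Fig. 1
    return (fused >= threshold).astype(np.uint8)  # binary localization mask
```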

Core claim

The central claim is that a dual-branch architecture, consisting of an Edge-Guided Segmentation module that emphasizes manipulation boundaries via frequency analysis and a Synthetic Heatmapping module that flags fully synthetic patches, produces comprehensive forgery masks for arbitrary-resolution images while maintaining strong performance when the test images come from domains different from the training data.

What carries the argument

The dual-branch EDGER framework, in which frequency-based edge features sharpen boundaries inside mixed patches while patch-level synthetic classification supplies coarse location priors.
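
Figure 2 lists the fixed inputs to the Frequency-based Edge Detector: high-pass residuals plus Sobel and Laplacian gradients computed from RGB. The sketch below only assembles those fixed-filter cues; the learned multi-branch fusion head, SRM-like residuals, and multi-scale supervision described in the paper are omitted, and the filter choices (e.g. the Gaussian sigma) are illustrative.

```python
import numpy as np
from scipy import ndimage

def edge_cue_stack(rgb: np.ndarray) -> np.ndarray:
    """Illustrative fixed-filter cues for a frequency-based edge detector:
    Sobel gradient magnitude, Laplacian response, and a high-pass residual.

    rgb: (H, W, 3) float image in [0, 1]. Returns an (H, W, 3) cue stack.
    """
    gray = rgb.mean(axis=-1)
    gx = ndimage.sobel(gray, axis=1)
    gy = ndimage.sobel(gray, axis=0)
    sobel_mag = np.hypot(gx, gy)                                 # first-order edge strength
    laplacian = ndimage.laplace(gray)                            # second-order edge response
    highpass = gray - ndimage.gaussian_filter(gray, sigma=2.0)   # high-frequency residual
    return np.stack([sobel_mag, laplacian, highpass], axis=-1)
```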

If this is right

  • The method processes multi-megapixel images at their original size instead of requiring downsampling (a tiling sketch follows this list).
  • Localization accuracy improves when edge cues and synthetic patch scores are combined rather than used separately.
  • The approach supports cross-domain testing without retraining for each new image source or manipulation type.
  • Patch-level decisions allow the system to flag fully synthetic regions even when boundary edges are weak.
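
As referenced in the first bullet, the native-resolution handling described in Figure 1 (non-overlapping patches, padded to sizes divisible by 32, stitched back to full resolution) can be sketched as follows. `predict_patch`, the patch size, and the padding mode are placeholders, not the authors' implementation.

```python
import numpy as np

def localize_full_resolution(image: np.ndarray, predict_patch, patch: int = 512) -> np.ndarray:
    """Tile an arbitrary-resolution image into non-overlapping patches, pad each
    patch so both sides are divisible by 32 (a SegFormer-style stride constraint),
    score each patch, and stitch the crops back at native resolution.

    predict_patch: placeholder for the EGS branch, (h, w, 3) -> (h, w) scores.
    """
    H, W = image.shape[:2]
    out = np.zeros((H, W), dtype=np.float32)
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            crop = image[y:y + patch, x:x + patch]
            h, w = crop.shape[:2]
            ph, pw = (-h) % 32, (-w) % 32                            # pad to next multiple of 32
            padded = np.pad(crop, ((0, ph), (0, pw), (0, 0)), mode="edge")
            out[y:y + h, x:x + w] = predict_patch(padded)[:h, :w]   # drop the padded border
    return out
```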

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same edge-plus-classification split could be tested on video forgery detection by treating frames as patches.
  • If the frequency edge detector is replaced with a learned edge model, the framework might adapt to smoother manipulation styles.
  • The coarse-to-fine refinement pattern might apply to other localization tasks such as medical image anomaly detection.

Load-bearing premise

Edge information is useful mainly in patches that mix real and fake pixels, and fine-tuning the segmentation and classification models will produce reliable results on unseen image domains.

What would settle it

Running the system on a set of images whose manipulations have been smoothed to remove high-frequency boundary signals, or on a new editing domain where both branches fail to flag the altered regions, would show whether the localization remains accurate.
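
One way such a probe could be constructed, assuming ground-truth manipulation masks are available: blur a narrow band around each forgery boundary so that the high-frequency splicing evidence the edge branch relies on is suppressed. The band width and blur strength below are arbitrary illustration values, not settings from the paper.

```python
import numpy as np
from scipy import ndimage

def smooth_manipulation_boundary(image: np.ndarray, mask: np.ndarray,
                                 sigma: float = 3.0, band: int = 5) -> np.ndarray:
    """Blend a blurred copy of the image back in along a narrow band around the
    manipulation boundary, erasing sharp splicing edges.

    image: (H, W, 3) float array; mask: (H, W) binary manipulation mask.
    """
    m = mask.astype(bool)
    dilated = ndimage.binary_dilation(m, iterations=band)
    eroded = ndimage.binary_erosion(m, iterations=band)
    boundary_band = dilated & ~eroded                        # pixels near the forgery edge
    blurred = ndimage.gaussian_filter(image, sigma=(sigma, sigma, 0))
    out = image.copy()
    out[boundary_band] = blurred[boundary_band]              # suppress high-frequency evidence
    return out
```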

Figures

Figures reproduced from arXiv: 2605.12002 by Minh-Hoang Le, Minh-Khoa Le-Phan, Minh-Triet Tran, Trong-Le Do.

Figure 1
Figure 1. Overall localization pipeline. A patch-based dual-branch design: (i) the EGS branch injects a frequency-based edge prior (Section 3.1) into a SegFormer decoder; non-overlapping patches, padded to sizes divisible by 32, are stitched to full resolution. (ii) The SH branch applies a CLIP-ViT classifier over sliding windows with Hann blending to form a dense heatmap. The two outputs are fused by pixel-wise max… view at source ↗
Figure 2
Figure 2. Pipeline of the Frequency-based Edge Detector. The network extracts high-pass and SRM-like residuals, computes fixed Sobel and Laplacian gradients from RGB, and fuses them through a multi-branch convolutional head composed of residual and dilated context blocks with global pooling. The head integrates frequency, gradient, and structural cues to predict soft edge probability maps supervised by multi-scale t… view at source ↗
Figure 3
Figure 3. Predicted edge maps and their augmented counterparts used to simulate challenging test-time conditions. Augmentations lightly degrade or fragment boundaries (local background mixing, temperature softening, Gaussian blur, segment breaks, band-limited noise) to improve robustness. view at source ↗
Figure 4
Figure 4. Each group of five images shows, from left to right, the EGS mask, the SH mask, the fused mask, the ground-truth mask, and the original image. In (a)–(c), showing large images, SH supplies strong region cues that complement EGS. In (d), showing a small image, EGS captures fine details and performs well even without SH. The fused output combines these complementary strengths. view at source ↗
Figure 5
Figure 5. Each triplet shows, from left to right, the SH mask, the ground-truth mask, and the original image. In (a)–(b), showing large images, SH captures coarse regions but misses fine boundaries and small structures. In (c)–(d), showing small images, SH produces blurry responses and overlooks many manipulated pixels. This shows that SH alone is insufficient without complementary edge guidance and fusion. view at source ↗
read the original abstract

Text-guided inpainting has made image forgery increasingly realistic, challenging both synthetic image detection (SID) and image forgery localization (IFL). However, existing methods often struggle to point out suspicious signals across domains. To address this problem, we propose EDGER, a patch-based, dual-branch framework that localizes manipulated regions in arbitrary-resolution images without sacrificing native resolution. The first branch, Edge-Guided Segmentation, introduces a Frequency-based Edge Detector to emphasize high-frequency inconsistencies at manipulation boundaries, and fine-tunes a SegFormer to fuse RGB and edge features for pixel-level masks. Since edge evidence is most informative only when patches contain both authentic and manipulated pixels, we complement Edge-Guided Segmentation with a Synthetic Heatmapping branch, a classification-based localizer that fine-tunes a CLIP-ViT image encoder with LoRA to flag fully synthetic patches. Together, Synthetic Heatmapping provides coarse, patch-level synthetic priors, while Edge-Guided Segmentation sharpens boundaries within partially manipulated patches, yielding comprehensive localization. Evaluated in the setting of the MediaEval 2025 SynthIM challenge's Manipulated Region Localization Task, our approach scales to multi-megapixel imagery and exhibits strong cross-domain generalization. Extensive ablations highlight the complementary roles of frequency-based edge cues and patch-level synthetic priors in driving accurate, resolution-agnostic localization.
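
Figure 1 describes the Synthetic Heatmapping branch as a sliding-window CLIP-ViT classifier whose patch scores are blended with a Hann window into a dense heatmap. The sketch below shows only that blending step; `score_patch` stands in for the LoRA-tuned classifier, and the window size and stride are assumptions rather than the paper's settings.

```python
import numpy as np

def synthetic_heatmap(image: np.ndarray, score_patch, win: int = 224, stride: int = 112) -> np.ndarray:
    """Slide a window over the image, score each crop with a patch-level
    classifier (probability the crop is fully synthetic), and blend the scores
    into a dense heatmap with a 2-D Hann window that down-weights crop borders.
    """
    H, W = image.shape[:2]
    assert H >= win and W >= win, "sketch assumes the image is at least one window in size"
    heat = np.zeros((H, W), dtype=np.float32)
    weight = np.zeros((H, W), dtype=np.float32)
    hann2d = np.outer(np.hanning(win), np.hanning(win)).astype(np.float32)
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            s = float(score_patch(image[y:y + win, x:x + win]))  # scalar synthetic probability
            heat[y:y + win, x:x + win] += s * hann2d
            weight[y:y + win, x:x + win] += hann2d
    return heat / np.clip(weight, 1e-6, None)                    # normalized dense heatmap
```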

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes EDGER, a patch-based dual-branch framework for localizing manipulated regions in arbitrary-resolution images. The Edge-Guided Segmentation branch uses a Frequency-based Edge Detector to highlight high-frequency boundary inconsistencies and fine-tunes SegFormer to fuse RGB and edge features into pixel-level masks. The Synthetic Heatmapping branch fine-tunes a CLIP-ViT encoder with LoRA to classify fully synthetic patches and supply coarse priors. These branches are combined to produce resolution-agnostic localization, with the central claim being strong cross-domain generalization on the MediaEval 2025 SynthIM challenge.

Significance. If the generalization claims hold with supporting metrics, the work could advance forgery localization by addressing text-guided inpainting challenges through complementary edge and synthetic cues, offering a practical solution for multi-megapixel imagery without resolution downsampling.

major comments (2)
  1. [§3.2] §3.2 (Synthetic Heatmapping branch): The premise that the CLIP-ViT LoRA branch reliably flags fully synthetic patches under domain shift, while SegFormer sharpens boundaries only in mixed patches, is asserted without quantitative measures of branch agreement, conflict rates, or failure cases on held-out domains (e.g., new inpainting models or compression); this is load-bearing for the cross-domain generalization claim.
  2. [§4] §4 (Evaluation): The abstract states that the method 'exhibits strong cross-domain generalization' and 'scales to multi-megapixel imagery' with 'extensive ablations,' yet no performance metrics, baseline comparisons, ablation tables, or results on held-out domains are referenced; without these, the central claims cannot be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have addressed the major comments point by point below, making revisions to improve the clarity of our contributions and results.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Synthetic Heatmapping branch): The premise that the CLIP-ViT LoRA branch reliably flags fully synthetic patches under domain shift, while SegFormer sharpens boundaries only in mixed patches, is asserted without quantitative measures of branch agreement, conflict rates, or failure cases on held-out domains (e.g., new inpainting models or compression); this is load-bearing for the cross-domain generalization claim.

    Authors: We concur that quantitative validation of the branch-specific behaviors under domain shift is important for supporting the generalization claims. Accordingly, we have revised §3.2 to include metrics on branch agreement, conflict rates, and failure case analysis using held-out domains from the challenge, including variations in inpainting models and compression. The added results indicate low conflict rates and high reliability of the Synthetic Heatmapping branch in flagging fully synthetic patches, thereby reinforcing the dual-branch design. revision: yes

  2. Referee: [§4] §4 (Evaluation): The abstract states that the method 'exhibits strong cross-domain generalization' and 'scales to multi-megapixel imagery' with 'extensive ablations,' yet no performance metrics, baseline comparisons, ablation tables, or results on held-out domains are referenced; without these, the central claims cannot be assessed.

    Authors: We appreciate this observation. While Section 4 presents the evaluation results on the MediaEval 2025 SynthIM challenge, including metrics, baselines, ablations, and cross-domain performance, we acknowledge that the abstract and main text could reference them more explicitly. In the revised manuscript, we have updated the abstract and expanded §4 with clearer references to the tables and figures, as well as additional details on scalability to multi-megapixel images and held-out domains. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive dual-branch engineering framework with no derivation chain

full rationale

The paper describes a patch-based dual-branch architecture (Edge-Guided Segmentation using a Frequency-based Edge Detector plus fine-tuned SegFormer, complemented by Synthetic Heatmapping via LoRA-tuned CLIP-ViT) for forgery localization. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Claims of cross-domain generalization and branch complementarity rest on empirical design and ablations rather than any reduction to inputs by construction. This is a standard non-circular engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution is a new architectural combination rather than new mathematical axioms or entities; it builds on existing deep learning components.

axioms (1)
  • domain assumption: Fine-tuning pre-trained vision transformers such as SegFormer and CLIP is effective for detecting image manipulations.
    The paper relies on this without providing supporting evidence or comparisons in the abstract.

pith-pipeline@v0.9.0 · 5549 in / 1239 out tokens · 66151 ms · 2026-05-13T07:19:44.643837+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 3 internal anchors

  1. [1]

    MediaEval 2025 synthetic images detection challenge: Advancing detection of generative AI used in real-world online images. https://github.com/mever-team/mediaeval2025-sid

  2. [2]

    Synthetic images: Advancing detection of generative AI used in real-world online images (2025). https://multimediaeval.github.io/editions/2025/task/synthim/

  3. [3]

    Cozzolino, D., Verdoliva, L.: Noiseprint: A CNN-based camera model fingerprint. IEEE Transactions on Information Forensics and Security 15, 144–159 (2018). https://api.semanticscholar.org/CorpusID:52097748

  4. [4]

    Dong, C., Chen, X., Hu, R., Cao, J., Li, X.: MVSS-Net: Multi-view multi-scale supervised networks for image manipulation detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(3), 3539–3553 (2022)

  5. [5]

    Giakoumoglou, P., Karageorgiou, D., Papadopoulos, S., Petrantonakis, P.C.: SAGI: Semantically aligned and uncertainty guided AI image inpainting (2025). https://arxiv.org/abs/2502.06593

  6. [6]

    Guillaro, F., Cozzolino, D., Sud, A., Dufour, N., Verdoliva, L.: TruFor: Leveraging all-round clues for trustworthy image forgery detection and localization. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20606–20615. IEEE Computer Society, Los Alamitos, CA, USA (Jun 2023). https://doi.org/10.1109/CVPR52729.2023....

  7. [7]

    Guo, X., Liu, X., Ren, Z., Grosz, S., Masi, I., Liu, X.: Hierarchical fine-grained image forgery detection and localization. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3155–3165 (2023). https://doi.org/10.1109/CVPR52729.2023.00308

  8. [8]

    Harris, F.J.: On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE 66(1), 51–83 (1978)

  9. [9]

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models (2021). https://arxiv.org/abs/2106.09685

  10. [10]

    Karageorgiou, D., Papadopoulos, S., Kompatsiaris, I., Gavves, E.: Any-resolution AI-generated image detection by spectral learning. pp. 18706–18717 (Jun 2025). https://doi.org/10.1109/CVPR52734.2025.01743

  11. [11]

    Konstantinidou, D., Koutlis, C., Papadopoulos, S.: TextureCrop: Enhancing synthetic image detection through texture-based cropping. pp. 1369–1378 (Feb 2025). https://doi.org/10.1109/WACVW65960.2025.00160

  12. [12]

    Koutlis, C., Papadopoulos, S.: Leveraging representations from intermediate encoder-blocks for synthetic image detection. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXII, pp. 394–411. Springer-Verlag, Berlin, Heidelberg (2024). https://doi.org/10.1007/978-3-031-73220-1_23

  13. [13]

    Kwon, M.J., Yu, I.J., Nam, S.H., Lee, H.K.: CAT-Net: Compression artifact tracing network for detection and localization of image splicing. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 375–384 (2021)

  14. [14]

    Ma, X., Du, B., Jiang, Z., Hammadi, A.Y.A., Zhou, J.: IML-ViT: Benchmarking image manipulation localization by vision transformer. arXiv preprint arXiv:2307.14863 (2023)

  15. [15]

    Ma, X., Zhu, X., Su, L., Du, B., Jiang, Z., Tong, B., Lei, Z., Yang, X., Pun, C.M., Lv, J., et al.: IMDL-BenCo: A comprehensive benchmark and codebase for image manipulation detection & localization. Advances in Neural Information Processing Systems 37, 134591–134613 (2024)

  16. [16]

    Mareen, H., Karageorgiou, D., Wallendael, G.V., Lambert, P., Papadopoulos, S.: TGIF: Text-guided inpainting forgery dataset. In: 2024 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6 (2024). https://doi.org/10.1109/WIFS61860.2024.10810690

  17. [17]

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021). https://arxiv.org/abs/2103.00020

  18. [18]

    Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation (2015). https://arxiv.org/abs/1505.04597

  19. [19]

    Wang, Z., Bao, J., Zhou, W., Wang, W., Hu, H., Chen, H., Li, H.: DIRE for diffusion-generated image detection. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22388–22398 (2023). https://doi.org/10.1109/ICCV51070.2023.02051

  20. [20]

    Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers (Dec 2021)

  21. [21]

    Yu, Z., Ni, J., Lin, Y., Deng, H., Li, B.: DiffForensics: Leveraging diffusion prior to image forgery detection and localization. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12765–12774 (2024). https://doi.org/10.1109/CVPR52733.2024.01213