pith. machine review for the scientific record.

arxiv: 2605.09089 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link · Lean Theorem

Field-Localized Forgery Detection for Digital Identity Documents

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:25 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords forgery detection · identity documents · field localization · lightweight neural network · digital forensics · face and text fields · manipulation detection · AUC/EER

The pith

Targeting face and text fields in identity documents detects forgeries more accurately and with far less computation than full-image methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FLiD, a framework that first localizes face and text fields in digital identity documents with an object detector, then extracts embeddings from those regions only and classifies them for forgery. This targeted processing yields AUC scores of 0.880 on face attacks, 0.954 on text attacks, and 0.923 on combined attacks, with corresponding EER reductions of 29-35 percentage points over full-document baselines. The approach also beats general manipulation detectors while using 13 times fewer parameters and 21 times fewer FLOPs. This matters because remote identity verification depends on these documents and is vulnerable to localized edits that standard full-image methods miss.

Core claim

FLiD localizes face and text fields using a fine-tuned object detector, extracts compact field-level embeddings with a frozen MobileNetV3-Small backbone, and classifies them via a lightweight neural network containing only 191K trainable parameters. On face, text, and both-field attack scenarios it records AUCs of 0.880, 0.954, and 0.923 with EERs of 18.05%, 11.61%, and 15.16%. These EERs represent absolute reductions of 29-35 percentage points over a full-document model trained from scratch, and FLiD outperforms general-purpose detectors such as TruFor, MMFusion, and UniVAD while requiring 13x fewer parameters and 21x fewer FLOPs.
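The EERs quoted above are operating points where the false accept rate on forged fields equals the false reject rate on genuine ones. As a minimal, library-free sketch of how such a number comes out of raw classifier scores (the paper's exact thresholding protocol is not specified in this review):

```python
def compute_eer(genuine, attack):
    """Equal Error Rate: sweep thresholds and return the point where the
    false-accept rate (forged fields scored as genuine) and false-reject
    rate (genuine fields scored as forged) are closest; higher score
    means 'more likely genuine'."""
    best_gap, eer = 2.0, 1.0
    for t in sorted(set(genuine) | set(attack)):
        far = sum(s >= t for s in attack) / len(attack)
        frr = sum(s < t for s in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# perfectly separated scores give an EER of 0
print(compute_eer([0.95, 0.90, 0.80], [0.20, 0.10, 0.05]))  # 0.0
```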

What carries the argument

The two-stage field-localized pipeline that uses object detection to isolate face and text regions, then applies a frozen backbone and small classifier exclusively to those regions.
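That two-stage structure can be sketched as follows. Every function below is a stand-in stub for one of the paper's components (the fine-tuned YOLOv8 detector, the frozen MobileNetV3-Small backbone, the 191K-parameter head), so the shapes, boxes, and return values are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str    # "face" or "text"
    box: tuple   # (x1, y1, x2, y2) in pixels

def detect_fields(image):
    # stub: a real detector returns one box per identity field
    return [Field("face", (10, 10, 60, 60)), Field("text", (10, 70, 200, 90))]

def embed_crop(image, box):
    # stub: a frozen backbone maps the cropped field to a fixed-length vector
    return [0.0] * 576

def classify(embedding):
    # stub: the lightweight head emits a forgery score for one field
    return 0.5

def flid_score(image):
    # field-localized scoring: only the detected fields are processed,
    # never the full document image
    return {f.name: classify(embed_crop(image, f.box))
            for f in detect_fields(image)}

print(flid_score(image=None))  # {'face': 0.5, 'text': 0.5}
```

The point of the structure is that the classifier never sees pixels outside the detected boxes, which is where both the accuracy and the FLOP savings come from.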

If this is right

  • Forgery detection accuracy improves specifically for manipulations confined to the identity photo or textual data.
  • Computational cost drops enough to support real-time checks on edge hardware.
  • The same localized strategy outperforms off-the-shelf general forgery detectors on structured documents.
  • Only the critical identity fields need processing rather than the entire document image.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the localization step remains reliable across new document layouts, the method could extend to passports, driving licences, and other structured identity media.
  • The low parameter count opens the possibility of on-device deployment without sending images to the cloud.
  • Further gains may appear by adding localization for additional fields such as signatures or barcodes.
  • The separation of localization from classification makes it easier to swap in improved detectors without retraining the entire forgery classifier.

Load-bearing premise

The object detector accurately localizes face and text fields even after those fields have been forged or altered.

What would settle it

Performance measured on a held-out set of identity documents from an unseen issuer or containing forgery types absent from training, especially cases where field localization accuracy falls below the level seen in the reported experiments.
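Any such held-out evaluation would report the same ranking metric used throughout: ROC AUC, which equals the probability that a randomly chosen genuine sample scores above a randomly chosen forged one. A minimal rank-sum computation (not tied to the paper's tooling):

```python
def auc(genuine, attack):
    """ROC AUC via the rank identity: the probability that a random
    genuine score outranks a random attack score, ties counting half."""
    wins = sum(
        1.0 if g > a else 0.5 if g == a else 0.0
        for g in genuine for a in attack
    )
    return wins / (len(genuine) * len(attack))

print(auc([0.9, 0.8], [0.2, 0.1]))            # 1.0 for perfect separation
print(auc([0.9, 0.8, 0.7], [0.8, 0.2, 0.1]))  # 7.5/9 with one overlap
```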

Figures

Figures reproduced from arXiv: 2605.09089 by Abhishek Kumar, Carsten Maple, Mark Hooper, Riya Tapwal.

Figure 1. Examples of localized forgery in digital identity documents across face, text, and combined manipulation scenarios.
Figure 2. Overview of the proposed identity-aware forgery detection system. The pipeline first localizes identity fields using YOLOv8, extracts field-level embeddings using MobileNetV3, and performs forgery classification using a lightweight classifier.
Figure 3. Equal Error Rate (EER) analysis for FLiD and the baseline across face, text, and both-field attack scenarios.
Figure 4. Comparison of FLiD and the baseline across face, text, and both-field attack scenarios. (a) ROC curves illustrate ranking performance. (b) Score distributions show class separability and overlap.
Original abstract

Digital identity verification systems used in remote onboarding rely on document images to authenticate users, making them vulnerable to localized manipulations of key identity fields such as facial photographs and textual information. Existing forgery detection methods, developed primarily for natural-image forensics, show limited transferability to structured identity documents. We propose FLiD, a lightweight field-localized framework that targets critical identity regions rather than processing full-document images. A fine-tuned object detector first localizes face and text fields; a frozen MobileNetV3-Small backbone then extracts compact field-level embeddings, which are classified by a lightweight neural network with only 191K trainable parameters. FLiD achieves AUC scores of 0.880 (face), 0.954 (text), and 0.923 (both-field attacks), with corresponding EERs of 18.05%, 11.61%, and 15.16%, representing absolute reductions of 29-35 percentage points over a full-document baseline trained from scratch. FLiD also consistently outperforms general-purpose manipulation detectors (TruFor, MMFusion, UniVAD) across all attack scenarios while requiring 13x fewer parameters and 21x fewer FLOPs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes FLiD, a lightweight field-localized forgery detection framework for digital identity documents. A fine-tuned object detector localizes face and text fields; embeddings are then extracted from these regions using a frozen MobileNetV3-Small backbone and classified by a small neural network (191K trainable parameters). The method reports AUC scores of 0.880 (face attacks), 0.954 (text attacks), and 0.923 (both-field attacks) with EERs of 18.05%, 11.61%, and 15.16%, claiming absolute gains of 29-35 points over a full-document baseline and consistent outperformance of general-purpose detectors (TruFor, MMFusion, UniVAD) at 13x fewer parameters and 21x fewer FLOPs.

Significance. If the localization step remains reliable on forged inputs, FLiD demonstrates a practical advance for structured-document forensics by focusing computation on identity-critical regions rather than full images. The efficiency numbers and direct comparisons to both task-specific baselines and general manipulation detectors are concrete strengths; the approach could inform deployment in remote identity verification systems where compute and data constraints matter.

major comments (1)
  1. [Experiments] Experiments section: the paper provides no mAP, IoU, precision-recall, or failure-mode statistics for the fine-tuned object detector on forged or manipulated documents. Because the pipeline extracts MobileNetV3 embeddings only from the detector's crops, any degradation in localization accuracy caused by splicing boundaries, texture mismatches, or altered aspect ratios would invalidate the reported AUC/EER gains over the full-document baseline. An ablation isolating localization error from classification error is also absent.
minor comments (2)
  1. [Abstract] Abstract and §3: the EER and AUC figures are presented without confidence intervals or statistical significance tests against the baselines; adding these would strengthen the performance claims.
  2. [Method] The description of the lightweight classifier (191K parameters) would benefit from an explicit layer-by-layer parameter count or diagram to allow direct reproduction.
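The intervals asked for in minor comment 1 are most commonly obtained with a percentile bootstrap over the evaluation scores. A minimal sketch, where the statistic and the per-fold values are placeholders rather than the paper's numbers or protocol:

```python
import random

def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    boots = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    return boots[int(alpha / 2 * n_boot)], boots[int((1 - alpha / 2) * n_boot) - 1]

# placeholder per-fold AUCs, NOT the paper's results
fold_aucs = [0.88, 0.91, 0.87, 0.93, 0.90]
mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(fold_aucs, mean)
print(f"mean {mean(fold_aucs):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```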

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need to validate the localization step, which is foundational to the FLiD pipeline. We address the concern point by point below and will revise the manuscript to incorporate the suggested analyses.

Point-by-point responses
  1. Referee: Experiments section: the paper provides no mAP, IoU, precision-recall, or failure-mode statistics for the fine-tuned object detector on forged or manipulated documents. Because the pipeline extracts MobileNetV3 embeddings only from the detector's crops, any degradation in localization accuracy caused by splicing boundaries, texture mismatches, or altered aspect ratios would invalidate the reported AUC/EER gains over the full-document baseline. An ablation isolating localization error from classification error is also absent.

    Authors: We agree that the absence of these metrics leaves an important aspect of the pipeline unverified. In the revised manuscript we will report mAP, IoU, precision-recall curves, and qualitative failure-mode analysis for the fine-tuned object detector evaluated separately on pristine and forged document images. We will also add an ablation that replaces the detector's predicted crops with ground-truth bounding boxes and recomputes the downstream AUC/EER; the difference between the two settings will directly quantify the contribution of localization error. These additions will confirm that localization remains sufficiently accurate on manipulated inputs and that the reported gains over the full-document baseline are not artifacts of detector failure. revision: yes
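The localization metrics promised in this response (mAP, IoU on pristine versus forged inputs) all build on box overlap. A minimal intersection-over-union computation for the (x1, y1, x2, y2) box convention, independent of any detector library:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0 (identical boxes)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 1/3 (half-width overlap)
```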

Circularity Check

0 steps flagged

No circularity: empirical framework with external baselines

Full rationale

The paper describes an empirical pipeline (fine-tuned detector + frozen MobileNetV3 + lightweight classifier) and reports AUC/EER metrics obtained from direct evaluation on attack scenarios. These metrics are compared against independently published external detectors (TruFor, MMFusion, UniVAD) and a full-document baseline trained from scratch. No equations, predictions, or uniqueness claims appear; performance numbers are not derived from fitted parameters that are then re-labeled as predictions, nor do they reduce to self-citations or self-definitions. The central claims rest on experimental outcomes rather than any tautological reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach builds on standard computer vision techniques with empirical performance claims; no new physical entities or unproven mathematical axioms are introduced.

free parameters (1)
  • Number of trainable parameters (191K) = 191000
    The lightweight classifier has 191K trainable parameters, chosen to keep the model small.
axioms (1)
  • domain assumption Pre-trained backbones like MobileNetV3-Small can provide useful embeddings for downstream classification tasks when frozen.
    Assumes transfer learning works well for this domain.
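For orientation on the ledger's single free parameter: one hypothetical head shape that lands at roughly the reported 191K, assuming the 576-dimensional MobileNetV3-Small embedding. The layer widths below are invented for illustration; the review does not spell out the actual architecture.

```python
def linear_params(n_in, n_out):
    # weight matrix plus bias vector of one fully connected layer
    return n_in * n_out + n_out

# hypothetical head: 576-d embedding -> 330 hidden units -> 2 classes
layers = [(576, 330), (330, 2)]
total = sum(linear_params(i, o) for i, o in layers)
print(total)  # 191072, i.e. roughly the reported 191K
```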

pith-pipeline@v0.9.0 · 5512 in / 1429 out tokens · 41542 ms · 2026-05-12T02:25:50.961514+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. Albiero, V., Srinivas, N., Villalobos, E., Perez-Facuse, J., Rosenthal, R., Mery, D., Ricanek, K., Bowyer, K.W.: Identity document to selfie face matching across adolescence. In: 2020 IEEE International Joint Conference on Biometrics (IJCB), pp. 1-9. IEEE (2020)
  2. Bae, Y.Y., Cho, D.J., Jung, K.H.: Enhancing document forgery detection with edge-focused deep learning. Symmetry 17(8) (2025). https://doi.org/10.3390/sym17081208
  3. ISO/IEC: ISO/IEC 30107: Information technology - biometric presentation attack detection (2016)
  4. ISO/IEC: Information technology - biometric presentation attack detection - part 3: Testing and reporting. Standard, International Organization for Standardization, Geneva, Switzerland (2017)
  5. Chen, X., Dong, C., Ji, J., Cao, J., Li, X.: Image manipulation detection by multi-view multi-scale supervision (2021). https://doi.org/10.1109/ICCV48922.2021.01392
  6. Dai, K., Alonso, J., Gutiérrez-Meana, J.: A machine learning framework for forgery detection in digital ID documents. Journal of Innovation Management 13(2), XVI-XXV (Sep 2025). https://doi.org/10.24840/2183-0606_013.002_l003
  7. Fathy, M.E., Patel, V.M., Chellappa, R.: Face-based active authentication on mobile devices. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1687-1691. IEEE (2015)
  8. Gonzalez, S., Tapia, J.E.: Forged presentation attack detection for ID cards on remote verification systems. Pattern Recognition 162(C) (Jun 2025). https://doi.org/10.1016/j.patcog.2025.111352
  9. Gu, Z., Zhu, B., Zhu, G., Chen, Y., Tang, M., Wang, J.: UniVAD: A training-free unified model for few-shot visual anomaly detection. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15194-15203 (2025). https://doi.org/10.1109/CVPR52734.2025.01415
  10. Guillaro, F., Cozzolino, D., Sud, A., Dufour, N., Verdoliva, L.: TruFor: Leveraging all-round clues for trustworthy image forgery detection and localization. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20606-20615 (2023). https://doi.org/10.1109/CVPR52729.2023.01974
  11. Guo, X., Liu, X., Ren, Z., Grosz, S., Masi, I., Liu, X.: Hierarchical fine-grained image forgery detection and localization. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3155-3165 (2023). https://doi.org/10.1109/CVPR52729.2023.00308
  12. Korshunov, P., Mohammadi, A., Vidit, V., Ecabert, C., Marcel, S.: FantasyID: A dataset for detecting digital manipulations of ID-documents. arXiv preprint arXiv:2507.20808 (2025)
  13. Kwon, M.J., Yu, I.J., Nam, S.H., Lee, H.K.: CAT-Net: Compression artifact tracing network for detection and localization of image splicing. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 375-384 (2021). https://doi.org/10.1109/WACV48630.2021.00042
  14. Li, W., Li, B., Zheng, K., Li, S., Li, H.: Document image forgery detection and localization in desensitization scenarios. Signal Processing 238, 110123 (2025). https://doi.org/10.1016/j.sigpro.2025.110123
  15. Liu, H., Tan, Z., Tan, C., Wei, Y., Wang, J., Zhao, Y.: Forgery-aware adaptive transformer for generalizable synthetic image detection. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10770-10780. IEEE Computer Society (Jun 2024). https://doi.org/10.1109/CVPR52733.2024.01024
  16. Liu, X., Liu, Y., Chen, J., Liu, X.: PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology 32(11), 7505-7517 (2022). https://doi.org/10.1109/TCSVT.2022.3189545
  17. Luo, D., Zhou, Y., Yang, R., Liu, Y., Liu, X., Zeng, J., Yang, B., Huang, Z., Jin, L.: ICDAR 2023 competition on detecting tampered text in images. In: Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), pp. 587-600 (2023)
  18. Martínez Tornés, B.M., Taburet, T., Boros, E., Rouis, K., Doucet, A., Sidère, N., Poulain D'Andecy, V.: Receipt dataset for document forgery detection. In: Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR), pp. 454-469 (2023)
  19. Perera, P., Patel, V.M.: Face-based multiple user active authentication on mobile devices. IEEE Transactions on Information Forensics and Security 14(5), 1240-1250 (2018)
  20. Polevoy, D.V., Sigareva, I.V., Ershova, D.M., Arlazarov, V.V., Nikolaev, D.P., Ming, Z., Luqman, M.M., Burie, J.C.: Document liveness challenge dataset (DLC-2021). Journal of Imaging 8(7), 181 (2022). https://doi.org/10.3390/jimaging8070181
  21. Shi, Y., Jain, A.K.: DocFace+: ID document to selfie matching. IEEE Transactions on Biometrics, Behavior, and Identity Science 1(1), 56-67 (2019)
  22. Tapia, J.E., Damer, N., Busch, C., Espin, J.M., Barrachina, J., Rocamora, A.S., Ocvirk, K., Alessio, L., Batagelj, B., Patwardhan, S., et al.: First competition on presentation attack detection on ID card. In: 2024 IEEE International Joint Conference on Biometrics (IJCB), pp. 1-10. IEEE (2024)
  23. Triaridis, K., Mezaris, V.: Exploring multi-modal fusion for image manipulation detection and localization. In: Proceedings of the 30th International Conference on MultiMedia Modeling (MMM), Lecture Notes in Computer Science, vol. 14556, pp. 198-211. Springer (2024). https://doi.org/10.1007/978-3-031-53311-2_15
  24. Wang, H., Li, S., Cao, S., Yang, R., Zeng, J., Qian, Z., Zhang, X.: On physically occluded fake identity document detection. In: Proceedings of the 31st ACM International Conference on Multimedia (MM '23), pp. 1556-1564. ACM (2023). https://doi.org/10.1145/3581783.3612075
  25. Wang, L., Li, Z., Zhao, W.: Research on identity document image tampering detection based on texture understanding and multistream networks. Journal of Electronic Imaging 34(4), 043018 (2025). https://doi.org/10.1117/1.JEI.34.4.043018
  26. Xu, J., Jia, D., Lin, Z., Zhou, T.: PSFNet: A deep learning network for fake passport detection. IEEE Access 10, 123337-123348 (2022). https://doi.org/10.1109/ACCESS.2022.3224235
  27. Yu, Z., Ni, J., Lin, Y., Deng, H., Li, B.: DiffForensics: Leveraging diffusion prior to image forgery detection and localization. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12765-12774 (2024). https://doi.org/10.1109/CVPR52733.2024.01213
  28. Zhang, Y., Li, Q., Yu, Z., Shen, L.: Distilled transformers with locally enhanced global representations for face forgery detection. Pattern Recognition 161, 111253 (2025)
  29. Zheng, L., Zhang, Y., Thing, V.L.: A survey on image tampering and its detection in real-world photos. Journal of Visual Communication and Image Representation 58, 380-399 (2019)