AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images
Pith reviewed 2026-05-07 07:30 UTC · model grok-4.3
The pith
AEGIS benchmark reveals that even advanced models detect AI-generated academic images at only 48.80 percent overall accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AEGIS is a holistic benchmark for forensic analysis of AI-generated academic images. It advances prior work through domain-specific complexity across seven academic categories and 39 subtypes, diverse forgery simulations of four prevalent academic strategies using 25 generative models, and multi-dimensional evaluation that jointly measures detection, reasoning, and localization. When applied to 25 leading multimodal large language models, nine expert models, and one unified model, the benchmark finds GPT-5.1 at 48.80 percent overall performance, expert models limited to 30.09 percent IoU localization accuracy, 11 generative models yielding average forensic accuracy below 50 percent, MLLMs at 84.74 percent accuracy in textual artifact recognition, and expert detectors peaking at 79.54 percent accuracy in binary authenticity detection.
What carries the argument
The AEGIS benchmark dataset and its three-axis evaluation protocol that measures binary authenticity detection, textual artifact reasoning, and pixel-level localization on domain-specific academic images.
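The localization axis of such protocols is typically scored with mask-level Intersection-over-Union (IoU). A minimal sketch of that metric, assuming binary forgery masks; the helper and array names are illustrative, not from the paper:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two boolean forgery masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Both masks empty means the model correctly predicts "no forged region".
    return float(inter) / union if union else 1.0

# Toy 4x4 masks: the predicted region overlaps half of the ground truth.
gt = np.zeros((4, 4), dtype=bool); gt[0:2, 0:2] = True      # 4 forged pixels
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 0:2] = True  # 4 predicted pixels
print(round(mask_iou(pred, gt), 3))  # → 0.333 (2 px overlap / 6 px union)
```

A benchmark-level score like the reported 30.09 percent would then be the mean of this quantity over all test images.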
Load-bearing premise
The 39 academic subtypes and four simulated forgery strategies drawn from 25 generative models sufficiently represent the real-world distribution and difficulty of AI-generated academic images.
What would settle it
A new forensic system that achieves greater than 85 percent accuracy across all detection, reasoning, and localization tasks on the full AEGIS test set while maintaining the same performance on held-out real academic images would falsify the claim of fundamental limitations.
Figures
Original abstract
We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AEGIS, a benchmark for forensic analysis of AI-generated academic images. It spans seven academic categories with 39 fine-grained subtypes, simulates four forgery strategies using 25 generative models, and evaluates 25 MLLMs plus nine expert detectors across detection, reasoning, and localization tasks. Key reported results include GPT-5.1 reaching only 48.80% overall performance, expert models at 30.09% IoU for localization, MLLMs at 84.74% on textual artifact recognition, and expert detectors at 79.54% on binary authenticity detection, positioning AEGIS as a diagnostic testbed for fundamental limitations in academic image forensics.
Significance. If the benchmark construction proves representative of real academic forgeries, AEGIS would offer a useful standardized resource for the community. The multi-dimensional evaluation protocol, broad coverage of model families, and explicit comparison of generative versus forensic capabilities could help guide future work on image integrity in scholarly publishing. The empirical scale (25+ models evaluated) is a practical strength for a benchmark paper.
Major comments (3)
- [§3] §3 (Benchmark Construction): The central claim that low performances (e.g., GPT-5.1 at 48.80%, expert IoU 30.09%) expose 'fundamental limitations' and that 'forensics lag behind generative advances' rests on the assumption that the 39 subtypes and four simulated forgery strategies match the distribution of real AI-generated academic images. The manuscript constructs these via simulation with 25 generative models but reports no external validation (statistical comparison of artifacts, expert realism ratings, or direct comparison against disclosed real AI figures from arXiv or journals). This is load-bearing for the 'diagnostic testbed' assertion.
- [§4] §4 (Evaluation Protocol): Performance numbers are presented as point estimates (e.g., 48.80% overall, 84.74% textual artifact recognition) without reported confidence intervals, per-category variance, or statistical tests for differences between MLLM and expert families. Given the claim of 'complementary strengths' and the use of held-out test images, absence of these measures makes it difficult to judge whether observed gaps are robust or sensitive to image selection.
- [§2, §5] §2 and §5 (Related Work and Discussion): The selection of the 25 generative models and the four forgery strategies is described at a high level, but the paper does not detail inclusion criteria or whether the set covers recent techniques (e.g., latest diffusion or autoregressive models). This affects the strength of the conclusion that 11 models yield <50% forensic accuracy and that forensics are lagging.
Minor comments (2)
- [Tables] Table captions and axis labels in result tables should explicitly define all abbreviations (e.g., IoU, MLLM) and report the exact number of images per subtype to allow reproducibility.
- [Figure 1] The pipeline figure would benefit from explicit call-outs for the four forgery strategies and the three evaluation dimensions (detection, reasoning, localization) to improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review, which has helped us identify areas to strengthen the manuscript. We address each major comment point by point below, with clear indications of planned revisions. Our responses focus on substance and aim to enhance the clarity and robustness of the benchmark without overstating its scope.
Point-by-point responses
Referee: [§3] §3 (Benchmark Construction): The central claim that low performances (e.g., GPT-5.1 at 48.80%, expert IoU 30.09%) expose 'fundamental limitations' and that 'forensics lag behind generative advances' rests on the assumption that the 39 subtypes and four simulated forgery strategies match the distribution of real AI-generated academic images. The manuscript constructs these via simulation with 25 generative models but reports no external validation (statistical comparison of artifacts, expert realism ratings, or direct comparison against disclosed real AI figures from arXiv or journals). This is load-bearing for the 'diagnostic testbed' assertion.
Authors: We appreciate the referee's emphasis on this foundational assumption. The 39 subtypes and four forgery strategies were derived from a systematic review of documented academic image manipulation cases in the literature (e.g., figure duplication, data fabrication, and AI-assisted enhancement in scholarly publishing). While we acknowledge that exhaustive external validation against undisclosed real-world AI-generated figures is not feasible, given the absence of large-scale, labeled public datasets of such forgeries, we will add a dedicated subsection in §3 detailing the derivation process, including specific citations to real-world examples and the rationale for each strategy. This will better substantiate the representativeness claim while preserving the diagnostic value demonstrated by the consistently low forensic performance across diverse models. We view this as a partial revision that clarifies rather than alters the core methodology.
Revision: partial
Referee: [§4] §4 (Evaluation Protocol): Performance numbers are presented as point estimates (e.g., 48.80% overall, 84.74% textual artifact recognition) without reported confidence intervals, per-category variance, or statistical tests for differences between MLLM and expert families. Given the claim of 'complementary strengths' and the use of held-out test images, absence of these measures makes it difficult to judge whether observed gaps are robust or sensitive to image selection.
Authors: We agree that the absence of statistical measures limits the interpretability of the reported gaps. In the revised version, we will augment §4 with 95% bootstrap confidence intervals for all aggregate and per-task metrics (computed over the held-out test set), per-category performance variance, and statistical significance tests (e.g., McNemar's test for binary detection comparisons and paired Wilcoxon tests for localization IoU between MLLM and expert families). These will be added to the main results tables and the discussion of complementary strengths. This revision directly addresses the concern and strengthens the empirical claims.
Revision: yes
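The statistical additions the authors promise are standard. A hedged sketch of a percentile bootstrap confidence interval and McNemar's test, operating on per-image 0/1 correctness vectors; the function and array names are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def bootstrap_ci(correct: np.ndarray, n_boot: int = 2000, alpha: float = 0.05):
    """Percentile bootstrap CI for accuracy over per-image 0/1 outcomes."""
    n = len(correct)
    stats = np.array([correct[rng.integers(0, n, n)].mean()
                      for _ in range(n_boot)])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

def mcnemar_chi2(a: np.ndarray, b: np.ndarray) -> float:
    """McNemar chi-square statistic (continuity-corrected) comparing two
    models scored on the same images; a, b are 0/1 correctness vectors.
    Compare against the chi-square(1) critical value 3.84 for p < 0.05."""
    b01 = int(np.sum((a == 1) & (b == 0)))  # model A right, model B wrong
    b10 = int(np.sum((a == 0) & (b == 1)))  # model A wrong, model B right
    if b01 + b10 == 0:
        return 0.0  # no discordant pairs, no evidence of a difference
    return (abs(b01 - b10) - 1) ** 2 / (b01 + b10)

# Toy per-image correctness for one model on a 12-image subset.
acc = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1])
lo, hi = bootstrap_ci(acc)
print(f"accuracy {acc.mean():.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Because only the discordant pairs enter McNemar's statistic, it is well suited to the paired setting here, where every model is evaluated on the identical held-out image set.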
Referee: [§2, §5] §2 and §5 (Related Work and Discussion): The selection of the 25 generative models and the four forgery strategies is described at a high level, but the paper does not detail inclusion criteria or whether the set covers recent techniques (e.g., latest diffusion or autoregressive models). This affects the strength of the conclusion that 11 models yield <50% forensic accuracy and that forensics are lagging.
Authors: We will expand the relevant sections in §2 (Related Work) and §5 (Discussion) to explicitly state the inclusion criteria: models were chosen for architectural diversity (GAN, diffusion, and autoregressive families), recency (covering developments through early 2024), community adoption, and demonstrated capability in generating academic-style visuals. We will also add a paragraph acknowledging the rapid evolution of generative techniques and noting that while the set captures major paradigms, it cannot include every post-construction model. This clarification will reinforce the conclusion regarding the 11 models below 50% accuracy without overstating coverage.
Revision: yes
Circularity Check
No significant circularity in benchmark construction or evaluation
Full rationale
This is an empirical benchmark paper with no mathematical derivations, fitted parameters, or equations. The central claims rest on direct performance measurements (e.g., GPT-5.1 at 48.80% overall accuracy, expert IoU at 30.09%) obtained by running models on a held-out test set constructed from 7 categories, 39 subtypes, and 4 simulated forgery strategies across 25 generators. These metrics are independent observations against the benchmark images rather than quantities defined in terms of the benchmark itself. No self-citation chain, ansatz smuggling, uniqueness theorem, or renaming of known results is load-bearing for the diagnostic-testbed claim. The representativeness of the chosen subtypes is an external validity assumption, not a circular reduction of the reported results to the inputs by construction. The derivation chain is therefore self-contained.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: The chosen seven academic categories and 39 subtypes adequately represent the intrinsic forensic difficulty of real academic images.
- Domain assumption: The four simulated forgery strategies and 25 generative models produce fakes that match the distribution of actual academic image manipulations.