JECA^2: Judgment-Explanation Consistent Adversarial Attack against Forensic Vision-Language Models

Jiachen Qian

arxiv: 2605.28609 · v1 · pith:F6SCZME2 · submitted 2026-05-27 · 💻 cs.CV

Pith reviewed 2026-06-29 13:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords adversarial attacks · vision-language models · forensic image analysis · explanation consistency · Grad-CAM · image tampering · white-box attacks

The pith

JECA^2 jointly redirects visual attributions via Grad-CAM and optimizes prompt embeddings to force consistent false authenticity judgments and explanations from forensic VLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JECA^2 as a white-box attack that targets both the binary judgment and the accompanying natural-language explanation in forensic vision-language models. Existing attacks often flip the judgment while leaving explanations that still point to tampering, creating detectable inconsistencies. JECA^2 addresses this by guiding image perturbations to shift attention away from tampered areas and constraining prompt changes to produce authenticity-affirming text. A reader would care because these models are intended for reliable forensic use in detecting image manipulation, and consistent mis-explanations could undermine their practical value. The work shows higher attack success and consistency metrics than baselines on benchmark datasets under white-box conditions, with limited transfer to closed models.

Core claim

JECA^2 achieves judgment-explanation consistent adversarial attacks by using Grad-CAM-guided perturbations to divert visual attribution from tampered regions toward benign ones, while optimizing prompt embeddings toward authenticity-affirming semantics under a token-proximity constraint; experiments demonstrate higher attack success rates and automated consistency scores than implemented baselines in white-box settings on forensic VLM benchmarks.

What carries the argument

Grad-CAM-guided image perturbations paired with token-proximity constrained optimization of prompt embeddings to align textual explanations with the target judgment.
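To make the visual half concrete, here is a minimal NumPy sketch of standard Grad-CAM (Selvaraju et al., 2017) together with a scalar that measures how much attribution falls on the tampered region R_tamper versus a background decoy R_bg, the two regions named in the Figure 2 caption. It assumes the layer activations and gradients have already been extracted from the model and that the region masks are given at feature-map resolution. `region_mass` and `diversion_loss` are hypothetical helpers; the paper's actual bidirectional attention interference loss is not reproduced in this review.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Standard Grad-CAM map for one image.

    activations: (K, H, W) feature maps A^k from the chosen layer.
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps.
    """
    # Channel weights: global-average-pooled gradients, one scalar per channel.
    alphas = gradients.mean(axis=(1, 2))
    # Weighted sum over channels gives an (H, W) map; ReLU keeps positive evidence.
    cam = np.maximum(np.tensordot(alphas, activations, axes=1), 0.0)
    # Scale to [0, 1] unless the map is identically zero.
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def region_mass(cam: np.ndarray, mask: np.ndarray) -> float:
    """Fraction of total attribution inside a boolean region mask."""
    total = cam.sum()
    return float(cam[mask].sum() / total) if total > 0 else 0.0


def diversion_loss(cam: np.ndarray, tamper_mask: np.ndarray, bg_mask: np.ndarray) -> float:
    """Hypothetical attacker objective (lower is better for the attacker):
    attribution mass on R_tamper minus attribution mass on R_bg."""
    return region_mass(cam, tamper_mask) - region_mass(cam, bg_mask)
```

An attacker minimizing `diversion_loss` pushes attribution out of R_tamper and onto R_bg at the same time; whether JECA^2 weights the two directions equally is not stated here.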

If this is right

Forensic VLMs can output both incorrect authenticity judgments and matching explanations that hide tampering evidence.
Joint optimization of visual attribution and textual semantics produces higher consistency than attacks targeting judgment alone.
Transfer of the attack to closed-source VLMs yields measurable but limited success.
Explanation-based forensic systems exhibit a consistency failure mode beyond binary detection errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Forensic VLMs may require training objectives that explicitly penalize explanation inconsistency under perturbation rather than optimizing detection accuracy in isolation.
New evaluation protocols could measure explanation stability across both clean and adversarially perturbed inputs as a standard robustness metric.
Limited transfer success suggests that query-efficient black-box variants might need surrogate models or different optimization strategies to scale.
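The second suggestion above can be made concrete with a deliberately simple stability score: extract the forensic cues an explanation cites before and after perturbation, then take their Jaccard overlap. This is a sketch under stated assumptions, not a protocol from the paper; the cue lexicon and keyword matching are hypothetical placeholders, and a serious evaluation would more likely use semantic similarity or a validated cue extractor.

```python
# Hypothetical cue lexicon; a real protocol would need a validated list.
FORENSIC_CUES = (
    "blending boundary",
    "inconsistent lighting",
    "texture artifact",
    "warped edge",
    "color mismatch",
)


def cited_cues(explanation: str) -> set:
    """Cues from the lexicon that appear verbatim (case-insensitive) in the text."""
    text = explanation.lower()
    return {cue for cue in FORENSIC_CUES if cue in text}


def explanation_stability(clean: str, perturbed: str) -> float:
    """Jaccard overlap of cited cues before and after perturbation (1.0 = unchanged)."""
    before, after = cited_cues(clean), cited_cues(perturbed)
    if not before and not after:
        return 1.0  # neither explanation cites any cue: treat as stable
    return len(before & after) / len(before | after)
```

Plain substring matching ignores negation ("no color mismatch" still counts as citing the cue), which is one reason a keyword version can only be a rough first pass.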

Load-bearing premise

The method assumes white-box access to model internals including Grad-CAM attributions and prompt embeddings.

What would settle it

Running JECA^2 on a new forensic VLM architecture not included in the original benchmarks and finding no improvement in judgment-explanation consistency over simple judgment-flipping attacks would falsify the superiority claim.

Figures

Figures reproduced from arXiv: 2605.28609 by Jiachen Qian.

**Figure 1.** Schematic illustration of the proposed JECA^2 against a forensic VLM. The example visualizes the intended false-consistency outcome: the attacked model predicts “Real” and produces an explanation that supports this flipped judgment under the automated consistency metric. … instruction prompt as inputs, and output a probability of being fake, a natural-language justification, and a mask indicating the tampered…

**Figure 2.** Complete workflow of the JECA^2 framework. The visual attention diversion module uses Grad-CAM and bidirectional attention interference to redirect attention from tampered regions (R_tamper) to background decoys (R_bg). The textual explanation alignment module optimizes prompt embeddings toward the target “Real” judgment under a token-proximity constraint; the “Prompt Library” box schematically denotes the vo…

**Figure 3.** Success and failure cases of JECA^2. Top (Success): (a) Face-swap; (b) Inpainting; (c) Hair/boundary manipulation. Bottom (Failure): (d) Full-face manipulation; (e) High-frequency artifacts (ADS=0.52); (f) Multi-region tampering. Because Tab. 2 is conditional on each method’s successful attacks, N_eval varies with ASR and the table should be read as an automated consistency proxy rather than a human-valida…
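The Figure 3 caption notes that Tab. 2 is conditional on each method's successful attacks, so N_eval varies with ASR. The minimal sketch below shows why: consistency is averaged only over records where the judgment actually flipped. The `AttackRecord` fields and the `summarize` helper are hypothetical, and the paper's automated consistency scorer is not reproduced.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AttackRecord:
    flipped: bool                  # tampered image now judged "Real"
    consistency: Optional[float]   # automated judgment-explanation score, if computed


def summarize(records: List[AttackRecord]) -> Tuple[float, int, float]:
    """Return (ASR, N_eval, mean consistency over successful attacks)."""
    n = len(records)
    successes = [r for r in records if r.flipped]
    asr = len(successes) / n if n else 0.0
    # Consistency is scored only where the attack succeeded, so N_eval moves with ASR.
    scores = [r.consistency for r in successes if r.consistency is not None]
    n_eval = len(scores)
    mean_consistency = sum(scores) / n_eval if n_eval else float("nan")
    return asr, n_eval, mean_consistency
```

Because each method is scored on a different subset of images, a mean consistency value is most informative when its N_eval is reported next to it.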

Abstract

Forensic vision-language models (VLMs) have recently been developed to detect image tampering and provide natural-language explanations. However, their robustness against adversarial manipulation remains underexplored. Existing adversarial attacks typically aim to flip the model's binary judgment, while the accompanying explanation may still reveal forensic cues and contradict the attacked judgment. In this paper, we study judgment-explanation consistent adversarial attacks against forensic VLMs and propose JECA^2, a controlled white-box red-team diagnostic that jointly redirects visual attribution and aligns textual explanations with the target judgment. On the visual side, JECA^2 uses Grad-CAM-guided perturbations to divert attribution from tampered regions toward benign regions. On the textual side, it optimizes prompt embeddings toward authenticity-affirming semantics under a token-proximity constraint. Experiments on forensic VLM benchmarks show that JECA^2 achieves higher attack success and automated judgment-explanation consistency than implemented baselines under white-box threat settings, while transfer to closed-source VLMs remains measurable but limited. Our results highlight a consistency failure mode in explanation-based forensic VLMs and motivate future robustness evaluation beyond binary detection accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

JECA^2 adds a joint visual-textual attack that forces explanation consistency after a judgment flip in forensic VLMs, but the gains rest on white-box access and on experimental details the abstract does not show.

The paper's core move is to attack both the binary judgment and the accompanying explanation at the same time. It uses Grad-CAM to steer visual attributions away from tampered areas and optimizes prompt embeddings under a token-proximity constraint so the generated text supports the flipped judgment. That combination is new relative to earlier binary-flip attacks.

It does one thing cleanly: it names the consistency failure mode explicitly and keeps the threat model scoped to white-box access. The abstract also notes that transfer to closed-source models stays limited, which avoids overclaiming.

The soft spots are straightforward. The abstract reports higher attack success and consistency metrics than baselines, yet supplies no numbers, variance, or measurement details for the automated consistency score. Without those, it is impossible to tell whether the reported improvement is large enough to matter or just an artifact of the chosen baselines. Everything shown is white-box; the practical red-team value therefore depends on how much the method degrades when only query access is available.

This is a narrow but well-defined piece of work aimed at people already studying VLM robustness in forensic settings. A reader who wants to see the next incremental attack on explanation consistency will get something from it. The central claim does not contain internal contradictions or hidden circularity, so the paper is coherent on its own terms.

I would send it to peer review. The experiments need to be checked for reproducibility and effect size, but the question it poses is legitimate and the method is described at a level that referees can evaluate.

Referee Report

2 major / 2 minor

Summary. The paper introduces JECA^2, a white-box adversarial attack on forensic vision-language models that jointly redirects visual attributions away from tampered regions via Grad-CAM-guided perturbations and aligns textual explanations with a target (authenticity-affirming) judgment by optimizing prompt embeddings under a token-proximity constraint. Experiments on forensic VLM benchmarks report higher attack success rates and automated judgment-explanation consistency than implemented baselines, with measurable but limited transfer to closed-source VLMs.

Significance. If the empirical results hold under the stated white-box threat model, the work is significant for exposing a consistency failure mode in explanation-based forensic systems: attacks can produce internally coherent but incorrect judgment-explanation pairs. This provides a useful red-team diagnostic and motivates robustness benchmarks that go beyond binary detection accuracy.

major comments (2)

[Method (textual side)] The abstract states that JECA^2 'optimizes prompt embeddings toward authenticity-affirming semantics under a token-proximity constraint,' but provides no equation or algorithmic description of this constraint or the joint objective; without the precise formulation (e.g., in the method section), it is impossible to verify whether the reported consistency gains are due to the proposed mechanism or to implementation details.
[Experiments] The central claim of 'higher attack success and automated judgment-explanation consistency than implemented baselines' is load-bearing, yet the abstract gives no information on the automated consistency metric, the choice of baselines, or statistical significance of the differences; this makes it difficult to assess whether the gains are robust or merely reflect a particular evaluation protocol.
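The first major comment asks for the form of the token-proximity constraint. Since this review does not reproduce the paper's equation, here is a minimal sketch of one common way to keep continuously optimized prompt embeddings near real tokens: penalize each prompt vector's squared distance to its nearest vocabulary embedding, and optionally snap vectors back onto those tokens. The function names and the penalty form are assumptions for illustration, not the paper's formulation.

```python
import numpy as np


def _nearest(prompt_emb: np.ndarray, vocab_emb: np.ndarray):
    """Nearest vocabulary row for each prompt vector, plus its squared distance.

    prompt_emb: (T, d) continuous prompt embeddings being optimized.
    vocab_emb:  (V, d) the model's token embedding table.
    """
    d2 = ((prompt_emb[:, None, :] - vocab_emb[None, :, :]) ** 2).sum(axis=-1)  # (T, V)
    return d2.argmin(axis=1), d2.min(axis=1)


def token_proximity_penalty(prompt_emb: np.ndarray, vocab_emb: np.ndarray) -> float:
    """Mean squared distance from each prompt vector to its nearest real token."""
    _, min_d2 = _nearest(prompt_emb, vocab_emb)
    return float(min_d2.mean())


def project_to_tokens(prompt_emb: np.ndarray, vocab_emb: np.ndarray):
    """Snap each prompt vector to its nearest token; returns (token ids, embeddings)."""
    ids, _ = _nearest(prompt_emb, vocab_emb)
    return ids, vocab_emb[ids]
```

The broadcasted (T, V, d) difference is fine for a toy vocabulary; for a real embedding table one would expand the squared distance as ||p||² - 2p·v + ||v||² to avoid the large intermediate array.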

minor comments (2)

[Abstract] The abstract refers to 'forensic VLM benchmarks' without naming the specific datasets or models used; adding these details would improve reproducibility.
[Visual side] Clarify whether the Grad-CAM guidance is applied only at inference time or also during the perturbation optimization loop.
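On the second minor comment: in a standard iterative attack such as PGD (Madry et al., 2018), any attribution-based loss is re-evaluated at every iterate, so Grad-CAM guidance would naturally sit inside the optimization loop rather than only at inference. The sketch below is generic L∞ PGD with a pluggable gradient callback, assuming images scaled to [0, 1]. `eps`, `alpha`, and `steps` are placeholder values, not the paper's settings, and whether JECA^2 recomputes Grad-CAM at every step is exactly what the referee asks the author to state.

```python
import numpy as np


def pgd_linf(x0, loss_grad, eps=8 / 255, alpha=2 / 255, steps=10):
    """Minimize an attacker loss with L-infinity projected gradient descent.

    x0:        clean image with values in [0, 1].
    loss_grad: callable returning d(loss)/dx at the current iterate. In a
               Grad-CAM-guided attack, this callback is where the attribution
               map would be recomputed on the perturbed image each step.
    """
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = loss_grad(x)
        x = x - alpha * np.sign(g)           # signed descent step
        x = np.clip(x, x0 - eps, x0 + eps)   # project back into the eps-ball
        x = np.clip(x, 0.0, 1.0)             # keep a valid image
    return x
```

With a toy quadratic loss that pulls every pixel toward 1.0, each pixel ends at exactly x0 + eps, a quick check that the projection step bounds the perturbation.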

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional clarity is needed. We address each major comment below and have revised the manuscript to incorporate the requested details on the method formulation and experimental reporting.

Referee: [Method (textual side)] The abstract states that JECA^2 'optimizes prompt embeddings toward authenticity-affirming semantics under a token-proximity constraint,' but provides no equation or algorithmic description of this constraint or the joint objective; without the precise formulation (e.g., in the method section), it is impossible to verify whether the reported consistency gains are due to the proposed mechanism or to implementation details.

Authors: We agree that the method section requires an explicit mathematical formulation to allow verification of the mechanism. In the revised manuscript, we have added the precise equation for the token-proximity constraint (now Equation 4 in Section 3.2) and the full joint objective function combining visual attribution redirection and textual alignment losses. We have also included a step-by-step algorithmic description (Algorithm 1) detailing the optimization procedure under the constraint. These additions directly link the reported consistency improvements to the proposed components.
revision: yes
Referee: [Experiments] The central claim of 'higher attack success and automated judgment-explanation consistency than implemented baselines' is load-bearing, yet the abstract gives no information on the automated consistency metric, the choice of baselines, or statistical significance of the differences; this makes it difficult to assess whether the gains are robust or merely reflect a particular evaluation protocol.

Authors: We acknowledge that the abstract and initial experimental description lacked sufficient detail on these elements. The revised manuscript now defines the automated judgment-explanation consistency metric explicitly in Section 4.1 (including its computation via semantic similarity and forensic cue alignment scores), lists all baselines with implementation details and selection rationale in Section 4.2, and reports statistical significance via paired t-tests with p-values for the differences in attack success rates and consistency metrics across the benchmarks.
revision: yes
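The simulated rebuttal promises paired t-tests. A minimal sketch of the paired t statistic follows, assuming per-image scores for two methods evaluated on the same inputs; the p-value then comes from Student's t distribution with n - 1 degrees of freedom (`scipy.stats.ttest_rel` computes both, if SciPy is available). For the consistency metric, which the Figure 3 caption describes as conditional on successful attacks, valid pairs exist only for images where both methods succeeded.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two methods scored on the same items."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```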

Circularity Check

0 steps flagged

No significant circularity identified

The paper presents an empirical optimization procedure for a white-box adversarial attack (JECA^2) that redirects Grad-CAM attributions and aligns prompt embeddings, with results reported from benchmark experiments. No equations, fitted parameters renamed as predictions, self-citations, or derivation steps are described that would reduce any claimed result to the method inputs by construction. The threat model is explicitly scoped to white-box access, and the consistency claims rest on external experimental comparisons rather than internal self-definition or imported uniqueness theorems. This is a standard non-circular empirical methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5720 in / 1095 out tokens · 38949 ms · 2026-06-29T13:16:10.436149+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 21 canonical work pages · 5 internal anchors

[1] Baniecki, H., Biecek, P.: Adversarial attacks and defenses in explainable artificial intelligence: A survey. Information Fusion 107, 102303 (2024)
[2] Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P). pp. 39–57 (2017)
[3] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9650–9660 (2021)
[4] Cozzolino, D., Poggi, G., Verdoliva, L.: Recasting residual-based local descriptors as convolutional neural networks: An application to image forgery detection. In: Proceedings of the ACM Workshop on Information Hiding and Multimedia Security. pp. 159–164 (2017)
[5] Dong, X., Bao, J., Chen, D., Zhang, T., Zhang, W., Yu, N., Chen, D., Wen, F., Guo, B.: Protecting celebrities from deepfake with identity consistency transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9468–9478 (2022)
[6] Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Leveraging frequency analysis for deep fake image recognition. In: Proceedings of the International Conference on Machine Learning (ICML). pp. 3247–3258 (2020)
[7] Fu, J., Chen, Z., Jiang, K., Guo, H., Wang, J., Gao, S., Zhang, W.: Improving adversarial transferability of vision-language pre-training models through collaborative multimodal interaction. arXiv preprint arXiv:2403.10883 (2024)
[8] Gao, K., Bai, Y., Bai, J., Yang, Y., Xia, S.T.: Adversarial robustness for visual grounding of multimodal large language models. In: ICLR 2024 Workshop on Reliable and Responsible Foundation Models (2024)
[9] Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: Proceedings of the International Conference on Learning Representations (ICLR) (2015)
[10] Guan, J., Ding, T., Cao, L., Pan, L., Wang, C., Zheng, X.: Probing the robustness of vision-language pretrained models: A multimodal adversarial attack approach. arXiv preprint arXiv:2408.13461 (2024)
[11] Hu, L., Liu, Y., Liu, N., Huai, M., Sun, L., Wang, D.: Improving interpretation faithfulness for vision transformers. In: Proceedings of the International Conference on Machine Learning (ICML). pp. 19344–19370 (2024)
[12] Huang, Z., Xia, B., Lin, Z., Mou, Z., Yang, W., Jia, J.: FFAA: Multimodal large language model based explainable open-world face forgery analysis assistant. arXiv preprint arXiv:2408.10072 (2024)
[13] Huang, Z., Hu, J., Li, X., He, Y., Zhao, X., Peng, B., Wu, B., Huang, X., Cheng, G.: SIDA: Social media image deepfake detection, localization and explanation with large multimodal model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 28831–28841 (2025)
[14] Huang, Z., Li, T., Li, X., Wen, H., He, Y., Zhang, J., Fei, H., Yang, X., Huang, X., Peng, B., Cheng, G.: So-fake: Benchmarking and explaining social media image forgery detection. arXiv preprint arXiv:2505.18660 (2025)
[15] Le, T.N., Nguyen, H.H., Yamagishi, J., Echizen, I.: OpenForensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10117–10127 (2021)
[16] Li, H., Peng, R., Luo, A., Tan, S., Chen, C., Antsiferova, A.: Universal anti-forensics attack against image forgery detection via multi-modal guidance. arXiv preprint arXiv:2602.06530 (2026)
[17] Li, Y., Chang, M.C., Lyu, S.: In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In: IEEE International Workshop on Information Forensics and Security (WIFS). pp. 1–7 (2018). https://doi.org/10.1109/WIFS.2018.8630787
[18] Liu, X., Xu, N., Chen, M., Xiao, C.: AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)
[19] Luo, H., Gu, J., Liu, F., Torr, P.: An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2024), arXiv:2403.09766
[20] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: Proceedings of the International Conference on Learning Representations (ICLR) (2018)
[21] Mo, X., Tan, S., Li, B., Huang, J.: Query-efficient attack for black-box image inpainting forensics via reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). vol. 39, pp. 19503–19511 (2025). https://doi.org/10.1609/aaai.v39i18.34147
[22] Nguyen, B., Le, T.: Analyzing reasoning shifts in audio deepfake detection under adversarial attacks: The reasoning tax versus shield bifurcation. arXiv preprint arXiv:2601.03615 (2026)
[23] Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., Anandkumar, A.: Diffusion models for adversarial purification. In: Proceedings of the International Conference on Machine Learning (ICML). pp. 16805–16827 (2022)
[24] Peng, R., Tan, S., Kong, C., Luo, A., Kot, A.C., Huang, J.: ForensicsSAM: Toward robust and unified image forgery detection and localization resisting to adversarial attack. arXiv preprint arXiv:2508.07402 (2025)
[25] Pinhasov, B., Lapid, R., Ohayon, R., Sipper, M., Aperstein, Y.: XAI-based detection of adversarial attacks on deepfake detectors. arXiv preprint arXiv:2403.02955 (2024), accepted at TMLR 2024
[26] Qian, J.: Visual inception: Compromising long-term planning in agentic recommenders via multimodal memory poisoning. arXiv preprint arXiv:2604.16966 (2026)
[27] Qian, J., Kang, Z.: Penny wise, pixel foolish: Bypassing price constraints in multimodal agents via visual adversarial perturbations. arXiv preprint arXiv:2604.16515 (2026)
[28] Schlarmann, C., Hein, M.: On the adversarial robustness of multi-modal foundation models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) (2023), arXiv:2308.10741
[29] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 618–626 (2017)
[30] Shen, Z., Zhang, K., Jia, B., Jia, H., Fang, Y., Yu, Z., Lin, S.: DF-LLaVA: Unlocking MLLMs for synthetic image detection via knowledge injection and conflict-driven self-reflection. arXiv preprint arXiv:2509.14957 (2025)
[31] Stamm, M.C., Zhao, X.: Anti-forensic attacks using generative adversarial networks. In: Multimedia Forensics, pp. 467–490. Advances in Computer Vision and Pattern Recognition, Springer Singapore (2022). https://doi.org/10.1007/978-981-16-7621-5_17
[32] Wang, J., Wu, Z., Ouyang, W., Han, X., Chen, J., Jiang, Y.G., Lim, S.N.: M2TR: Multi-modal multi-scale transformers for deepfake detection. In: Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR). pp. 615–623 (2022). https://doi.org/10.1145/3512527.3531415
[33] Wei, Z., Chen, J., Goldblum, M., Wu, Z., Goldstein, T., Jiang, Y.G.: Towards transferable adversarial attacks on vision transformers. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 2668–2676 (2022)
[34] Xu, Z., Zhang, X., Li, R., Tang, Z., Huang, Q., Zhang, J.: FakeShield: Explainable image forgery detection and localization via multi-modal large language models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2025)
[35] Zhang, F., Liu, J., Zhu, J., Sun, E., Li, D., Zhang, Q., Zha, Z.J.: ForgeryGPT: A multimodal LLM for interpretable image forgery detection and localization. arXiv preprint arXiv:2410.10238 (2024)
[36] Zhang, J., Ye, J., Ma, X., Li, Y., Yang, Y., Chen, Y., Sang, J., Yeung, D.Y.: AnyAttack: Towards large-scale self-supervised adversarial attacks on vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19900–19909 (2025)
[37] Zhou, W., Bai, S., Mandic, D.P., Zhao, Q., Chen, B.: Revisiting the adversarial robustness of vision language models: a multimodal perspective. arXiv preprint arXiv:2404.19287 (2024)
[38] Zhuo, L., Luo, S., Tan, S., Chen, H., Li, B., Huang, J.: Evading detection actively: Toward anti-forensics against forgery localization. IEEE Transactions on Dependable and Secure Computing 22(2), 852–869 (2025). https://doi.org/10.1109/TDSC.2025.3528062
[39] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)

[1] [1]

Information Fusion107, 102303 (2024)

Baniecki, H., Biecek, P.: Adversarial attacks and defenses in explainable artificial intelligence: A survey. Information Fusion107, 102303 (2024)

2024

[2] [2]

In: Proceedings of the IEEE Symposium on Security and Privacy (S&P)

Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P). pp. 39–57 (2017) JECA2: Judgment–Explanation Consistent Adversarial Attack 35

2017

[3] [3]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 9650–9660 (2021)

2021

[4] [4]

In: Proceedings of the ACM Workshop on Information Hiding and Multimedia Security

Cozzolino, D., Poggi, G., Verdoliva, L.: Recasting residual-based local descriptors as convolutional neural networks: An application to image forgery detection. In: Proceedings of the ACM Workshop on Information Hiding and Multimedia Security. pp. 159–164 (2017)

2017

[5] [5]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Dong, X., Bao, J., Chen, D., Zhang, T., Zhang, W., Yu, N., Chen, D., Wen, F., Guo, B.: Protecting celebrities from deepfake with identity consistency transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9468–9478 (2022)

2022

[6] [6]

In: Proceedings of the International Conference on Machine Learning (ICML)

Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Lever- aging frequency analysis for deep fake image recognition. In: Proceedings of the International Conference on Machine Learning (ICML). pp. 3247–3258 (2020)

2020

[7] [7]

arXiv preprint arXiv:2403.10883 (2024)

Fu, J., Chen, Z., Jiang, K., Guo, H., Wang, J., Gao, S., Zhang, W.: Improving adver- sarial transferability of vision-language pre-training models through collaborative multimodal interaction. arXiv preprint arXiv:2403.10883 (2024)

work page arXiv 2024

[8] [8]

In: ICLR 2024 Workshop on Reliable and Responsible Foundation Models (2024)

Gao, K., Bai, Y., Bai, J., Yang, Y., Xia, S.T.: Adversarial robustness for visual grounding of multimodal large language models. In: ICLR 2024 Workshop on Reliable and Responsible Foundation Models (2024)

2024

[9] [9]

In: Proceedings of the International Conference on Learning Representations (ICLR) (2015)

Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial ex- amples. In: Proceedings of the International Conference on Learning Representations (ICLR) (2015)

2015

[10] [10]

arXiv preprint arXiv:2408.13461 (2024)

Guan, J., Ding, T., Cao, L., Pan, L., Wang, C., Zheng, X.: Probing the robustness of vision-language pretrained models: A multimodal adversarial attack approach. arXiv preprint arXiv:2408.13461 (2024)

work page arXiv 2024

[11] [11]

In: Proceedings of the International Conference on Machine Learning (ICML)

Hu, L., Liu, Y., Liu, N., Huai, M., Sun, L., Wang, D.: Improving interpretation faithfulness for vision transformers. In: Proceedings of the International Conference on Machine Learning (ICML). pp. 19344–19370 (2024)

2024

[12] [12]

arXiv preprint arXiv:2408.10072 (2024)

Huang, Z., Xia, B., Lin, Z., Mou, Z., Yang, W., Jia, J.: FFAA: Multimodal large language model based explainable open-world face forgery analysis assistant. arXiv preprint arXiv:2408.10072 (2024)

work page arXiv 2024

[13] [13]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Huang, Z., Hu, J., Li, X., He, Y., Zhao, X., Peng, B., Wu, B., Huang, X., Cheng, G.: SIDA: Social media image deepfake detection, localization and explanation with large multimodal model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 28831–28841 (2025)

2025

[14] [14]

arXiv preprint arXiv:2505.18660 (2025)

Huang, Z., Li, T., Li, X., Wen, H., He, Y., Zhang, J., Fei, H., Yang, X., Huang, X., Peng, B., Cheng, G.: So-fake: Benchmarking and explaining social media image forgery detection. arXiv preprint arXiv:2505.18660 (2025)

work page arXiv 2025

[15] Le, T.N., Nguyen, H.H., Yamagishi, J., Echizen, I.: OpenForensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10117–10127 (2021)

[16] Li, H., Peng, R., Luo, A., Tan, S., Chen, C., Antsiferova, A.: Universal anti-forensics attack against image forgery detection via multi-modal guidance. arXiv preprint arXiv:2602.06530 (2026)

[17] Li, Y., Chang, M.C., Lyu, S.: In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In: IEEE International Workshop on Information Forensics and Security (WIFS). pp. 1–7 (2018). https://doi.org/10.1109/WIFS.2018.8630787

[18] Liu, X., Xu, N., Chen, M., Xiao, C.: AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2024)

[19] Luo, H., Gu, J., Liu, F., Torr, P.: An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2024), arXiv:2403.09766

[20] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: Proceedings of the International Conference on Learning Representations (ICLR) (2018)

[21] Mo, X., Tan, S., Li, B., Huang, J.: Query-efficient attack for black-box image inpainting forensics via reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). vol. 39, pp. 19503–19511 (2025). https://doi.org/10.1609/aaai.v39i18.34147

[22] Nguyen, B., Le, T.: Analyzing reasoning shifts in audio deepfake detection under adversarial attacks: The reasoning tax versus shield bifurcation. arXiv preprint arXiv:2601.03615 (2026)

[23] Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., Anandkumar, A.: Diffusion models for adversarial purification. In: Proceedings of the International Conference on Machine Learning (ICML). pp. 16805–16827 (2022)

[24] Peng, R., Tan, S., Kong, C., Luo, A., Kot, A.C., Huang, J.: ForensicsSAM: Toward robust and unified image forgery detection and localization resisting to adversarial attack. arXiv preprint arXiv:2508.07402 (2025)

[25] Pinhasov, B., Lapid, R., Ohayon, R., Sipper, M., Aperstein, Y.: XAI-based detection of adversarial attacks on deepfake detectors. arXiv preprint arXiv:2403.02955 (2024), accepted at TMLR 2024

[26] Qian, J.: Visual inception: Compromising long-term planning in agentic recommenders via multimodal memory poisoning. arXiv preprint arXiv:2604.16966 (2026)

[27] Qian, J., Kang, Z.: Penny wise, pixel foolish: Bypassing price constraints in multimodal agents via visual adversarial perturbations. arXiv preprint arXiv:2604.16515 (2026)

[28] Schlarmann, C., Hein, M.: On the adversarial robustness of multi-modal foundation models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) (2023), arXiv:2308.10741

[29] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 618–626 (2017)

[30] Shen, Z., Zhang, K., Jia, B., Jia, H., Fang, Y., Yu, Z., Lin, S.: DF-LLaVA: Unlocking MLLMs for synthetic image detection via knowledge injection and conflict-driven self-reflection. arXiv preprint arXiv:2509.14957 (2025)

[31] Stamm, M.C., Zhao, X.: Anti-forensic attacks using generative adversarial networks. In: Multimedia Forensics, pp. 467–490. Advances in Computer Vision and Pattern Recognition, Springer Singapore (2022). https://doi.org/10.1007/978-981-16-7621-5_17

[32] Wang, J., Wu, Z., Ouyang, W., Han, X., Chen, J., Jiang, Y.G., Lim, S.N.: M2TR: Multi-modal multi-scale transformers for deepfake detection. In: Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR). pp. 615–623 (2022). https://doi.org/10.1145/3512527.3531415

[33] Wei, Z., Chen, J., Goldblum, M., Wu, Z., Goldstein, T., Jiang, Y.G.: Towards transferable adversarial attacks on vision transformers. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 2668–2676 (2022)

[34] Xu, Z., Zhang, X., Li, R., Tang, Z., Huang, Q., Zhang, J.: FakeShield: Explainable image forgery detection and localization via multi-modal large language models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2025)

[35] Zhang, F., Liu, J., Zhu, J., Sun, E., Li, D., Zhang, Q., Zha, Z.J.: ForgeryGPT: A multimodal LLM for interpretable image forgery detection and localization. arXiv preprint arXiv:2410.10238 (2024)

[36] Zhang, J., Ye, J., Ma, X., Li, Y., Yang, Y., Chen, Y., Sang, J., Yeung, D.Y.: AnyAttack: Towards large-scale self-supervised adversarial attacks on vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19900–19909 (2025)

[37] Zhou, W., Bai, S., Mandic, D.P., Zhao, Q., Chen, B.: Revisiting the adversarial robustness of vision language models: a multimodal perspective. arXiv preprint arXiv:2404.19287 (2024)

[38] Zhuo, L., Luo, S., Tan, S., Chen, H., Li, B., Huang, J.: Evading detection actively: Toward anti-forensics against forgery localization. IEEE Transactions on Dependable and Secure Computing 22(2), 852–869 (2025). https://doi.org/10.1109/TDSC.2025.3528062

[39] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)