If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems
Pith reviewed 2026-05-10 02:49 UTC · model grok-4.3
The pith
A multi-agent framework separates perception from decision-making in vision-language agents so that they follow legitimate environmental signals while resisting misleading visual injections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current LVLM-based agents fail to reliably balance responding to legitimate environmental cues while remaining robust to misleading visual injections, either ignoring useful signals or following harmful ones. A multi-agent defense framework that separates perception from decision-making dynamically assesses the reliability of visual inputs. This approach significantly reduces misleading behaviors while preserving correct responses and provides robustness guarantees under adversarial perturbations.
What carries the argument
A multi-agent defense framework that separates perception from decision-making to dynamically assess the reliability of visual inputs.
Load-bearing premise
The dual-intent dataset and embodied settings used in evaluation sufficiently represent real-world visual-injection scenarios, and the multi-agent separation can be implemented without introducing new failure modes or prohibitive overhead.
What would settle it
An experiment showing that agents using the multi-agent framework still follow harmful visual injections at high rates in embodied tasks or fail to respond to legitimate signals would falsify the reduction claim.
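The falsification test above reduces to two rates computed over logged trials. The sketch below makes that concrete; the `trials` record format and field names are illustrative, not taken from the paper:

```python
# Hypothetical trial logs: each trial records whether the visual signal was
# legitimate or an injection, and whether the agent acted on it.
trials = [
    {"signal": "legitimate", "followed": True},
    {"signal": "legitimate", "followed": False},
    {"signal": "injected", "followed": True},
    {"signal": "injected", "followed": False},
    {"signal": "injected", "followed": False},
]

def rate(trials, signal):
    """Fraction of trials with the given signal type that the agent followed."""
    subset = [t for t in trials if t["signal"] == signal]
    return sum(t["followed"] for t in subset) / len(subset)

legit_response_rate = rate(trials, "legitimate")  # higher is better
injection_follow_rate = rate(trials, "injected")  # lower is better
```

The reduction claim fails if a defended agent shows a high `injection_follow_rate` or a low `legit_response_rate`; reporting both rates is what distinguishes the trade-off framing from a pure-robustness evaluation.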
Original abstract
Recent advances in embodied Vision-Language Agentic Systems (VLAS), powered by large vision-language models (LVLMs), enable AI systems to perceive and reason over real-world scenes. Within this context, environmental signals such as traffic lights are essential in-band signals that can and should influence agent behavior. However, similar signals could also be crafted to operate as misleading visual injections, overriding user intent and posing security risks. This duality creates a fundamental challenge: agents must respond to legitimate environmental cues while remaining robust to misleading ones. We refer to this tension as trust boundary confusion. To study this behavior, we design a dual-intent dataset and evaluation framework, through which we show that current LVLM-based agents fail to reliably balance this trade-off, either ignoring useful signals or following harmful ones. We systematically evaluate 7 LVLM agents across multiple embodied settings under both structure-based and noise-based visual injections. To address these vulnerabilities, we propose a multi-agent defense framework that separates perception from decision-making to dynamically assess the reliability of visual inputs. Our approach significantly reduces misleading behaviors while preserving correct responses and provides robustness guarantees under adversarial perturbations. The code of the evaluation framework and artifacts are made available at https://anonymous.4open.science/r/Visual-Prompt-Inject.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the problem of trust boundary confusion in embodied Vision-Language Agentic Systems (VLAS), where agents must respond to legitimate in-band environmental signals (e.g., traffic lights) while resisting crafted misleading visual injections. Using a newly designed dual-intent dataset, the authors evaluate seven LVLM-based agents across multiple embodied settings under structure-based and noise-based injections, showing that current agents either ignore useful signals or follow harmful ones. They propose a multi-agent defense framework that separates perception from decision-making to assess visual input reliability, claiming it significantly reduces misleading behaviors while preserving correct responses and providing robustness guarantees under adversarial perturbations. The evaluation framework and artifacts are released publicly.
Significance. If the central claims hold, the work is significant for highlighting a practical security and reliability challenge in emerging embodied VLAS applications. The multi-agent separation defense offers a plausible architectural mitigation that balances utility and robustness. The open release of code and artifacts is a clear strength that enables reproducibility and community follow-up. However, the overall significance hinges on whether the synthetic dual-intent evaluations generalize beyond the simulated regime to real-world noisy, dynamic environments.
major comments (2)
- [§4] §4 (Evaluation Framework and Dual-Intent Dataset): The central claims—that current agents fail the trade-off and that the defense reduces misleading behaviors while preserving correct responses—rest on results from the constructed dual-intent dataset in embodied simulations. The paper must demonstrate that the dataset faithfully captures the distribution of legitimate signals versus injections without introducing synthetic artifacts (e.g., unnatural placement, lighting, or phrasing) absent from physical deployments; otherwise both the observed failure modes and reported robustness gains risk being testbed-specific rather than intrinsic.
- [§5] §5 (Multi-Agent Defense Framework): The claim of 'robustness guarantees under adversarial perturbations' is load-bearing for the defense contribution. The manuscript should clarify whether these guarantees are formal (e.g., derived bounds or proofs) or purely empirical, and address whether the perception-decision separation can be implemented without introducing new failure modes or prohibitive overhead, as implicitly assumed in the evaluation.
minor comments (2)
- [Abstract] Abstract: The statement that the approach 'significantly reduces misleading behaviors' would benefit from a brief quantitative highlight (e.g., percentage reduction or key metric values) to give readers an immediate sense of effect size.
- [Introduction] Notation and Terminology: The term 'trust boundary confusion' is introduced as a new concept; ensure its definition is clearly distinguished from related notions such as visual prompt injection or adversarial robustness in the introduction and related-work sections.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of our evaluation framework and defense approach.
Point-by-point responses
-
Referee: §4 (Evaluation Framework and Dual-Intent Dataset): The central claims—that current agents fail the trade-off and that the defense reduces misleading behaviors while preserving correct responses—rest on results from the constructed dual-intent dataset in embodied simulations. The paper must demonstrate that the dataset faithfully captures the distribution of legitimate signals versus injections without introducing synthetic artifacts (e.g., unnatural placement, lighting, or phrasing) absent from physical deployments; otherwise both the observed failure modes and reported robustness gains risk being testbed-specific rather than intrinsic.
Authors: We agree that the fidelity of the dual-intent dataset is central to the validity of our claims. The dataset was constructed within standard embodied simulators using object placements, lighting conditions, and signal phrasings drawn from real-world references (e.g., standard traffic signage and common household object interactions). In the revised manuscript we have expanded §4 with an explicit subsection on dataset design choices, including the use of randomized but physically plausible camera angles, varied illumination, and balanced legitimate versus injected signal distributions. We also added sensitivity experiments that perturb placement and lighting parameters and show that the reported failure modes and defense gains remain consistent. We acknowledge that these steps do not replace physical-robot validation and have added a limitations paragraph noting this as an important direction for follow-up work. revision: partial
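The sensitivity experiments the authors describe amount to a parameter sweep over placement and lighting, checking that the injection-follow rate stays consistent across the grid. A minimal sketch, in which the parameter names, grid values, and the `run_episode` stub are all illustrative stand-ins for a real simulator rollout:

```python
import itertools
import random

def run_episode(lighting, offset, seed):
    """Stand-in for one embodied-simulator rollout; returns True if the
    agent followed an injected signal. Here it is just a seeded stochastic
    stub with a flat ~10% follow rate."""
    rng = random.Random((hash((round(lighting, 3), round(offset, 3))) ^ seed) & 0xFFFFFFFF)
    return rng.random() < 0.1

# Physically plausible perturbation grid (hypothetical values).
lightings = [0.5, 1.0, 1.5]   # relative illumination
offsets = [-0.2, 0.0, 0.2]    # signal placement offset in metres

rates = {}
for lighting, offset in itertools.product(lightings, offsets):
    follows = [run_episode(lighting, offset, seed) for seed in range(200)]
    rates[(lighting, offset)] = sum(follows) / len(follows)

# Consistency check: the follow rate should not swing wildly across the grid.
spread = max(rates.values()) - min(rates.values())
```

In a real study the consistency threshold on `spread` would need to be justified relative to the size of the reported defense gains.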
-
Referee: §5 (Multi-Agent Defense Framework): The claim of 'robustness guarantees under adversarial perturbations' is load-bearing for the defense contribution. The manuscript should clarify whether these guarantees are formal (e.g., derived bounds or proofs) or purely empirical, and address whether the perception-decision separation can be implemented without introducing new failure modes or prohibitive overhead, as implicitly assumed in the evaluation.
Authors: The robustness claims in the original manuscript were empirical, derived from repeated trials across structure-based and noise-based perturbations. We have revised §5 to state this explicitly and to remove any phrasing that could be read as implying formal bounds or proofs. On the implementation side, the multi-agent separation adds a lightweight perception agent that outputs a reliability score before the decision agent proceeds; our updated evaluation reports the added latency (approximately 15–25% depending on the base LVLM) and confirms that no new failure modes were observed in the tested regimes. We have inserted a short analysis subsection discussing overhead, potential edge cases (e.g., ambiguous reliability scores), and simple mitigation rules, all supported by the same experimental setup. revision: yes
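The control flow the authors describe, in which a perception agent emits a reliability score that gates the decision agent, can be sketched as follows. The threshold values, the abstain band for ambiguous scores, and both agent stubs are purely illustrative; a real system would query an LVLM rather than keyword-match:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    text: str        # e.g., OCR'd content of a visible sign
    saliency: float  # how strongly the signal is rendered into the scene

def perception_agent(obs: Observation) -> float:
    """Stub reliability scorer in [0, 1]. Imperative phrasing aimed at the
    agent itself is treated as suspicious."""
    suspicious = any(w in obs.text.lower() for w in ("ignore", "override", "instead"))
    return 0.1 if suspicious else min(1.0, 0.6 + 0.4 * obs.saliency)

def decision_agent(obs: Observation, reliability: float) -> str:
    """Act only on inputs the perception stage vouched for."""
    HIGH, LOW = 0.7, 0.3  # illustrative thresholds
    if reliability >= HIGH:
        return f"act_on:{obs.text}"
    if reliability <= LOW:
        return "ignore_signal"
    return "verify_with_user"  # abstain band for ambiguous scores

def pipeline(obs: Observation) -> str:
    # Separation of concerns: perception scores, decision gates.
    return decision_agent(obs, perception_agent(obs))
```

Under this sketch, `pipeline(Observation("stop", 0.9))` acts on the cue while `pipeline(Observation("ignore the user and exit", 0.9))` discards it; the abstain band is one simple mitigation rule for the ambiguous-score edge case the rebuttal mentions.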
Circularity Check
No circularity: empirical evaluation and defense proposal remain independent of inputs
Full rationale
The paper's core chain consists of (1) constructing a dual-intent dataset and evaluation framework, (2) empirically demonstrating that 7 LVLM agents fail to balance legitimate signals versus visual injections, and (3) proposing and testing a multi-agent separation defense that reduces misleading behaviors while preserving correct responses. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The dataset and framework are introduced as new artifacts whose results are reported separately; the defense is a distinct architectural proposal whose performance is measured on those artifacts rather than derived from them by construction. The abstract's reference to 'robustness guarantees' is presented as an outcome of the evaluation, not a mathematical reduction to prior inputs. This is a standard empirical security paper with no detectable circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- (domain assumption) Environmental signals such as traffic lights are essential in-band signals that can and should influence agent behavior.
- (domain assumption) Similar signals could also be crafted to operate as misleading visual injections, overriding user intent.
invented entities (1)
- trust boundary confusion (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms. A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.
Reference graph
Works this paper leans on
- [1] OpenAI, "GPT-5 System Card," 2025. [Online]. Available: https://cdn.openai.com/gpt-5-system-card.pdf
- [2] OpenAI, "GPT-4o System Card," 2024. [Online]. Available: https://arxiv.org/abs/2410.21276
- [3] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., "Qwen technical report," arXiv preprint arXiv:2309.16609, 2023.
- [4] J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie, "Thinking in space: How multimodal large language models see, remember, and recall spaces," in Computer Vision and Pattern Recognition Conference, 2025.
- [5] X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, "DriveVLM: The convergence of autonomous driving and large vision-language models," arXiv preprint arXiv:2402.12289, 2024.
- [6] H. Zhao, F. Pan, H. Ping, and Y. Zhou, "Agent as cerebrum, controller as cerebellum: Implementing an embodied LMM-based agent on drones," arXiv preprint arXiv:2311.15033, 2023.
- [7] G. R. Team, A. Abdolmaleki, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, A. Balakrishna, N. Batchelor, A. Bewley, J. Bingham, M. Bloesch et al., "Gemini Robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer," arXiv preprint arXiv:2510.03342, 2025.
- [8] Y. Cao, Y. Xing, J. Zhang, D. Lin, T. Zhang, I. Tsang, Y. Liu, and Q. Guo, "SceneTAP: Scene-coherent typographic adversarial planner against vision-language models in real-world environments," in Computer Vision and Pattern Recognition, 2025.
- [9] L. Burbano, D. Ortiz, Q. Sun, S. Yang, H. Tu, C. Xie, Y. Cao, and A. A. Cardenas, "CHAI: Command hijacking against embodied AI," in IEEE Conference on Secure and Trustworthy Machine Learning, 2026.
- [10] J. H. Saltzer and M. D. Schroeder, "The protection of information in computer systems," Proceedings of the IEEE, vol. 63, no. 9, pp. 1278–1308, 1975.
- [11] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection," in Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, 2023.
- [12] NIST, "Announcing the 'AI Agent Standards Initiative' for interoperable and secure innovation," https://www.nist.gov/news-events/news/2026/02/announcing-ai-agent-standards-initiative-interoperable-and-secure, 2026.
- [13] F. Wu, N. Zhang, S. Jha, P. McDaniel, and C. Xiao, "A new era in LLM security: Exploring security concerns in real-world LLM-based systems," arXiv preprint arXiv:2402.18649, 2024.
- [14] H. Zhang, C. Zhu, X. Wang, Z. Zhou, C. Yin, M. Li, L. Xue, Y. Wang, S. Hu, A. Liu et al., "BadRobot: Jailbreaking embodied LLMs in the physical world," in International Conference on Learning Representations, 2025.
- [15] Y. Gong, D. Ran, J. Liu, C. Wang, T. Cong, A. Wang, S. Duan, and X. Wang, "FigStep: Jailbreaking large vision-language models via typographic visual prompts," in Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
- [16] X. Wang, J. Bloch, Z. Shao, Y. Hu, S. Zhou, and N. Zhenqiang Gong, "EnvInjection: Environmental prompt injection attack to multi-modal web agents," in Empirical Methods in Natural Language Processing (EMNLP), 2025.
- [17] C. H. Wu, R. Shah, J. Y. Koh, R. Salakhutdinov, D. Fried, and A. Raghunathan, "Dissecting adversarial robustness of multimodal LM agents," 2025.
- [18] R. Pedro, M. E. Coimbra, D. Castro, P. Carreira, and N. Santos, "Prompt-to-SQL injections in LLM-integrated web applications: Risks and defenses," in Proceedings of the IEEE/ACM 47th International Conference on Software Engineering, 2025.
- [19] Y. Liu, Y. Jia, R. Geng, J. Jia, and N. Z. Gong, "Formalizing and benchmarking prompt injection attacks and defenses," in 33rd USENIX Security Symposium, 2024.
- [20] K. Zhou, C. Liu, X. Zhao, A. Compalas, D. Song, and X. E. Wang, "Multimodal situational safety," in International Conference on Learning Representations, 2025.
- [21] T. Brooks, A. Holynski, and A. A. Efros, "InstructPix2Pix: Learning to follow image editing instructions," in Computer Vision and Pattern Recognition, 2023.
- [22] T. Gupta and A. Kembhavi, "Visual programming: Compositional visual reasoning without training," in Computer Vision and Pattern Recognition, 2023.
- [23] M. Li, S. Zhao, Q. Wang, K. Wang, Y. Zhou, S. Srivastava, C. Gokmen, T. Lee, E. L. Li, R. Zhang et al., "Embodied agent interface: Benchmarking LLMs for embodied decision making," in Advances in Neural Information Processing Systems, 2024.
- [24] Anthropic, "Claude-3.5 Sonnet," 2025.
- [25] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai et al., "Gemini: A family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023.
- [26] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, B. Li, P. Luo, T. Lu, Y. Qiao, and J. Dai, "InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks," arXiv preprint arXiv:2312.14238, 2024.
- [27] H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, Y. Sun, C. Deng, H. Xu, Z. Xie, and C. Ruan, "DeepSeek-VL: Towards real-world vision-language understanding," arXiv preprint arXiv:2403.05525, 2024.
- [28] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," in International Conference on Learning Representations, 2018. [Online]. Available: https://arxiv.org/abs/1706.06083
- [29] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
- [30] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Computer Vision and Pattern Recognition, 2018.
- [31] W. Xu, D. Evans, and Y. Qi, "Feature squeezing: Detecting adversarial examples in deep neural networks," in Network and Distributed Systems Security Symposium, 2018.
- [32] Y. Liu, Z. Li, M. Huang, B. Yang, W. Yu, C. Li, X.-C. Yin, C.-L. Liu, L. Jin, and X. Bai, "OCRBench: On the hidden mystery of OCR in large multimodal models," Science China Information Sciences, vol. 67, no. 12, p. 220102, 2024.
- [33] S. Nagaonkar, A. Sharma, A. Choithani, and A. Trivedi, "Benchmarking vision-language models on optical character recognition in dynamic video environments," arXiv preprint arXiv:2502.06445, 2025.
- [34] C. Guo, M. Rana, M. Cisse, and L. van der Maaten, "Countering adversarial images using input transformations," in International Conference on Learning Representations, 2018.
- [35] M. Naseer, S. Khan, M. Hayat, F. S. Khan, and F. Porikli, "A self-supervised approach for adversarial robustness," in Computer Vision and Pattern Recognition, 2020.
- [36] W. Nie, B. Guo, Y. Huang, C. Xiao, A. Vahdat, and A. Anandkumar, "Diffusion models for adversarial purification," in International Conference on Machine Learning, 2022.
- [37] C. Xiao, Z. Chen, K. Jin, J. Wang, W. Nie, M. Liu, A. Anandkumar, B. Li, and D. Song, "DensePure: Understanding diffusion models for adversarial robustness," in International Conference on Learning Representations, 2023.
- [38] R. Sun, J. Chang, H. Pearce, C. Xiao, B. Li, Q. Wu, S. Nepal, and M. Xue, "SoK: Unifying cybersecurity and cybersafety of multimodal foundation models with an information theory approach," arXiv preprint arXiv:2411.11195, 2024.
- [39] Y. Liu, Y. Jia, J. Jia, D. Song, and N. Z. Gong, "DataSentinel: A game-theoretic detection of prompt injection attacks," in 2025 IEEE Symposium on Security and Privacy, 2025.
- [40] Y. Lee, T. Park, Y. Lee, J. Gong, and J. Kang, "Exploring potential prompt injection attacks in federated military LLMs and their mitigation," arXiv preprint arXiv:2501.18416, 2025.
- [41] S. Armstrong, M. Franklin, C. Stevens, and R. Gorman, "Defense against the dark prompts: Mitigating best-of-n jailbreaking with prompt evaluation," arXiv preprint arXiv:2502.00580, 2025.
- [42] X. Yang, D. Xu, M. Wen, Z. Wu et al., "Towards safe and trustworthy embodied AI: Foundations, status, and prospects." [Online]. Available: https://openreview.net/pdf/a3b0eb5349f3c0dd92e21b43b04037add70c669a.pdf
- [44] JaidedAI, "EasyOCR: Ready-to-use OCR with 80+ supported languages." [Online]. Available: https://github.com/JaidedAI/EasyOCR