HLL is a new benchmark that evaluates eight frontier multimodal agents on closed-loop interactive CAPTCHA solving, showing sharp performance drops under realism stressors and trace validation.
hub Canonical reference
Adversarial at- tacks on multimodal agents
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 5representative citing papers
Agent-native LLMs are substantially more vulnerable to adversarial instructions arriving in tool descriptions than user messages (with the pattern reversing for general-purpose models and inverting again for tool outputs), as quantified by the new Safety Asymmetry Score across six models and three a
HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent systems.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Adversarial hubs can be generated to be retrieved as top-1 for over 84% of test queries in text-to-image retrieval, far exceeding natural hubs.
Frontier AI agents frequently violate corrigibility by overriding interruptions in benign computer-use tasks, with misalignment increasing alongside model capability.
SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
Visual Inception poisons images to hijack long-term memory in agentic recommenders and steer planning, while CognitiveGuard reduces success to about 10% via perceptual sanitization and reasoning verification.
Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.
WARD is a guard model trained on 177K web samples and adversarially hardened via attacker-guard co-evolution to achieve high recall on prompt injections with low false positives and no added latency.
Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp robustness gap.
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
citing papers explorer
-
HLL: Can Agents Cross Humanity's Last Line of Verification?
HLL is a new benchmark that evaluates eight frontier multimodal agents on closed-loop interactive CAPTCHA solving, showing sharp performance drops under realism stressors and trace validation.
-
Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models
Agent-native LLMs are substantially more vulnerable to adversarial instructions arriving in tool descriptions than user messages (with the pattern reversing for general-purpose models and inverting again for tool outputs), as quantified by the new Safety Asymmetry Score across six models and three a
-
Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning
HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent systems.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Adversarial Hubness in Multi-Modal Retrieval
Adversarial hubs can be generated to be retrieved as top-1 for over 84% of test queries in text-to-image retrieval, far exceeding natural hubs.
-
ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use
Frontier AI agents frequently violate corrigibility by overriding interruptions in benign computer-use tasks, with misalignment increasing alongside model capability.
-
SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents
SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
-
How Adversarial Environments Mislead Agentic AI?
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
-
Visual Inception: Compromising Long-term Planning in Agentic Recommenders via Multimodal Memory Poisoning
Visual Inception poisons images to hijack long-term memory in agentic recommenders and steer planning, while CognitiveGuard reduces success to about 10% via perceptual sanitization and reasoning verification.
-
Mobile GUI Agents under Real-world Threats: Are We There Yet?
Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.
-
WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections
WARD is a guard model trained on 177K web samples and adversarially hardened via attacker-guard co-evolution to achieve high recall on prompt injections with low false positives and no added latency.
-
Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing
Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp robustness gap.
-
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
-
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.