JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.
hub Mixed citations
Defending Against Indirect Prompt Injection Attacks With Spotlighting
Mixed citation behavior. Most common role is background (62%).
abstract
Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
Heimdallr detects LLM-induced security risks in GitHub CI workflows by normalizing them into an LLM-Workflow Property Graph and combining triggerability analysis with LLM-assisted dataflow summarization, achieving over 0.91 F1 on threat detection in evaluation.
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
Self-generated QA supervision for language models is fragile due to non-uniform question selection and instruction compliance during answering, with mitigations that reduce compliance from 88% to 13%.
AutoDojo adaptively optimizes IPI attacks to bypass defenses, recovering substantial ASR on action-open tasks where static attacks fail.
IterInject is a new iterative attack method that uses outcome diagnosis and history-conditioned optimization to create indirect prompt injections, outperforming baselines on AgentDojo and InjectAgent and achieving full success on 5 of 9 targets against a production coding agent.
The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.
The work introduces and partially evaluates seven cross-domain prompt injection detectors, reporting F1 gains on benchmarks like deepset/prompt-injections and indirect-injection sets via local alignment, stylometry, and fatigue tracking.
The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.
Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.
Context compaction silently drops governance constraints in LLM agents, raising policy violation rates from 0% to 30% on average, with a proposed pinning mitigation restoring compliance.
PI-Hunter automates red-teaming of LLM agents by generating and iteratively evolving source-aware test cases to induce retrieval of embedded malicious instructions from external environments.
GT-MCP coordinates three LLM agents via a trust function and rollback to bound contextual drift and block adversarial injections in multi-turn interactions.
POISE is a stealthy skill-poisoning attack achieving 89.3% ASR on Skill-Inject by blending a compressed trigger into contextually appropriate positions in skill bodies, outperforming YAML and random-placement baselines while evading static scanners.
A diagnostic framework localizes instruction hierarchy failures in LLMs into identification, resolution, and realization, while self-monitors reduce non-compliance by 81-99%.
Activation probes, calibrated honeytokens, and multi-turn leakage accounting detect credential exfiltration attempts in LLM agents with high accuracy in controlled open-model tests.
PsychoPass shows adversarial LLM conversations exhibit an early geometric fingerprint in representation space that persists after removing length confounds and is detectable from short prefixes.
Prompt injection vulnerability in tool-augmented LLMs is a model-surface interaction rather than a fixed channel property; the same payload inverts success rates across models, and adaptive attack rate exceeds single-surface baselines by 9.1 pp on average.
citing papers explorer
No citing papers match the current filters.