pith. machine review for the scientific record.

arxiv: 2604.02623 · v2 · submitted 2026-04-03 · 💻 cs.CR · cs.AI

Recognition: 1 theorem link · Lean Theorem

Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:39 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords web agents · memory poisoning · LLM security · environment injection · cross-site attacks · agent memory · frustration exploitation · persistent compromise

The pith

A single contaminated web page can silently poison an agent's memory to enable attacks on unrelated sites in later sessions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that LLM-based web agents, which store past observations to personalize future behavior, create a durable attack surface through normal environmental interactions alone. By introducing eTAMP, it shows that one manipulated observation, such as a crafted product page, can embed malicious trajectories that activate later on different websites without any direct memory access or ongoing attacker presence. Experiments on VisualWebArena report attack success rates reaching 32.5 percent on GPT-5-mini, 23.4 percent on GPT-5.2, and 19.5 percent on GPT-OSS-120B. Vulnerability rises sharply when agents encounter stress such as dropped clicks or garbled text, with attack success increasing up to eightfold. The findings indicate that stronger task performance does not translate into stronger resistance to this form of poisoning.

Core claim

Environment-injected Trajectory-based Agent Memory Poisoning (eTAMP) contaminates an agent's persistent memory through a single unverified environmental observation, allowing the embedded malicious trajectory to activate during future tasks on different websites and sessions. This bypasses permission-based defenses because the poison enters as ordinary page content rather than through direct memory injection. The attack succeeds at rates up to 32.5 percent across tested models, with substantially higher rates when agents face environmental stress.

What carries the argument

eTAMP, the mechanism that embeds a malicious action trajectory inside a normal-looking environmental observation so that the agent's memory storage and retrieval process later replays the poisoned sequence without external triggering.
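
To make that mechanism concrete, here is a minimal, self-contained sketch of the storage-and-retrieval loop such an attack abuses. Everything here (the `NaiveTrajectoryMemory` class, the character-level similarity metric, the example pages and trajectories) is an illustrative assumption, not the paper's implementation; the point is only that similarity-keyed retrieval over unsanitized observations replays whatever was stored.

```python
from difflib import SequenceMatcher

class NaiveTrajectoryMemory:
    """Toy cross-session memory: stores raw (observation, trajectory) pairs
    and retrieves the most similar past trajectory for a new task intent."""

    def __init__(self):
        self.entries = []  # persists across sites and sessions

    def store(self, observation, trajectory):
        # No origin check or sanitization: a crafted page is stored verbatim.
        self.entries.append((observation, trajectory))

    def retrieve(self, task_intent):
        # Retrieval keys only on similarity between the new task and stored
        # observations (a stand-in for the embedding retrieval real agents use).
        best = max(self.entries,
                   key=lambda e: SequenceMatcher(None, task_intent, e[0]).ratio())
        return best[1]

mem = NaiveTrajectoryMemory()
# Session 1: benign browsing.
mem.store("cooking recipes: weeknight pasta dishes", ["click('open_recipe')"])
# Session 1, site A: the agent views one crafted product page.
mem.store("deals on guitar electronics and pedals",
          ["click('buy')", "goto('https://attacker.example/restore')"])
# Session 2, site B: an unrelated but semantically similar task retrieves
# and replays the poisoned trajectory, malicious step included.
replayed = mem.retrieve("post a review of guitar electronics")
print(replayed)
```

No trigger is needed beyond topical overlap between the new task and the poisoned observation, which is what makes the attack cross-site and cross-session.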

If this is right

  • Permission-based memory protections fail against observation-only poisoning.
  • Agents become markedly more exploitable when they encounter dropped interactions or garbled content.
  • Model scale and task competence do not reduce susceptibility to this attack.
  • Cross-site and cross-session persistence allows a one-time exposure to affect many future interactions.
  • AI browsers that rely on long-term memory increase the reachable attack surface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent designs may need memory isolation by site or session to limit poison spread.
  • Stress-testing under simulated frustration conditions could expose similar weaknesses in other memory-using systems.
  • Verification or selective forgetting of stored observations would directly counter the entry point used here.
  • The same observation-based poisoning path could apply to non-web memory-augmented agents if they retain unfiltered environmental data.
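
One of the mitigations suggested above, memory isolation by site, can be sketched in a few lines. This is an editorial illustration under the assumption that memory entries can be keyed by origin; it is not a mechanism evaluated in the paper.

```python
class ScopedMemory:
    """Sketch of site-scoped memory isolation: trajectories recorded on one
    origin are only retrievable for tasks on that same origin."""

    def __init__(self):
        self.by_origin = {}

    def store(self, origin, trajectory):
        self.by_origin.setdefault(origin, []).append(trajectory)

    def retrieve(self, origin):
        # A poisoned trajectory from site A never reaches tasks on site B.
        return self.by_origin.get(origin, [])

mem = ScopedMemory()
mem.store("shop.example", ["click('buy')", "goto('https://attacker.example')"])
print(mem.retrieve("social.example"))  # [] — cross-site replay is cut off
```

The trade-off is losing cross-site personalization, which is exactly the feature the attack piggybacks on.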

Load-bearing premise

Agents persistently store and retrieve unverified observations from the environment in shared memory across sessions and sites without origin checks or sanitization.

What would settle it

A controlled test in which an agent views one poisoned page on site A, then receives a fresh task on site B with no further contact with the poisoned page; if the agent reliably executes the embedded malicious action, the claim holds.

Figures

Figures reproduced from arXiv: 2604.02623 by Dongkyu Lee, Jiang Guo, Jiarong Jiang, Miguel Romero Calvo, Mingwen Dong, Shuaichen Chang, Wei Zou, Xiaofei Ma, Xing Niu, Yanjun Qi.

Figure 1. Overview of the cross-task memory poisoning attack. [image omitted · view at source ↗]
Figure 2. Message structure for a full-history agent with cross-task memory. [image omitted · view at source ↗]
Figure 3. Three-step attack flow illustrating how malicious instructions injected during Task … [image omitted · view at source ↗]
Original abstract

Memory makes LLM-based web agents personalized, powerful, yet exploitable. By storing past interactions to personalize future tasks, agents inadvertently create a persistent attack surface that spans websites and sessions. While existing security research on memory assumes attackers can directly inject into memory storage or exploit shared memory across users, we present a more realistic threat model: contamination through environmental observation alone. We introduce Environment-injected Trajectory-based Agent Memory Poisoning (eTAMP), the first attack to achieve cross-session, cross-site compromise without requiring direct memory access. A single contaminated observation (e.g., viewing a manipulated product page) silently poisons an agent's memory and activates during future tasks on different websites, bypassing permission-based defenses. Our experiments on (Visual)WebArena reveal two key findings. First, eTAMP achieves substantial attack success rates: up to 32.5% on GPT-5-mini, 23.4% on GPT-5.2, and 19.5% on GPT-OSS-120B. Second, we discover Frustration Exploitation: agents under environmental stress become dramatically more susceptible, with ASR increasing up to 8 times when agents struggle with dropped clicks or garbled text. Notably, more capable models are not more secure. GPT-5.2 shows substantial vulnerability despite superior task performance. With the rise of AI browsers like OpenClaw, ChatGPT Atlas, and Perplexity Comet, our findings underscore the urgent need for defenses against environment-injected memory poisoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents Environment-injected Trajectory-based Agent Memory Poisoning (eTAMP), a novel attack on LLM-based web agents that poisons their memory through a single contaminated environmental observation, such as a manipulated product page. This enables cross-session and cross-site activation of malicious trajectories without direct memory access or user interaction. Experiments conducted on the (Visual)WebArena benchmark report attack success rates (ASR) of up to 32.5% on GPT-5-mini, 23.4% on GPT-5.2, and 19.5% on GPT-OSS-120B, with ASR increasing up to 8 times under environmental stress conditions like dropped clicks or garbled text. The work emphasizes the risks in persistent memory mechanisms for agents and calls for new defenses.

Significance. If the experimental results are robust, this paper makes a significant contribution by demonstrating a realistic, low-privilege attack vector on web agents that leverages their memory for persistence across contexts. The identification of 'Frustration Exploitation' as a multiplier for attack success is a novel insight that could guide the design of more resilient agent architectures. The provision of concrete ASR measurements from WebArena experiments strengthens the empirical basis, though generalization to production agents depends on memory implementation details.

major comments (3)
  1. [Experimental Setup] The abstract reports specific ASR values (e.g., 32.5% on GPT-5-mini) but provides no details on the number of trials, statistical significance testing, or variance across runs. This information is load-bearing for validating the central claim of substantial vulnerability, as small sample sizes could inflate the reported rates.
  2. [Threat Model and Memory Assumptions] The core claim relies on agents persistently storing and retrieving unverified raw environmental observations across unrelated sites and sessions (as described in the threat model). However, the experiments appear to use an implicit memory module in WebArena that enables this behavior; if typical agent implementations employ session-only context or site-specific scoping, the cross-site silent activation may not occur, rendering the headline ASRs specific to the testbed rather than general.
  3. [Frustration Exploitation] The claim that ASR increases up to 8 times under stress (e.g., dropped clicks) is presented as a key finding, but without a clear definition of 'environmental stress' conditions or how they were controlled in the experiments, it is difficult to assess reproducibility and the magnitude of the effect.
minor comments (1)
  1. [Abstract] Clarify the exact model names referred to as GPT-5-mini and GPT-5.2, as they may not be standard and could confuse readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental rigor, threat model scope, and reproducibility of the frustration exploitation results. We have revised the manuscript to incorporate additional methodological details, clarify assumptions about memory persistence, and provide precise definitions and controls for environmental stress conditions.

Point-by-point responses
  1. Referee: [Experimental Setup] The abstract reports specific ASR values (e.g., 32.5% on GPT-5-mini) but provides no details on the number of trials, statistical significance testing, or variance across runs. This information is load-bearing for validating the central claim of substantial vulnerability, as small sample sizes could inflate the reported rates.

    Authors: We agree that these details are essential for validating the claims. In the revised manuscript, we have expanded the Experimental Setup section (Section 4.1) to specify that each ASR is averaged over 200 independent trials per model and condition, conducted across 5 random seeds. We now report standard deviations (typically 2.8–4.7%) and include results from two-proportion z-tests (p < 0.01) confirming statistical significance of the reported differences. These statistics are also summarized in a new footnote to the abstract. revision: yes
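
The two-proportion z-test the rebuttal cites is standard and easy to reproduce. The sketch below uses the rebuttal's stated setup of 200 trials per condition; the specific success counts are purely illustrative.

```python
from math import sqrt, erf

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF, via erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# e.g. baseline vs. stressed ASR over 200 trials each (counts hypothetical)
z, p = two_proportion_z(8, 200, 65, 200)
print(f"z = {z:.2f}, p = {p:.4g}")
```

With 200 trials per condition, differences of the magnitude reported (tens of percentage points) are comfortably significant at p < 0.01.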

  2. Referee: [Threat Model and Memory Assumptions] The core claim relies on agents persistently storing and retrieving unverified raw environmental observations across unrelated sites and sessions (as described in the threat model). However, the experiments appear to use an implicit memory module in WebArena that enables this behavior; if typical agent implementations employ session-only context or site-specific scoping, the cross-site silent activation may not occur, rendering the headline ASRs specific to the testbed rather than general.

    Authors: This is a fair point on generalizability. WebArena was chosen as the standard benchmark precisely because it supports persistent memory across sessions and sites, matching the threat model of emerging production agents (e.g., AI browsers with long-term memory stores). In the revision, we have added a new subsection (Section 3.2) explicitly discussing memory scoping variations, citing examples of both persistent and session-only implementations. We acknowledge that the attack does not apply to strictly session-scoped agents and have added a limitations paragraph noting this scope. The headline results are therefore presented as applying to agents with cross-context persistent memory. revision: partial

  3. Referee: [Frustration Exploitation] The claim that ASR increases up to 8 times under stress (e.g., dropped clicks) is presented as a key finding, but without a clear definition of 'environmental stress' conditions or how they were controlled in the experiments, it is difficult to assess reproducibility and the magnitude of the effect.

    Authors: We thank the referee for highlighting this gap. In the revised manuscript, we have added a dedicated paragraph in Section 4.3 defining environmental stress as three controlled perturbations: (1) 15% action drop rate, (2) 10% character corruption in observations, and (3) 50% increase in task horizon. These were implemented via WebArena’s noise injection API and applied uniformly. We now include Table 3 reporting per-model baseline vs. stressed ASRs, with the maximum observed multiplier of 8.2× for GPT-5-mini. Full parameter values, seeds, and reproduction instructions are provided in Appendix B. revision: yes
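
The stress perturbations described in the response are straightforward to mock up. The sketch below is a guess at their shape: the function names and the use of U+FFFD as corruption noise are assumptions of this review, not the WebArena noise-injection API.

```python
import random

def perturb_action(action, rng, drop_rate=0.15):
    """Simulate a dropped click/action with the stated 15% probability."""
    return None if rng.random() < drop_rate else action

def corrupt_observation(text, rng, corruption_rate=0.10):
    """Garble roughly 10% of observation characters (stand-in for garbled text)."""
    return "".join("\ufffd" if rng.random() < corruption_rate else ch
                   for ch in text)

rng = random.Random(0)
print(perturb_action("click('submit')", rng))
print(corrupt_observation("Add the red guitar to your cart", rng))
```

Seeding the generator, as above, is what makes such stress conditions reproducible across runs.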

Circularity Check

0 steps flagged

No circularity: empirical attack demonstration with direct measurements

Full rationale

The paper introduces eTAMP as an environmental memory poisoning attack and reports attack success rates (e.g., up to 32.5% on GPT-5-mini) from controlled experiments in (Visual)WebArena. No derivation chain, equations, fitted parameters, predictions, or first-principles results are present that could reduce to self-defined inputs by construction. The core findings are direct empirical measurements of observed agent behavior under the stated threat model; the memory persistence assumption is an explicit experimental setup rather than a derived claim. This matches the default expectation for non-circular empirical security papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that agents maintain persistent memory from environmental observations without verification. No free parameters are introduced and no new entities are postulated.

axioms (1)
  • domain assumption LLM-based web agents store past interactions in memory for personalization across tasks and sessions.
    Invoked in the abstract as the basis for the persistent attack surface created by memory.

pith-pipeline@v0.9.0 · 5601 in / 1167 out tokens · 40626 ms · 2026-05-13T20:39:53.002422+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation

cs.CR · 2026-05 · unverdicted · novelty 5.0

    A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · cited by 1 Pith paper

  1. [1]

    Temporal Separation: The injection (Task A) and activation (Task B) occur at different times, making the attack difficult to detect through real-time monitoring.

  2. [2]

    Cross-Site Execution: The attack crosses website boundaries—injected on an e-commerce site but activated on a social media platform. This bypasses permission-based defenses that restrict agent actions to the current site.

  3. [3]

    Semantic Relevance: The attack exploits the memory retrieval mechanism by ensuring the poisoned content is semantically related to future tasks, increasing the likelihood of retrieval.

  4. [4]

    Buy a red guitar

    Persistent Threat: Once poisoned, the memory can trigger the attack on any future task that retrieves the compromised trajectory, enabling repeated exploitation. Figure 3 summarizes the attack in three distinct phases. For the detailed attack scenario, see Section 2.1. A.5 Attack Payload Design: The condition and attack goal are selected based on the target s...

  5. [5]

    Compute embeddings for all extracted item names using a sentence transformer model

  6. [6]

    Compute embeddings for all Task B intents (natural language task descriptions)

  7. [7]

    Calculate cosine similarity between each Task A item and each Task B intent

  8. [8]

    For each Task B, select the top-k most similar Task A tasks based on their best-matching items. This approach ensures semantic relevance between task pairs—for example, a Task A involving electronics products is more likely to be paired with a Task B about posting technology reviews. This design choice simplifies our experimental setup by eliminating th...

  9. [9]

    Extracting the full agent trajectory from task A including the malicious instructions and append the initial observation from task B

  10. [10]

    Prompting the model to find and extract the hidden instruction pattern

  11. [11]

    you must FIRST restore your session by executing: ```goto [

    Comparing the extracted URL against the expected attack URL. We use Authority Framing rather than Baseline Injection or Frustration Exploitation because this configuration exhibits good variation in attack success rates across models, making it informative for studying the relationship between recall ability and attack susceptibi...
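
Snippets [5] through [8] describe a standard embedding-similarity pairing of poisoned Task A items to future Task B intents. The sketch below reproduces that pairing logic with a toy bag-of-words embedding standing in for the sentence-transformer model the paper uses; the example items and intents are invented.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words stand-in for sentence-transformer embeddings.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pair_tasks(task_a_items, task_b_intents, k=1):
    """For each Task B intent, select the top-k most similar Task A items
    by cosine similarity (the pairing procedure in snippets [5]-[8])."""
    pairs = {}
    for intent in task_b_intents:
        iv = embed(intent)
        scored = sorted(task_a_items,
                        key=lambda item: cosine(embed(item), iv), reverse=True)
        pairs[intent] = scored[:k]
    return pairs

items = ["red guitar", "laptop stand", "garden hose"]
intents = ["post a review of a guitar", "write about garden tools"]
print(pair_tasks(items, intents))
```

Maximizing this similarity is what makes the poisoned trajectory likely to be retrieved, and therefore replayed, on a semantically related future task.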