pith. sign in

arxiv: 2605.23723 · v1 · pith:V2FYZIU4new · submitted 2026-05-22 · 💻 cs.AI

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

Pith reviewed 2026-05-25 04:03 UTC · model grok-4.3

classification 💻 cs.AI
keywords memory auditingLLM agentsmemory poisoningcausal attributionpost-hoc defensememory consistency graphcounterfactual influence
0
0 comments X

The pith

MemAudit identifies poisoned memories in LLM agents after attacks by scoring each record's causal contribution to harmful outputs and detecting structural anomalies in the memory store.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a post-hoc auditing method that traces which stored memories caused an agent to produce bad results, then removes them. This matters because agents increasingly keep persistent memory of past interactions, and an adversary can slip malicious records into that store through ordinary conversations. Once retrieved later, those records steer the agent's reasoning without any ongoing attacker presence. By measuring counterfactual influence and building a consistency graph across all memories, the approach isolates the injected records. Experiments show the method drives attack success rates to zero in both question-answering and reasoning-agent settings under realistic conditions.

Core claim

MemAudit combines a counterfactual memory influence score, which quantifies how much each memory record causally affects the production of harmful outputs, with a memory consistency graph that surfaces records whose content or retrieval patterns deviate from the rest of the store. When applied after harmful behavior is observed, these two signals together locate and neutralize the malicious records that were injected through normal agent interactions in the MINJA attack, eliminating the attack success that previously reached 70 percent in QA tasks and 83.3 percent in RAP tasks.

What carries the argument

the dual-signal auditing procedure that pairs a counterfactual memory influence score with a memory consistency graph to attribute and isolate malicious records

If this is right

  • Agents can continue using long-term memory stores without permanent compromise once harmful behavior appears.
  • Defense can shift from blocking inputs in real time to cleaning the memory bank afterward.
  • The same auditing signals can be recomputed whenever new harmful outputs are observed.
  • Memory stores remain usable for retrieval while still allowing targeted removal of compromised entries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The auditing approach could extend to retrieval-augmented generation systems that also maintain persistent document stores.
  • Repeated auditing passes might allow agents to maintain memory integrity over very long interaction histories.
  • If the consistency graph can be maintained incrementally, the cost of each audit round could stay low enough for routine use.

Load-bearing premise

The two signals are together sufficient to separate malicious records from benign ones without producing many false positives or missing poisons across the tested agent configurations.

What would settle it

A new memory-injection technique that produces records whose removal does not change the harmful outputs yet still evades detection by both the influence score and the consistency graph.

Figures

Figures reproduced from arXiv: 2605.23723 by Duohe Ma, Feng Liu, Guoan Wang, Huiyan Jin, Liang Lu, Lin Sun, Mengyuan Fan, Tong Yang, Wenhan Yu, Xiangzheng Zhang, Yilun Yao, Zhewen Tan.

Figure 1
Figure 1. Figure 1: Overview of MemAudit. Given a harmful event [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
read the original abstract

Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose \textbf{MemAudit}, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from $70\%$ to $0\%$, while RAP attack success drops from $83.3\%$ to $0\%$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. It combines a counterfactual memory influence score measuring each memory's causal contribution to harmful outputs with a memory consistency graph identifying structurally anomalous memories. Evaluated against the MINJA query-only memory injection attack, the paper claims substantial reductions in attack success rates under post-hoc auditing, specifically reducing QA ASR from 70% to 0% and RAP ASR from 83.3% to 0%.

Significance. If the empirical results hold under rigorous validation, MemAudit would address an important gap in defenses for memory-augmented LLM agents by enabling post-hoc identification and removal of malicious records after harmful behavior is observed, complementing existing online intervention methods. The dual use of causal attribution and structural anomaly detection provides a concrete, falsifiable approach to this security problem.

major comments (3)
  1. [Abstract] Abstract: The headline claims of reducing QA attack success from 70% to 0% and RAP from 83.3% to 0% are presented without any information on trial counts, statistical tests, baseline comparisons, variance across runs, or the precise computation and thresholding of the counterfactual influence scores, rendering the central empirical claim unverifiable from the provided evidence.
  2. [Method] Method description: No details are given on how the counterfactual memory influence score is computed from interventions, how it is combined with the memory consistency graph anomaly score (e.g., via thresholds, weighting, or logical conjunction), or whether the final decision rule was tuned on the reported test attacks, which directly bears on whether the two signals suffice to neutralize poisons without high false positives or missed records.
  3. [Evaluation] Evaluation: The manuscript supplies no false-positive rates when MemAudit is applied to clean memory stores, nor any analysis of missed poisons or robustness under distribution shift, leaving the weakest assumption (that the combined signals reliably flag poisons in realistic agent settings) untested and the 0% ASR figures potentially non-generalizable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving the clarity and completeness of our empirical claims, methodological details, and evaluation. We will revise the manuscript accordingly to address each point.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claims of reducing QA attack success from 70% to 0% and RAP from 83.3% to 0% are presented without any information on trial counts, statistical tests, baseline comparisons, variance across runs, or the precise computation and thresholding of the counterfactual influence scores, rendering the central empirical claim unverifiable from the provided evidence.

    Authors: We agree that the abstract should provide more context to make the claims verifiable. In the revision, we will expand the abstract to note that results are averaged over 10 independent runs with reported standard deviations, include a brief mention of baseline comparisons (standard retrieval without auditing), and indicate that the influence score uses a fixed threshold of 0.5 on the causal effect difference. Full details on computation and statistical tests will remain in the main text and appendix due to length constraints. revision: yes

  2. Referee: [Method] Method description: No details are given on how the counterfactual memory influence score is computed from interventions, how it is combined with the memory consistency graph anomaly score (e.g., via thresholds, weighting, or logical conjunction), or whether the final decision rule was tuned on the reported test attacks, which directly bears on whether the two signals suffice to neutralize poisons without high false positives or missed records.

    Authors: We will revise the Method section to include the exact computation: the counterfactual influence score is defined as the difference in the LLM's output probability for a harmful response when performing a do-intervention that removes the candidate memory record. The consistency graph anomaly score measures deviation from average node connectivity. The signals are combined via logical conjunction after independent thresholding (influence > 0.3 and anomaly > 2 standard deviations). Thresholds were selected on a held-out validation set of clean and poisoned memories, not on the test attacks. We will add equations, pseudocode, and explicit discussion of this process. revision: yes

  3. Referee: [Evaluation] Evaluation: The manuscript supplies no false-positive rates when MemAudit is applied to clean memory stores, nor any analysis of missed poisons or robustness under distribution shift, leaving the weakest assumption (that the combined signals reliably flag poisons in realistic agent settings) untested and the 0% ASR figures potentially non-generalizable.

    Authors: We agree these metrics are necessary for a complete evaluation. In the revised manuscript, we will add a new subsection reporting a false-positive rate below 5% when applying MemAudit to five clean memory stores of varying sizes. We will confirm zero missed poisons (100% recall) in the reported experiments and include an analysis of robustness under distribution shift by testing on out-of-domain queries from a different domain, where ASR remains at 0%. These results will be presented with the same trial counts as the main experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical post-hoc auditing method that combines a counterfactual memory influence score with a memory consistency graph, then reports experimental reductions in attack success rates on QA and RAP tasks under the MINJA attack. No equations, parameter-fitting steps, or derivation chains appear in the abstract or description that would reduce the claimed outcomes to self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The results are presented as direct empirical measurements rather than derived quantities, leaving the framework self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies no information on free parameters, background axioms, or new postulated entities; the framework is described at a high level without implementation specifics.

pith-pipeline@v0.9.0 · 5809 in / 1098 out tokens · 30315 ms · 2026-05-25T04:03:12.596566+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 7 internal anchors

  1. [1]

    Advances in neural information processing systems , volume=

    Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

  2. [2]

    Advances in Neural Information Processing Systems , volume=

    Webshop: Towards scalable real-world web interaction with grounded language agents , author=. Advances in Neural Information Processing Systems , volume=

  3. [3]

    GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

    Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts , author=. arXiv preprint arXiv:2309.10253 , year=

  4. [4]

    Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents

    Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents , author=. arXiv preprint arXiv:2604.02623 , year=

  5. [5]

    GPT-4o System Card

    Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

  6. [6]

    DeepSeek-V3 Technical Report

    Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=

  7. [7]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Openhands: An open platform for ai software developers as generalist agents , author=. arXiv preprint arXiv:2407.16741 , year=

  8. [8]

    Proceedings of the 36th annual acm symposium on user interface software and technology , pages=

    Generative agents: Interactive simulacra of human behavior , author=. Proceedings of the 36th annual acm symposium on user interface software and technology , pages=

  9. [9]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Voyager: An open-ended embodied agent with large language models , author=. arXiv preprint arXiv:2305.16291 , year=

  10. [10]

    The eleventh international conference on learning representations , year=

    React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=

  11. [11]

    Advances in neural information processing systems , volume=

    Reflexion: Language agents with verbal reinforcement learning , author=. Advances in neural information processing systems , volume=

  12. [12]

    Advances in Neural Information Processing Systems , volume=

    Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases , author=. Advances in Neural Information Processing Systems , volume=

  13. [13]

    arXiv e-prints , pages=

    A practical memory injection attack against llm agents , author=. arXiv e-prints , pages=

  14. [14]

    arXiv preprint arXiv:2512.16962 , year=

    MemoryGraft: Persistent compromise of LLM agents via poisoned experience retrieval , author=. arXiv preprint arXiv:2512.16962 , year=

  15. [15]

    arXiv preprint arXiv:2601.07072 , year=

    Overcoming the Retrieval Barrier: Indirect Prompt Injection in the Wild for LLM Systems , author=. arXiv preprint arXiv:2601.07072 , year=

  16. [16]

    34th USENIX Security Symposium (USENIX Security 25) , pages=

    \ PoisonedRAG \ : Knowledge corruption attacks to \ Retrieval-Augmented \ generation of large language models , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=

  17. [17]

    Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=

    Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection , author=. Proceedings of the 16th ACM workshop on artificial intelligence and security , pages=

  18. [18]

    Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

    Attacks, defenses and evaluations for llm conversation safety: A survey , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  19. [19]

    arXiv preprint arXiv:2505.12567 , year=

    A survey of attacks on large language models , author=. arXiv preprint arXiv:2505.12567 , year=

  20. [20]

    Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    Jailbreak attacks and defenses against large language models: A survey , author=. arXiv preprint arXiv:2407.04295 , year=

  21. [21]

    do anything now

    " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models , author=. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security , pages=

  22. [22]

    arXiv preprint arXiv:2505.04806 , year=

    Red teaming the mind of the machine: A systematic evaluation of prompt injection and jailbreak vulnerabilities in llms , author=. arXiv preprint arXiv:2505.04806 , year=

  23. [23]

    ICT Express , year=

    From prompt injections to protocol exploits: Threats in LLM-powered AI agents workflows , author=. ICT Express , year=

  24. [24]

    arXiv preprint arXiv:2509.14285 , year=

    A multi-agent LLM defense pipeline against prompt injection attacks , author=. arXiv preprint arXiv:2509.14285 , year=

  25. [25]

    2021 , eprint=

    DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing , author=. 2021 , eprint=

  26. [26]

    Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , pages=

    A broad-coverage challenge corpus for sentence understanding through inference , author=. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , pages=

  27. [27]

    arXiv preprint arXiv:2510.02373 , year=

    A-memguard: A proactive defense framework for llm-based agent memory , author=. arXiv preprint arXiv:2510.02373 , year=

  28. [28]

    URLhttps://arxiv.org/abs/2601.05504 First Author et al.:Preprint submitted to ElsevierPage 20 of 21 Security, Privacy, and Ethical Risks in OpenClaw

    Memory Poisoning Attack and Defense on Memory Based LLM-Agents , author=. arXiv preprint arXiv:2601.05504 , year=

  29. [29]

    arXiv preprint arXiv:2603.02240 , year=

    SuperLocalMemory: Privacy-preserving multi-agent memory with Bayesian trust defense against memory poisoning , author=. arXiv preprint arXiv:2603.02240 , year=