Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

Saksham Sahai Srivastava

arxiv: 2605.17641 · v1 · pith:CU7M4PAMnew · submitted 2026-05-17 · 💻 cs.AI · cs.CL

Pith reviewed 2026-05-20 12:15 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords causal intervention, memory selection, long-horizon LLM agents, memory robustness, benchmark

The pith

Causal Memory Intervention selects memories by testing their effect on answers rather than topic similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-horizon LLM agents keep memory across many sessions but often pull in context that is topically related yet irrelevant or misleading. The paper introduces Causal Memory Intervention, a method that runs controlled interventions on each candidate memory to measure its actual impact on the model's output and keeps only those that improve performance. It evaluates the technique on Causal-LoCoMo, a benchmark built from long conversations that labels useful memories, distractors, and harmful ones. If the approach holds, agents could maintain higher answer quality over extended interactions without accumulating confusing or damaging context.

Core claim

Causal Memory Intervention uses targeted interventions to estimate how including or excluding each memory changes the model's answer, then selects the memories that raise task performance while reducing the influence of unstable, irrelevant, or harmful ones. On the Causal-LoCoMo benchmark it shows a stronger quality-robustness balance than the vector, graph, reflection, summary, full-history, and no-memory baselines.

What carries the argument

Causal Memory Intervention (CMI), which quantifies the causal effect of each memory on answer quality via controlled interventions and uses the results to filter the memory bank.
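The intervention loop can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the utility and stability differences follow the definitions quoted elsewhere on this page, but the selection rule itself is truncated in the source, so the thresholds in `select_memories` are an assumption, as are the names `score_fn`, `perturb_fn`, `toy_score`, and `toy_perturb`.

```python
from dataclasses import dataclass
from typing import Callable, List

# A scorer maps the memories placed in context to an answer-quality score.
ScoreFn = Callable[[List[str]], float]


@dataclass
class MemoryScore:
    memory: str
    utility: float    # s_with - s_no: gain over answering with no memory
    stability: float  # s_with - s_pert: gain over a perturbed copy of the memory


def score_memories(memories: List[str], score_fn: ScoreFn,
                   perturb_fn: Callable[[str], str]) -> List[MemoryScore]:
    """Score each candidate with a with-memory and a perturbed-memory call."""
    s_no = score_fn([])  # no-memory baseline, shared by all candidates
    scores = []
    for m in memories:
        s_with = score_fn([m])
        s_pert = score_fn([perturb_fn(m)])
        scores.append(MemoryScore(m, s_with - s_no, s_with - s_pert))
    return scores


def select_memories(scores: List[MemoryScore], min_utility: float = 0.0,
                    min_stability: float = 0.0) -> List[str]:
    """Assumed rule: keep memories with utility above and stability at or above the thresholds."""
    return [s.memory for s in scores
            if s.utility > min_utility and s.stability >= min_stability]


# Toy stand-ins (hypothetical): a real scorer would call the LLM and grade its answer.
def toy_score(context: List[str]) -> float:
    return 1.0 if any("Lisbon" in m for m in context) else 0.0


def toy_perturb(memory: str) -> str:
    return memory.replace("Lisbon", "[MASKED]")


bank = ["User lives in Lisbon", "User likes jazz"]
selected = select_memories(score_memories(bank, toy_score, toy_perturb))
print(selected)
```

In this toy run the distractor has zero utility and is dropped, while the memory that actually changes the answer is kept. The cost is worth noting: the sketch spends two scorer calls per candidate plus one shared baseline call.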

If this is right

CMI delivers a better balance of answer quality and resistance to misleading memories than standard retrieval and summarization baselines.
Long-term memory in agents becomes more reliable when selection rests on causal usefulness instead of semantic relevance.
The Causal-LoCoMo benchmark supplies a repeatable way to measure how memory choices affect downstream task performance under causal labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Causal filtering of this kind could be applied to prune training data or conversation histories to reduce error accumulation in deployed agents.
The same intervention logic might improve retrieval in question-answering systems that must ignore stale or contradictory passages.
Running the method on live multi-turn agent traces outside the benchmark would test whether its advantage persists when memory banks grow dynamically.

Load-bearing premise

The assumption that controlled interventions on candidate memories in the benchmark accurately predict their real-world effect on the LLM's answer quality when the model is used in open-ended long-horizon interactions.

What would settle it

The central claim would be falsified by a controlled test in an open-ended multi-session agent setting in which memories chosen by Causal Memory Intervention produce no measurable gain in answer quality or robustness to misleading context over relevance-based selection.

Figures

Figures reproduced from arXiv: 2605.17641 by Saksham Sahai Srivastava.

**Figure 1.** Overview of Causal Memory Intervention. Utility measures whether the memory improves the answer relative to having no memory. Stability measures whether this improvement is robust to perturbation. A memory with positive utility but negative stability may help only because of brittle or misleading phrasing, and is therefore not considered reliable.

**Figure 2.** Each method plotted by task score and poisoned-memory adoption rate. The desired region is the upper-left corner, corresponding to high task performance and low adoption of poisoned memories. CMI occupies this region, with the highest task score and zero poisoned-memory adoption. Vector, graph, and reflection memory achieve competitive task scores, but appear in a higher-risk region because they adopt poiso…

**Figure 3.** Distribution of CMI intervention utility by memory type. Excerpt from §7 (Analysis and Discussion): The results suggest that the main advantage of CMI comes from changing the criterion used for memory selection. Standard memory systems select context using semantic similarity, graph proximity, reflection-style representations, summaries, or full conversation histories. These mechanisms can surface memories that appear …
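The risk axis in Figure 2, the poisoned-memory adoption rate, can be made concrete with a short sketch. The page does not give the paper's exact definition, so the version below is an assumption: it counts the fraction of examples in which a method selected at least one memory labeled harmful. The function name and the memory IDs are hypothetical.

```python
from typing import List, Set


def poisoned_adoption_rate(selected_per_example: List[Set[str]],
                           harmful_per_example: List[Set[str]]) -> float:
    """Fraction of examples where at least one harmful memory ID was selected."""
    if len(selected_per_example) != len(harmful_per_example):
        raise ValueError("need one selected set and one harmful set per example")
    if not selected_per_example:
        return 0.0
    adopted = sum(1 for selected, harmful in
                  zip(selected_per_example, harmful_per_example)
                  if selected & harmful)  # non-empty intersection means adoption
    return adopted / len(selected_per_example)


# Made-up IDs for illustration only; these are not benchmark results.
selected_sets = [{"m1", "m3"}, {"m2"}, {"m4"}]
harmful_sets = [{"m3"}, {"m9"}, {"m4"}]
rate = poisoned_adoption_rate(selected_sets, harmful_sets)
print(rate)
```

Under this definition a method that never selects a harmful memory scores 0.0, which is the value the caption reports for CMI.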

Original abstract

Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly useful. This assumption is fragile because memories may be topically related while remaining irrelevant, stale, or misleading. We propose Causal Memory Intervention (CMI), a causal memory-selection technique that estimates how candidate memories affect the model's answer under controlled interventions, selecting memories that improve task performance while suppressing unstable, irrelevant, or harmful ones. To evaluate this setting, we introduce Causal-LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and synthetic harmful memories. We compare CMI against vector, graph, reflection, summary, full-history, and no-memory baselines. Results show that CMI achieves a stronger balance between answer quality and robustness to misleading memory, suggesting that reliable long-term memory requires selecting context based on causal usefulness rather than relevance alone. The full framework, benchmark construction code, and experimental pipeline are available at https://github.com/Saksham4796/causal-memory-intervention.
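The benchmark example described in the abstract (a user request plus a memory bank whose items are useful, irrelevant, or synthetically harmful) could be represented roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical and are not taken from the released benchmark code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class MemoryLabel(Enum):
    USEFUL = "useful"
    IRRELEVANT = "irrelevant"
    HARMFUL = "harmful"


@dataclass
class MemoryItem:
    memory_id: str
    text: str
    label: MemoryLabel  # causal annotation attached to this memory


@dataclass
class BenchmarkExample:
    request: str
    memory_bank: List[MemoryItem] = field(default_factory=list)

    def by_label(self, label: MemoryLabel) -> List[MemoryItem]:
        """Return the memories in the bank that carry the given annotation."""
        return [m for m in self.memory_bank if m.label is label]


# Invented example content, for illustration only.
example = BenchmarkExample(
    request="Where should I book dinner for my sister's birthday?",
    memory_bank=[
        MemoryItem("m1", "Sister is vegetarian.", MemoryLabel.USEFUL),
        MemoryItem("m2", "User bought a new laptop.", MemoryLabel.IRRELEVANT),
        MemoryItem("m3", "Sister loves steakhouses.", MemoryLabel.HARMFUL),
    ],
)
print([m.memory_id for m in example.by_label(MemoryLabel.HARMFUL)])
```

Keeping the label on each item, rather than in three separate lists, lets an evaluator hide the labels from the selector and use them only when scoring what was selected.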

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CMI applies causal interventions to pick useful memories for long-horizon agents and ships a new benchmark, but the synthetic harmful cases may not reflect how errors actually accumulate in real sessions.

Hey, the main point on this paper is that it introduces Causal Memory Intervention to select memories by estimating their direct effect on answer quality through add/remove interventions, rather than relying on semantic similarity, and it pairs that with the Causal-LoCoMo benchmark that includes useful, irrelevant, and synthetic harmful items. They run comparisons against vector, graph, reflection, summary, full-history, and no-memory baselines and claim a better quality-robustness trade-off, with code and pipeline released on GitHub.

Referee Report

1 major / 2 minor

Summary. The paper proposes Causal Memory Intervention (CMI), a technique that selects memories for long-horizon LLM agents by estimating their causal effect on answer quality via controlled add/remove interventions. It introduces the Causal-LoCoMo benchmark, derived from long conversational data and containing pre-labeled useful memories, irrelevant distractors, and synthetic harmful memories. CMI is compared against vector, graph, reflection, summary, full-history, and no-memory baselines, with results indicating a better quality-robustness trade-off than relevance-based selection.

Significance. If the central empirical claims hold under more realistic memory accumulation, the work could shift memory management in LLM agents from semantic similarity toward causal usefulness, improving robustness in multi-session settings. The open release of the full framework, benchmark construction code, and experimental pipeline supports reproducibility and follow-on work.

major comments (1)

[§4 and §5] §4 (Benchmark Construction) and §5 (Evaluation): The central performance claim—that CMI yields a stronger quality-robustness balance—rests on Causal-LoCoMo's fixed bank of synthetic harmful memories. If these are generated via templated or LLM-based perturbations rather than arising from genuine session drift, the measured causal effects may be artifacts of the annotation process, weakening the generalization argument relative to the vector/graph/reflection baselines.

minor comments (2)

[Abstract and §3] The abstract and §3 would benefit from a concise description of the exact intervention procedure (e.g., how answer change is quantified and thresholded) before the benchmark results are presented.
[Results tables] Table 1 or the main results table should report effect sizes or confidence intervals alongside the qualitative improvement statements to allow readers to assess the magnitude of the quality-robustness gains.
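One way to produce the confidence intervals requested in the second minor comment is a paired percentile bootstrap over per-example scores. The sketch below is illustrative only: the function name is hypothetical, the paper's actual scoring and statistics are not described on this page, and the score lists are made-up numbers, not results.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_bootstrap_ci(scores_a: List[float], scores_b: List[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed mean difference, CI lower bound, CI upper bound)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    # Pair scores by example so per-example difficulty cancels out.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample examples with replacement and record each resample's mean difference.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1)
    return mean(diffs), resampled[lo_idx], resampled[hi_idx]


# Made-up per-example scores for two methods, for illustration only.
method_a = [0.8, 0.6, 0.9, 0.7, 0.5, 0.9, 0.6, 0.8]
method_b = [0.7, 0.6, 0.6, 0.7, 0.4, 0.8, 0.7, 0.5]
observed, low, high = paired_bootstrap_ci(method_a, method_b)
print(f"mean difference {observed:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would support a claim that one method scores higher; an interval that straddles zero would suggest the qualitative improvement statement needs softening.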

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the benchmark and evaluation. We address the major comment below and will incorporate clarifications in the revised manuscript.

Referee: [§4 and §5] §4 (Benchmark Construction) and §5 (Evaluation): The central performance claim—that CMI yields a stronger quality-robustness balance—rests on Causal-LoCoMo's fixed bank of synthetic harmful memories. If these are generated via templated or LLM-based perturbations rather than arising from genuine session drift, the measured causal effects may be artifacts of the annotation process, weakening the generalization argument relative to the vector/graph/reflection baselines.

Authors: We thank the referee for this observation. Section 4 constructs Causal-LoCoMo from long conversational data, with harmful memories added via perturbations that introduce contradictions, staleness, or misleading content designed to reflect issues that arise during real multi-session drift. These are not arbitrary but target observable patterns in the source conversations. We acknowledge the synthetic nature limits direct claims about live drift and will revise Section 4 to detail the exact perturbation procedure, provide concrete examples with their real-world analogs, and add a limitations paragraph on generalization. This clarification strengthens the robustness argument while preserving the existing empirical comparisons.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method and benchmark evaluation remain independent of self-definition or fitted inputs

The paper introduces CMI as an intervention-based selection procedure that estimates answer-quality change under add/remove operations on candidate memories, then compares the resulting quality-robustness trade-off against external baselines (vector, graph, reflection, etc.) on the separately constructed Causal-LoCoMo benchmark. No equation or selection rule is defined in terms of the final performance metric, no parameter is fitted to a subset and then re-labeled as a prediction, and no load-bearing premise rests on a self-citation whose content is itself unverified. The benchmark's pre-labeled useful/irrelevant/harmful memories are presented as an external annotation layer against which CMI's interventions are tested, preserving falsifiability. Consequently the central claim does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides insufficient technical detail to enumerate specific free parameters, axioms, or invented entities; the central claim rests on the unstated mechanics of the causal intervention procedure and the representativeness of the synthetic harmful memories in the benchmark.

pith-pipeline@v0.9.0 · 5728 in / 1140 out tokens · 29209 ms · 2026-05-20T12:15:25.161514+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: CMI estimates whether candidate memories causally improve the agent's answer under controlled interventions... $\mathrm{Utility}(m_i) = s^{(i)}_{\mathrm{with}} - s_{\mathrm{no}}$ and $\mathrm{Stability}(m_i) = s^{(i)}_{\mathrm{with}} - s^{(i)}_{\mathrm{pert}}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We compare CMI against vector, graph, reflection, summary, full-history, and no-memory baselines... achieving the strongest overall balance between answer quality and safe memory selection

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
