pith. machine review for the scientific record.

arxiv: 2605.09330 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:39 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: agentic memory · spurious correlations · language models · calibration method · reasoning trajectories · memory architectures · causal structure

The pith

Agentic memory improves LLM reasoning on clean inputs but amplifies spurious correlations when present in trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how agentic memory, which lets language models retain and reuse information across multiple steps, creates a vulnerability to spurious correlations that propagate errors through reasoning chains. It benchmarks three types of these patterns identified via causal structure and finds that memory boosts performance when inputs are clean yet increases reliance on misleading associations once they appear. The authors introduce CAMEL as a calibration approach that adjusts memory at both writing and retrieval stages. A sympathetic reader would care because agentic systems are increasingly used for ongoing tasks, where uncorrected spurious patterns could compound mistakes over time.

Core claim

The paper claims that trajectory-level memory in agentic systems amplifies spurious correlations identified through causal structure, and that CAMEL, a plug-and-play calibration method operating at write and retrieval time, reduces reliance on these patterns across memory architectures while preserving or improving performance on clean inputs and remaining robust to adaptive attacks.

What carries the argument

CAMEL, a calibration method that operates across diverse memory architectures at both write and retrieval time to counteract spurious correlations.
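
To make the shape of that machinery concrete, here is a minimal sketch of a calibration layer wrapped around a host memory store, assuming an embedding-based store with plain write and retrieve calls. The class names, the over-fetch factor, and the isotropic perturbation test are illustrative assumptions, not the paper's implementation; the only points taken from the paper are that calibration acts on entries as they are written and on candidates as they are retrieved, and that the write-time step subtracts a step-level mean.

```python
# Illustrative sketch only: a calibration layer wrapped around a host memory store.
# Class names and parameters are hypothetical; this is not the paper's CAMEL code.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import numpy as np


@dataclass
class HostMemory:
    """Stand-in for any embedding-based memory architecture."""
    entries: List[Tuple[np.ndarray, str]] = field(default_factory=list)

    def write(self, emb: np.ndarray, text: str) -> None:
        self.entries.append((emb, text))

    def retrieve(self, query: np.ndarray, k: int) -> List[Tuple[np.ndarray, str]]:
        # Plain dot-product retrieval, highest score first.
        return sorted(self.entries, key=lambda e: -float(e[0] @ query))[:k]


@dataclass
class CalibratedMemory:
    """Write-time and retrieval-time hooks around the host store."""
    host: HostMemory
    step_mean: Optional[np.ndarray] = None
    n: int = 0

    def write(self, emb: np.ndarray, text: str) -> None:
        # Write-time hook: subtract the running per-step mean so only
        # content-specific signal is stored (the mean-subtraction idea is the
        # part taken from the paper; see the sketch under the Lean theorem link).
        if self.step_mean is None:
            self.step_mean = np.zeros_like(emb)
        residual = emb - self.step_mean
        self.step_mean = self.step_mean + (emb - self.step_mean) / (self.n + 1)
        self.n += 1
        self.host.write(residual, text)

    def retrieve(self, query: np.ndarray, k: int, trials: int = 5, eps: float = 0.05) -> List[str]:
        # Retrieval-time hook: over-fetch, then keep only candidates whose rank
        # survives small random perturbations of the query. The paper perturbs
        # along non-causal directions; isotropic noise here is a simplification.
        candidates = self.host.retrieve(query, k=3 * k)
        stable: List[str] = []
        for _, text in candidates:
            survives = True
            for _ in range(trials):
                noisy = query + eps * np.random.randn(*query.shape)
                top = [t for _, t in self.host.retrieve(noisy, k=3 * k)]
                if text not in top:
                    survives = False
                    break
            if survives:
                stable.append(text)
        return stable[:k]
```

Whether something this simple captures CAMEL's behavior is exactly what the paper's experiments are meant to establish; the sketch only fixes where the hooks live.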

If this is right

  • Agentic memory systems without calibration will show increased error propagation when spurious patterns exist in stored trajectories.
  • CAMEL can be added to existing memory architectures to lower spurious reliance while keeping gains on clean reasoning tasks.
  • The calibration remains effective against adaptive attacks that target the mitigation process itself.
  • Benchmarking via causal structure provides a diagnostic tool for identifying trajectory-level vulnerabilities before deployment.
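
Editorial sketch of how the last point could work in practice: build trajectories from a known causal graph and measure whether the agent's answers track a surface cue rather than the cause. The probe below covers a single pattern, unmeasured confounding, which the paper includes among its benchmarked types; the trajectory fields, the agent_answer callable, and the gap metric are hypothetical, not the paper's benchmark.

```python
# Hypothetical diagnostic probe for one benchmarked pattern, unmeasured confounding:
# a latent factor z drives both a surface cue and the correct label, and z itself is
# never written to memory. The probe measures how much accuracy depends on the cue.
# Field names and the agent_answer callable are illustrative assumptions.
import random
from typing import Callable, Dict, List


def make_trajectory(confounded: bool, rng: random.Random) -> Dict[str, str]:
    z = rng.choice(["fast", "slow"])                     # latent confounder, never stored
    label = "accept" if z == "fast" else "reject"        # outcome truly driven by z
    if confounded:
        cue = "vendor_A" if z == "fast" else "vendor_B"  # cue tracks the label only via z
    else:
        cue = rng.choice(["vendor_A", "vendor_B"])       # cue decoupled from the label
    return {"memory": f"step log: supplier={cue}", "label": label}


def spurious_gap(agent_answer: Callable[[str], str], n: int = 200, seed: int = 0) -> float:
    """Accuracy on confounded trajectories minus accuracy on decoupled ones.

    An agent that ignores the cue scores near chance in both conditions (the true
    cause is never in memory), so a large positive gap signals reliance on the cue.
    """
    rng = random.Random(seed)

    def accuracy(confounded: bool) -> float:
        cases: List[Dict[str, str]] = [make_trajectory(confounded, rng) for _ in range(n)]
        return sum(agent_answer(c["memory"]) == c["label"] for c in cases) / n

    return accuracy(True) - accuracy(False)
```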

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This suggests memory mechanisms in long-running agents should include default calibration steps to prevent error accumulation across extended interactions.
  • Similar calibration approaches could be tested on other forms of persistent state in AI systems beyond the three pattern types examined here.
  • The findings point toward a need for ongoing monitoring of memory content in deployed agents to catch emerging spurious correlations not covered in initial benchmarks.

Load-bearing premise

The benchmark patterns identified through causal structure accurately represent the spurious correlations that arise in real deployed agentic memory systems.

What would settle it

An experiment applying CAMEL to a deployed agentic system with naturally occurring spurious correlations outside the three benchmarked causal types, then measuring whether spurious reliance still decreases without harming clean performance.
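
If such an experiment were run, the headline comparison reduces to two numbers per condition, roughly as sketched below. The trace fields, in particular a flag marking whether a retrieval was later judged spurious, are assumptions about what a deployment log would need to provide; the paper does not specify this instrumentation.

```python
# Sketch of the before/after comparison such an experiment would report.
# Each trace record is assumed to carry a gold answer, the agent's answer, and a
# flag marking whether the retrieval that preceded it was later judged spurious.
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TraceRecord:
    answer: str
    gold: str
    retrieval_was_spurious: bool


def clean_and_spurious_accuracy(traces: Iterable[TraceRecord]) -> Tuple[float, float]:
    clean_hits = clean_total = spur_hits = spur_total = 0
    for t in traces:
        if t.retrieval_was_spurious:
            spur_total += 1
            spur_hits += t.answer == t.gold
        else:
            clean_total += 1
            clean_hits += t.answer == t.gold
    clean_acc = clean_hits / clean_total if clean_total else float("nan")
    spur_acc = spur_hits / spur_total if spur_total else float("nan")
    return clean_acc, spur_acc

# The experiment would "settle it" if, comparing calibrated against uncalibrated
# runs, spurious-case accuracy rises while clean-case accuracy does not fall.
```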

Figures

Figures reproduced from arXiv: 2605.09330 by Dazheng Zhang, Luoxi Tang, Rupali Rajendra Vaje, Sakshi Sunil Narkar, Weicheng Ma, Yuqiao Meng, Zeyu Ding, Zhaohan Xi.

Figure 1. Three types of spurious correlations that mislead agentic memory.
Figure 2. Decision accuracy as a function of future steps after a spurious retrieval.
Figure 3. Accuracy of CAMEL and baselines under adaptive attacks. Each bar indicates the spurious pattern type injected by an attacker aware of the calibration method.
Figure 4. Accuracy-token cost tradeoff; radius indicates the token cost variance.
Figure 5. Additional future-step accuracy results after spurious retrievals. Setup follows Figure 2.
read the original abstract

Agentic memory enables LLMs to persist information beyond a single context window and reuse it in later decisions, but it also introduces a new vulnerability: spurious correlations, where retrieved memory carries miscorrelated evidence and propagates erroneous reasoning into downstream decisions. Despite the widespread use of agentic memory, this risk remains largely underexplored. We address it from two aspects. First, we benchmark several canonical types of spurious patterns identified through causal structure and record them across trajectory-level memory. Diagnosing agentic memory systems on this benchmark reveals that memory improves reasoning on clean inputs but amplifies reliance on spurious patterns when they are present. Second, we propose CAMEL, a plug-and-play calibration method that operates across diverse memory architectures at both write and retrieval time. CAMEL consistently reduces reliance on spurious patterns across all three types while preserving or improving performance on clean inputs and staying robust under adaptive attacks targeting the calibration. Overall, CAMEL offers a principled and lightweight solution toward more reliable agentic memory deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that agentic memory in LLMs improves reasoning on clean inputs but amplifies reliance on spurious correlations when present in trajectories. It benchmarks three canonical spurious pattern types identified via causal structure, and proposes CAMEL, a plug-and-play calibration method operating at write and retrieval time that reduces spurious reliance across patterns while preserving clean performance and remaining robust to adaptive attacks.

Significance. If the results hold, the work is significant for highlighting an underexplored vulnerability in widely used agentic memory systems and providing a lightweight, architecture-agnostic mitigation. Strengths include the empirical benchmarking across multiple pattern types and explicit robustness testing under adaptive attacks. However, the absence of experimental details and validation that synthetic patterns match real-world spurious correlations limits the strength of the conclusions.

major comments (2)
  1. Abstract: the abstract reports benchmark results and method effectiveness but provides no details on experimental design, statistical tests, or data construction. This is load-bearing for the central claims, as it prevents verification of whether memory amplification and CAMEL's reported gains are supported.
  2. Benchmark construction (implied in abstract): the central claim that memory amplifies spurious reliance while CAMEL mitigates it rests on three synthetic pattern types constructed by intervening on causal graphs in trajectories. The manuscript provides no direct evidence (e.g., observational studies on production traces or non-synthetic agent runs) that these patterns represent the distribution of spurious links that arise naturally from retrieval and reuse in deployed agentic memory systems.
minor comments (1)
  1. The three types of spurious patterns should be defined with explicit examples or causal diagrams in the main text for clarity, as the abstract refers to them without elaboration.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation and scope of our work on spurious correlations in agentic memory systems. We address each major comment below and indicate planned revisions to the manuscript.

read point-by-point responses
  1. Referee: Abstract: the abstract reports benchmark results and method effectiveness but provides no details on experimental design, statistical tests, or data construction. This is load-bearing for the central claims, as it prevents verification of whether memory amplification and CAMEL's reported gains are supported.

    Authors: We agree that the abstract would benefit from additional context to support verification of the claims. In the revised version, we will expand the abstract to briefly describe the benchmark construction via causal graph interventions on trajectories, the evaluation setup comparing clean versus spurious inputs, and note that performance metrics are reported as averages over multiple independent runs with standard deviations provided in the main text and appendix. Full details on data construction, metrics, and any statistical analyses remain in Sections 3 and 4. This change will make the central results more immediately verifiable while respecting abstract length constraints. revision: yes

  2. Referee: Benchmark construction (implied in abstract): the central claim that memory amplifies spurious reliance while CAMEL mitigates it rests on three synthetic pattern types constructed by intervening on causal graphs in trajectories. The manuscript provides no direct evidence (e.g., observational studies on production traces or non-synthetic agent runs) that these patterns represent the distribution of spurious links that arise naturally from retrieval and reuse in deployed agentic memory systems.

    Authors: We acknowledge that the benchmarks are synthetically constructed through targeted interventions on causal structures to isolate canonical spurious pattern types, as detailed in Section 3. This design enables rigorous, controlled measurement of memory amplification and CAMEL's mitigation effects, which would be difficult to isolate in uncontrolled real-world traces. We do not include observational studies from production agent runs in the current work. We will add an explicit limitations subsection in the Discussion to justify the synthetic approach, discuss its implications for generalizability, and suggest directions for future real-world validation studies. revision: partial

standing simulated objections not resolved
  • Direct empirical validation that the three synthetic spurious patterns match the distribution of naturally occurring spurious correlations in deployed production agentic memory systems.

Circularity Check

0 steps flagged

No circularity: empirical benchmarking and calibration method are self-contained

full rationale

The paper conducts an empirical study by constructing synthetic benchmarks of spurious patterns via causal structures in trajectories, evaluating memory systems on them, and proposing the CAMEL calibration method. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. The central claims rest on experimental outcomes rather than any chain that reduces by construction to the inputs. The benchmark patterns are presented as an evaluation tool, not as a self-defined result. This is a standard empirical setup with external falsifiability through the reported experiments and robustness tests, warranting a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical derivations, fitted parameters, axioms, or newly postulated entities; the work is framed as empirical diagnosis and a practical mitigation method.

pith-pipeline@v0.9.0 · 5504 in / 981 out tokens · 31183 ms · 2026-05-12T04:39:52.590374+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear

    Relation between the paper passage and the cited Recognition theorem:

    CAMEL intervenes at write time by subtracting the step mean μ(s) from each memory embedding, retaining only content-specific signal; at retrieval time it tests stability under perturbations along non-causal directions.
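
A minimal sketch of that write-time step, assuming a running mean over the entries written so far in the current step: each embedding is residualized against the mean of the entries that preceded it, and only then is the mean updated (reversing the order would subtract part of the embedding from itself). Treat the class and method names as illustrative, not as the paper's reference implementation.

```python
# Minimal sketch of the write-time residualization described above: subtract the
# step mean mu(s) from each memory embedding before it is stored. A running mean
# over the entries written so far in the current step is assumed; the order
# (residualize first, then update the mean) matters, otherwise each embedding
# would be partly subtracted from itself. Names here are illustrative.
import numpy as np


class StepResidualizer:
    def __init__(self, dim: int):
        self.mu = np.zeros(dim)  # step mean mu(s)
        self.n = 0               # entries written in the current step

    def write(self, h: np.ndarray) -> np.ndarray:
        h_tilde = h - self.mu                              # residualize with the pre-update mean
        self.mu = self.mu + (h - self.mu) / (self.n + 1)   # incremental mean update
        self.n += 1
        return h_tilde                                     # store h_tilde in the host index

    def close_step(self) -> None:
        # Reset when the agent moves to the next step, so mu(s) stays step-local.
        self.mu = np.zeros_like(self.mu)
        self.n = 0
```

Because the subtraction removes only what is shared across a step's entries, whatever distinguishes one entry from its neighbors is preserved, which is a plausible reading of the "content-specific signal" the passage refers to.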

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
