pith. machine review for the scientific record.

arxiv: 2605.09330 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:39 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: agentic memory · spurious correlations · language models · calibration method · reasoning trajectories · memory architectures · causal structure

The pith

Agentic memory improves LLM reasoning on clean inputs but amplifies spurious correlations when present in trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how agentic memory, which lets language models retain and reuse information across multiple steps, creates a vulnerability to spurious correlations that propagate errors through reasoning chains. It benchmarks three types of these patterns identified via causal structure and finds that memory boosts performance when inputs are clean yet increases reliance on misleading associations once they appear. The authors introduce CAMEL as a calibration approach that adjusts memory at both writing and retrieval stages. A sympathetic reader would care because agentic systems are increasingly used for ongoing tasks, where uncorrected spurious patterns could compound mistakes over time.

Core claim

The paper claims that trajectory-level memory in agentic systems amplifies spurious correlations identified through causal structure, and that CAMEL, a plug-and-play calibration method operating at write and retrieval time, reduces reliance on these patterns across memory architectures while preserving or improving performance on clean inputs and remaining robust to adaptive attacks.

What carries the argument

CAMEL, a calibration method that operates across diverse memory architectures at both write and retrieval time to counteract spurious correlations.
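
To make the shape of that machinery concrete, here is a minimal sketch of a calibration layer wrapped around a host memory store, assuming an embedding-based store with plain write and retrieve calls. The class names, the over-fetch factor, and the isotropic perturbation test are illustrative assumptions, not the paper's implementation; the only points taken from the paper are that calibration acts on entries as they are written and on candidates as they are retrieved, and that the write-time step subtracts a step-level mean.

```python
# Illustrative sketch only: a calibration layer wrapped around a host memory store.
# Class names and parameters are hypothetical; this is not the paper's CAMEL code.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import numpy as np


@dataclass
class HostMemory:
    """Stand-in for any embedding-based memory architecture."""
    entries: List[Tuple[np.ndarray, str]] = field(default_factory=list)

    def write(self, emb: np.ndarray, text: str) -> None:
        self.entries.append((emb, text))

    def retrieve(self, query: np.ndarray, k: int) -> List[Tuple[np.ndarray, str]]:
        # Plain dot-product retrieval, highest score first.
        return sorted(self.entries, key=lambda e: -float(e[0] @ query))[:k]


@dataclass
class CalibratedMemory:
    """Write-time and retrieval-time hooks around the host store."""
    host: HostMemory
    step_mean: Optional[np.ndarray] = None
    n: int = 0

    def write(self, emb: np.ndarray, text: str) -> None:
        # Write-time hook: subtract the running per-step mean so only
        # content-specific signal is stored (the mean-subtraction idea is the
        # part taken from the paper; see the sketch under the Lean theorem link).
        if self.step_mean is None:
            self.step_mean = np.zeros_like(emb)
        residual = emb - self.step_mean
        self.step_mean = self.step_mean + (emb - self.step_mean) / (self.n + 1)
        self.n += 1
        self.host.write(residual, text)

    def retrieve(self, query: np.ndarray, k: int, trials: int = 5, eps: float = 0.05) -> List[str]:
        # Retrieval-time hook: over-fetch, then keep only candidates whose rank
        # survives small random perturbations of the query. The paper perturbs
        # along non-causal directions; isotropic noise here is a simplification.
        candidates = self.host.retrieve(query, k=3 * k)
        stable: List[str] = []
        for _, text in candidates:
            survives = True
            for _ in range(trials):
                noisy = query + eps * np.random.randn(*query.shape)
                top = [t for _, t in self.host.retrieve(noisy, k=3 * k)]
                if text not in top:
                    survives = False
                    break
            if survives:
                stable.append(text)
        return stable[:k]
```

Whether something this simple captures CAMEL's behavior is exactly what the paper's experiments are meant to establish; the sketch only fixes where the hooks live.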

If this is right

  • Agentic memory systems without calibration will show increased error propagation when spurious patterns exist in stored trajectories.
  • CAMEL can be added to existing memory architectures to lower spurious reliance while keeping gains on clean reasoning tasks.
  • The calibration remains effective against adaptive attacks that target the mitigation process itself.
  • Benchmarking via causal structure provides a diagnostic tool for identifying trajectory-level vulnerabilities before deployment.
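
Editorial sketch of how the last point could work in practice: build trajectories from a known causal graph and measure whether the agent's answers track a surface cue rather than the cause. The probe below covers a single pattern, unmeasured confounding, which the paper includes among its benchmarked types; the trajectory fields, the agent_answer callable, and the gap metric are hypothetical, not the paper's benchmark.

```python
# Hypothetical diagnostic probe for one benchmarked pattern, unmeasured confounding:
# a latent factor z drives both a surface cue and the correct label, and z itself is
# never written to memory. The probe measures how much accuracy depends on the cue.
# Field names and the agent_answer callable are illustrative assumptions.
import random
from typing import Callable, Dict, List


def make_trajectory(confounded: bool, rng: random.Random) -> Dict[str, str]:
    z = rng.choice(["fast", "slow"])                     # latent confounder, never stored
    label = "accept" if z == "fast" else "reject"        # outcome truly driven by z
    if confounded:
        cue = "vendor_A" if z == "fast" else "vendor_B"  # cue tracks the label only via z
    else:
        cue = rng.choice(["vendor_A", "vendor_B"])       # cue decoupled from the label
    return {"memory": f"step log: supplier={cue}", "label": label}


def spurious_gap(agent_answer: Callable[[str], str], n: int = 200, seed: int = 0) -> float:
    """Accuracy on confounded trajectories minus accuracy on decoupled ones.

    An agent that ignores the cue scores near chance in both conditions (the true
    cause is never in memory), so a large positive gap signals reliance on the cue.
    """
    rng = random.Random(seed)

    def accuracy(confounded: bool) -> float:
        cases: List[Dict[str, str]] = [make_trajectory(confounded, rng) for _ in range(n)]
        return sum(agent_answer(c["memory"]) == c["label"] for c in cases) / n

    return accuracy(True) - accuracy(False)
```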

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This suggests memory mechanisms in long-running agents should include default calibration steps to prevent error accumulation across extended interactions.
  • Similar calibration approaches could be tested on other forms of persistent state in AI systems beyond the three pattern types examined here.
  • The findings point toward a need for ongoing monitoring of memory content in deployed agents to catch emerging spurious correlations not covered in initial benchmarks.

Load-bearing premise

The benchmark patterns identified through causal structure accurately represent the spurious correlations that arise in real deployed agentic memory systems.

What would settle it

An experiment applying CAMEL to a deployed agentic system with naturally occurring spurious correlations outside the three benchmarked causal types, then measuring whether spurious reliance still decreases without harming clean performance.
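
If such an experiment were run, the headline comparison reduces to two numbers per condition, roughly as sketched below. The trace fields, in particular a flag marking whether a retrieval was later judged spurious, are assumptions about what a deployment log would need to provide; the paper does not specify this instrumentation.

```python
# Sketch of the before/after comparison such an experiment would report.
# Each trace record is assumed to carry a gold answer, the agent's answer, and a
# flag marking whether the retrieval that preceded it was later judged spurious.
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class TraceRecord:
    answer: str
    gold: str
    retrieval_was_spurious: bool


def clean_and_spurious_accuracy(traces: Iterable[TraceRecord]) -> Tuple[float, float]:
    clean_hits = clean_total = spur_hits = spur_total = 0
    for t in traces:
        if t.retrieval_was_spurious:
            spur_total += 1
            spur_hits += t.answer == t.gold
        else:
            clean_total += 1
            clean_hits += t.answer == t.gold
    clean_acc = clean_hits / clean_total if clean_total else float("nan")
    spur_acc = spur_hits / spur_total if spur_total else float("nan")
    return clean_acc, spur_acc

# The experiment would "settle it" if, comparing calibrated against uncalibrated
# runs, spurious-case accuracy rises while clean-case accuracy does not fall.
```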

Figures

Figures reproduced from arXiv: 2605.09330 by Dazheng Zhang, Luoxi Tang, Rupali Rajendra Vaje, Sakshi Sunil Narkar, Weicheng Ma, Yuqiao Meng, Zeyu Ding, Zhaohan Xi.

Figure 1. Three types of spurious correlations that mislead agentic memory.
Figure 2. Decision accuracy as a function of future steps after a spurious retrieval.
Figure 3. Accuracy of CAMEL and baselines under adaptive attacks. Each bar indicates the spurious pattern type injected by an attacker aware of the calibration method.
Figure 4. Accuracy-token cost tradeoff; radius indicates the token cost variance.
Figure 5. Additional future-step accuracy results after spurious retrievals. Setup follows Figure 2.
read the original abstract

Agentic memory enables LLMs to persist information beyond a single context window and reuse it in later decisions, but it also introduces a new vulnerability: spurious correlations, where retrieved memory carries miscorrelated evidence and propagates erroneous reasoning into downstream decisions. Despite the widespread use of agentic memory, this risk remains largely underexplored. We address it from two aspects. First, we benchmark several canonical types of spurious patterns identified through causal structure and record them across trajectory-level memory. Diagnosing agentic memory systems on this benchmark reveals that memory improves reasoning on clean inputs but amplifies reliance on spurious patterns when they are present. Second, we propose CAMEL, a plug-and-play calibration method that operates across diverse memory architectures at both write and retrieval time. CAMEL consistently reduces reliance on spurious patterns across all three types while preserving or improving performance on clean inputs and staying robust under adaptive attacks targeting the calibration. Overall, CAMEL offers a principled and lightweight solution toward more reliable agentic memory deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that agentic memory in LLMs improves reasoning on clean inputs but amplifies reliance on spurious correlations when present in trajectories. It benchmarks three canonical spurious pattern types identified via causal structure, and proposes CAMEL, a plug-and-play calibration method operating at write and retrieval time that reduces spurious reliance across patterns while preserving clean performance and remaining robust to adaptive attacks.

Significance. If the results hold, the work is significant for highlighting an underexplored vulnerability in widely used agentic memory systems and providing a lightweight, architecture-agnostic mitigation. Strengths include the empirical benchmarking across multiple pattern types and explicit robustness testing under adaptive attacks. However, the absence of experimental details and validation that synthetic patterns match real-world spurious correlations limits the strength of the conclusions.

major comments (2)
  1. Abstract: the abstract reports benchmark results and method effectiveness but provides no details on experimental design, statistical tests, or data construction. This is load-bearing for the central claims, as it prevents verification of whether memory amplification and CAMEL's reported gains are supported.
  2. Benchmark construction (implied in abstract): the central claim that memory amplifies spurious reliance while CAMEL mitigates it rests on three synthetic pattern types constructed by intervening on causal graphs in trajectories. The manuscript provides no direct evidence (e.g., observational studies on production traces or non-synthetic agent runs) that these patterns represent the distribution of spurious links that arise naturally from retrieval and reuse in deployed agentic memory systems.
minor comments (1)
  1. The three types of spurious patterns should be defined with explicit examples or causal diagrams in the main text for clarity, as the abstract refers to them without elaboration.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation and scope of our work on spurious correlations in agentic memory systems. We address each major comment below and indicate planned revisions to the manuscript.

read point-by-point responses
  1. Referee: Abstract: the abstract reports benchmark results and method effectiveness but provides no details on experimental design, statistical tests, or data construction. This is load-bearing for the central claims, as it prevents verification of whether memory amplification and CAMEL's reported gains are supported.

    Authors: We agree that the abstract would benefit from additional context to support verification of the claims. In the revised version, we will expand the abstract to briefly describe the benchmark construction via causal graph interventions on trajectories, the evaluation setup comparing clean versus spurious inputs, and note that performance metrics are reported as averages over multiple independent runs with standard deviations provided in the main text and appendix. Full details on data construction, metrics, and any statistical analyses remain in Sections 3 and 4. This change will make the central results more immediately verifiable while respecting abstract length constraints. revision: yes

  2. Referee: Benchmark construction (implied in abstract): the central claim that memory amplifies spurious reliance while CAMEL mitigates it rests on three synthetic pattern types constructed by intervening on causal graphs in trajectories. The manuscript provides no direct evidence (e.g., observational studies on production traces or non-synthetic agent runs) that these patterns represent the distribution of spurious links that arise naturally from retrieval and reuse in deployed agentic memory systems.

    Authors: We acknowledge that the benchmarks are synthetically constructed through targeted interventions on causal structures to isolate canonical spurious pattern types, as detailed in Section 3. This design enables rigorous, controlled measurement of memory amplification and CAMEL's mitigation effects, which would be difficult to isolate in uncontrolled real-world traces. We do not include observational studies from production agent runs in the current work. We will add an explicit limitations subsection in the Discussion to justify the synthetic approach, discuss its implications for generalizability, and suggest directions for future real-world validation studies. revision: partial

standing simulated objections not resolved
  • Direct empirical validation that the three synthetic spurious patterns match the distribution of naturally occurring spurious correlations in deployed production agentic memory systems.

Circularity Check

0 steps flagged

No circularity: empirical benchmarking and calibration method are self-contained

full rationale

The paper conducts an empirical study by constructing synthetic benchmarks of spurious patterns via causal structures in trajectories, evaluating memory systems on them, and proposing the CAMEL calibration method. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. The central claims rest on experimental outcomes rather than any chain that reduces by construction to the inputs. The benchmark patterns are presented as an evaluation tool, not as a self-defined result. This is a standard empirical setup with external falsifiability through the reported experiments and robustness tests, warranting a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical derivations, fitted parameters, axioms, or newly postulated entities; the work is framed as empirical diagnosis and a practical mitigation method.

pith-pipeline@v0.9.0 · 5504 in / 981 out tokens · 31183 ms · 2026-05-12T04:39:52.590374+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged: unclear

    Relation between the paper passage and the cited Recognition theorem:

    CAMEL intervenes at write time by subtracting the step mean μ(s) from each memory embedding, retaining only content-specific signal; at retrieval time it tests stability under perturbations along non-causal directions.
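
A minimal sketch of that write-time step, assuming a running mean over the entries written so far in the current step: each embedding is residualized against the mean of the entries that preceded it, and only then is the mean updated (reversing the order would subtract part of the embedding from itself). Treat the class and method names as illustrative, not as the paper's reference implementation.

```python
# Minimal sketch of the write-time residualization described above: subtract the
# step mean mu(s) from each memory embedding before it is stored. A running mean
# over the entries written so far in the current step is assumed; the order
# (residualize first, then update the mean) matters, otherwise each embedding
# would be partly subtracted from itself. Names here are illustrative.
import numpy as np


class StepResidualizer:
    def __init__(self, dim: int):
        self.mu = np.zeros(dim)  # step mean mu(s)
        self.n = 0               # entries written in the current step

    def write(self, h: np.ndarray) -> np.ndarray:
        h_tilde = h - self.mu                              # residualize with the pre-update mean
        self.mu = self.mu + (h - self.mu) / (self.n + 1)   # incremental mean update
        self.n += 1
        return h_tilde                                     # store h_tilde in the host index

    def close_step(self) -> None:
        # Reset when the agent moves to the next step, so mu(s) stays step-local.
        self.mu = np.zeros_like(self.mu)
        self.n = 0
```

Because the subtraction removes only what is shared across a step's entries, whatever distinguishes one entry from its neighbors is preserved, which is a plausible reading of the "content-specific signal" the passage refers to.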

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
