Recognition: 2 theorem links · Lean Theorem
The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory
Pith reviewed 2026-05-12 04:39 UTC · model grok-4.3
The pith
Agentic memory improves LLM reasoning on clean inputs but amplifies reliance on spurious correlations when they are present in stored trajectories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that trajectory-level memory in agentic systems amplifies spurious correlations identified through causal structure. Its proposed remedy, CAMEL, is a plug-and-play calibration method operating at write and retrieval time that reduces reliance on these patterns across memory architectures while preserving or improving performance on clean inputs and remaining robust to adaptive attacks.
What carries the argument
CAMEL, a calibration method that operates across diverse memory architectures at both write and retrieval time to counteract spurious correlations.
If this is right
- Agentic memory systems without calibration will show increased error propagation when spurious patterns exist in stored trajectories.
- CAMEL can be added to existing memory architectures to lower spurious reliance while keeping gains on clean reasoning tasks.
- The calibration remains effective against adaptive attacks that target the mitigation process itself.
- Benchmarking via causal structure provides a diagnostic tool for identifying trajectory-level vulnerabilities before deployment.
Where Pith is reading between the lines
- This suggests memory mechanisms in long-running agents should include default calibration steps to prevent error accumulation across extended interactions.
- Similar calibration approaches could be tested on other forms of persistent state in AI systems beyond the three pattern types examined here.
- The findings point toward a need for ongoing monitoring of memory content in deployed agents to catch emerging spurious correlations not covered in initial benchmarks.
Load-bearing premise
The benchmark patterns identified through causal structure accurately represent the spurious correlations that arise in real deployed agentic memory systems.
What would settle it
An experiment applying CAMEL to a deployed agentic system with naturally occurring spurious correlations outside the three benchmarked causal types, then measuring whether spurious reliance still decreases without harming clean performance.
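One way to score such an experiment, sketched under the assumption of paired clean/spurious task sets with binary success signals; the function names and the tolerance are hypothetical, not from the paper:

```python
# Hypothetical scoring for the settling experiment: spurious reliance is the
# accuracy gap between clean inputs and the same tasks carrying a spurious
# pattern; CAMEL "settles it" if that gap shrinks without hurting clean accuracy.

def accuracy(results: list[bool]) -> float:
    return sum(results) / len(results)

def spurious_reliance(clean: list[bool], spurious: list[bool]) -> float:
    """Accuracy drop when the same tasks carry a spurious correlation."""
    return accuracy(clean) - accuracy(spurious)

def settles_it(clean_base, spur_base, clean_camel, spur_camel, tol=0.01) -> bool:
    reliance_drop = (spurious_reliance(clean_base, spur_base)
                     - spurious_reliance(clean_camel, spur_camel))
    clean_preserved = accuracy(clean_camel) >= accuracy(clean_base) - tol
    return reliance_drop > 0 and clean_preserved
```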
Original abstract
Agentic memory enables LLMs to persist information beyond a single context window and reuse it in later decisions, but it also introduces a new vulnerability: spurious correlations, where retrieved memory carries miscorrelated evidence and propagates erroneous reasoning into downstream decisions. Despite the widespread use of agentic memory, this risk remains largely underexplored. We address it from two aspects. First, we benchmark several canonical types of spurious patterns identified through causal structure and record them across trajectory-level memory. Diagnosing agentic memory systems on this benchmark reveals that memory improves reasoning on clean inputs but amplifies reliance on spurious patterns when they are present. Second, we propose CAMEL, a plug-and-play calibration method that operates across diverse memory architectures at both write and retrieval time. CAMEL consistently reduces reliance on spurious patterns across all three types while preserving or improving performance on clean inputs and staying robust under adaptive attacks targeting the calibration. Overall, CAMEL offers a principled and lightweight solution toward more reliable agentic memory deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that agentic memory in LLMs improves reasoning on clean inputs but amplifies reliance on spurious correlations when present in trajectories. It benchmarks three canonical spurious pattern types identified via causal structure, and proposes CAMEL, a plug-and-play calibration method operating at write and retrieval time that reduces spurious reliance across patterns while preserving clean performance and remaining robust to adaptive attacks.
Significance. If the results hold, the work is significant for highlighting an underexplored vulnerability in widely used agentic memory systems and providing a lightweight, architecture-agnostic mitigation. Strengths include the empirical benchmarking across multiple pattern types and explicit robustness testing under adaptive attacks. However, the absence of experimental detail, and of validation that the synthetic patterns match real-world spurious correlations, limits the strength of the conclusions.
major comments (2)
- Abstract: the abstract reports benchmark results and method effectiveness but provides no details on experimental design, statistical tests, or data construction. This is load-bearing for the central claims, as it prevents verification of whether memory amplification and CAMEL's reported gains are supported.
- Benchmark construction (implied in abstract): the central claim that memory amplifies spurious reliance while CAMEL mitigates it rests on three synthetic pattern types constructed by intervening on causal graphs in trajectories. The manuscript provides no direct evidence (e.g., observational studies on production traces or non-synthetic agent runs) that these patterns represent the distribution of spurious links that arise naturally from retrieval and reuse in deployed agentic memory systems.
minor comments (1)
- The three types of spurious patterns should be defined with explicit examples or causal diagrams in the main text for clarity, as the abstract refers to them without elaboration.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps clarify the presentation and scope of our work on spurious correlations in agentic memory systems. We address each major comment below and indicate planned revisions to the manuscript.
Point-by-point responses
- Referee: Abstract: the abstract reports benchmark results and method effectiveness but provides no details on experimental design, statistical tests, or data construction. This is load-bearing for the central claims, as it prevents verification of whether memory amplification and CAMEL's reported gains are supported.
  Authors: We agree that the abstract would benefit from additional context to support verification of the claims. In the revised version, we will expand the abstract to briefly describe the benchmark construction via causal graph interventions on trajectories and the evaluation setup comparing clean versus spurious inputs, and to note that performance metrics are reported as averages over multiple independent runs with standard deviations provided in the main text and appendix. Full details on data construction, metrics, and statistical analyses remain in Sections 3 and 4. This change will make the central results more immediately verifiable while respecting abstract length constraints.
  revision: yes
- Referee: Benchmark construction (implied in abstract): the central claim that memory amplifies spurious reliance while CAMEL mitigates it rests on three synthetic pattern types constructed by intervening on causal graphs in trajectories. The manuscript provides no direct evidence (e.g., observational studies on production traces or non-synthetic agent runs) that these patterns represent the distribution of spurious links that arise naturally from retrieval and reuse in deployed agentic memory systems.
  Authors: We acknowledge that the benchmarks are synthetically constructed through targeted interventions on causal structures to isolate canonical spurious pattern types, as detailed in Section 3 (a toy illustration of such an intervention follows this list). This design enables rigorous, controlled measurement of memory amplification and CAMEL's mitigation effects, which would be difficult to isolate in uncontrolled real-world traces. We do not include observational studies from production agent runs in the current work. We will add an explicit limitations subsection in the Discussion to justify the synthetic approach, discuss its implications for generalizability, and suggest directions for future real-world validation studies.
  revision: partial
- Still missing: direct empirical validation that the three synthetic spurious patterns match the distribution of naturally occurring spurious correlations in deployed production agentic memory systems.
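As referenced in the response above, a toy example of planting and then breaking one canonical spurious pattern (confounding) by intervening on a small causal graph. This illustrates the general technique only; it is not the paper's actual benchmark generator, and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Confounded structure: C -> S and C -> Y, with no S -> Y edge, so any
# S-Y association is spurious. The intervention do(S) severs C -> S.
C = rng.normal(size=n)                  # hidden confounder
S = C + rng.normal(scale=0.5, size=n)   # non-causal feature
Y = C + rng.normal(scale=0.5, size=n)   # outcome; never reads S

print("observational corr(S, Y): %.2f" % np.corrcoef(S, Y)[0, 1])

S_do = rng.normal(size=n)               # do(S): resample S independently of C
print("interventional corr(do(S), Y): %.2f" % np.corrcoef(S_do, Y)[0, 1])
```

A system that has learned to read S will keep scoring well observationally but fail under do(S); that gap is what a benchmark built from such interventions can measure.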
Circularity Check
No circularity: empirical benchmarking and calibration method are self-contained
full rationale
The paper conducts an empirical study by constructing synthetic benchmarks of spurious patterns via causal structures in trajectories, evaluating memory systems on them, and proposing the CAMEL calibration method. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. The central claims rest on experimental outcomes rather than any chain that reduces by construction to the inputs. The benchmark patterns are presented as an evaluation tool, not as a self-defined result. This is a standard empirical setup with external falsifiability through the reported experiments and robustness tests, warranting a score of 0.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
CAMEL intervenes at write time by subtracting the step mean μ(s) from each memory embedding, retaining only content-specific signal; at retrieval time it tests stability under perturbations along non-causal directions.
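A minimal sketch of the retrieval-time idea described above, assuming cosine-scored retrieval over residualized embeddings and a precomputed set of non-causal directions. The acceptance rule, the perturbation size eps, and all names are illustrative assumptions, not the paper's specification:

```python
import numpy as np

def top_k_ids(query: np.ndarray, memory: np.ndarray, k: int = 5) -> set:
    """IDs of the k memories with highest cosine score against the query."""
    scores = memory @ query / (np.linalg.norm(memory, axis=1)
                               * np.linalg.norm(query) + 1e-12)
    return set(np.argsort(-scores)[:k].tolist())

def stable_under_perturbation(query: np.ndarray,
                              memory: np.ndarray,          # (n, d) residualized embeddings
                              non_causal_dirs: np.ndarray, # (m, d) unit vectors
                              eps: float = 0.1,
                              k: int = 5) -> bool:
    """Accept a retrieval only if the top-k set survives nudges of the query
    along every non-causal direction (illustrative stability criterion)."""
    base = top_k_ids(query, memory, k)
    for d in non_causal_dirs:
        if top_k_ids(query + eps * d, memory, k) != base:
            return False
        if top_k_ids(query - eps * d, memory, k) != base:
            return False
    return True
```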
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, and Evgeny Burnaev. AriGraph: Learning knowledge graph world models with episodic memory for LLM agents. arXiv preprint arXiv:2407.04363, 2024.
- [2] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
- [3] Sunil Arya, David M Mount, Nathan S Netanyahu, Ruth Silverman, and Angela Y Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.
- [4] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
- [5] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In International Conference on Learning Representations, volume 2024, pages 9079–9093, 2024.
- [6] Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2318–2335, 2024.
- [7] Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. AgentPoison: Red-teaming LLM agents via poisoning memory or knowledge bases. Advances in Neural Information Processing Systems, 37:130185–130213, 2024.
- [8] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.
- [9] Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. A practical memory injection attack against LLM agents. arXiv e-prints, pages arXiv–2503, 2025.
- [10] Mengnan Du, Fengxiang He, Na Zou, Dacheng Tao, and Xia Hu. Shortcut learning of large language models in natural language understanding. Communications of the ACM, 67(1):110–120, 2023.
- [11] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
- [12] Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, et al. CausalT5k: Diagnosing and informing refusal for trustworthy causal reasoning of skepticism, sycophancy, detection-correction, and rung collapse. arXiv preprint arXiv:2602.08939, 2026.
- [13] Brian D Haig. What is a spurious correlation? Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences, 2(2):125–132, 2003.
- [14] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2021.
- [15] Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In International Conference on Machine Learning, pages 6781–6792. PMLR, 2021.
- [16] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.
- [17] Xiaoqun Liu, Jiacheng Liang, Qiben Yan, Jiyong Jang, Sicheng Mao, Muchao Ye, Jinyuan Jia, and Zhaohan Xi. CyLens: Towards reinventing cyber threat intelligence in the paradigm of agentic large language models. arXiv preprint arXiv:2502.20791, 2025.
- [18] Yiming Ma, Lixu Wang, Lionel Z Wang, Hongkun Yang, Haoming Sun, Xin Xu, Jiaqi Wu, Bin Chen, and Wei Dong. How implicit bias accumulates and propagates in LLM long-term memory. arXiv preprint arXiv:2602.01558, 2026.
- [19] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024.
- [20] Arakaparampil M Mathai. Jacobians of Matrix Transformations and Functions of Matrix Arguments. World Scientific, 1997.
- [21] Yuqiao Meng, Luoxi Tang, Feiyang Yu, Jinyuan Jia, Guanhua Yan, Ping Yang, and Zhaohan Xi. Uncovering vulnerabilities of LLM-assisted cyber threat intelligence. arXiv preprint arXiv:2509.23573, 2025.
- [22] Yuqiao Meng, Luoxi Tang, Feiyang Yu, Xi Li, Guanhua Yan, Ping Yang, and Zhaohan Xi. Benchmarking LLM-assisted blue teaming via standardized threat hunting. arXiv preprint arXiv:2509.23571, 2025.
- [23] Yuqiao Meng, Luoxi Tang, Dazheng Zhang, Rafael Brens, Elvys J Romero, Nancy Guo, Safa Elkefi, and Zhaohan Xi. Small agent group is the future of digital health. arXiv preprint arXiv:2602.08013, 2026.
- [24] Vrushket More, Lyra Lu, Zeyu Ding, Zhaohan Xi, Seth Mizia, and Nancy L Guo. TheraMind: a multi-LLM ensemble for accelerating drug repurposing in lung cancer via case report mining. npj Precision Oncology, 2026.
- [25] Leland Gerson Neuberg. Causality: Models, Reasoning, and Inference, by Judea Pearl, Cambridge University Press, 2000. Econometric Theory, 19(4):675–685, 2003.
- [26] Jihwan Oh, Minchan Jeong, Jongwoo Ko, and Se-Young Yun. Understanding bias reinforcement in LLM agents debate. arXiv preprint arXiv:2503.16814, 2025.
- [27] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. MemGPT: Towards LLMs as operating systems. 2023.
- [28] Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
- [29] Judea Pearl. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
- [30] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):947–1012, 2016.
- [31] Jonas Peters, Joris M Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. 2014.
- [32] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
- [33] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
- [34] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In International Conference on Machine Learning, pages 1670–1679. PMLR, 2016.
- [35] Bernhard Schölkopf. Causality for machine learning. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 765–804, 2022.
- [36] Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.
- [37] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023.
- [38] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.
- [39] Steven Sloman and David A Lagnado. Causal invariance in reasoning and learning. Psychology of Learning and Motivation, 44:287–326, 2004.
- [40] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.
- [41] Vignesh Sriram, Yuqiao Meng, Luoxi Tang, and Zhaohan Xi. Adversarial network imagination: Causal LLMs and digital twins for proactive telecom mitigation. arXiv preprint arXiv:2602.13203, 2026.
- [42] Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon LLM agent via context-folding. arXiv preprint arXiv:2510.11967, 2025.
- [43] Balachandra Devarangadi Sunil, Isheeta Sinha, Piyush Maheshwari, Shantanu Todmal, Shreyan Mallik, and Shuchi Mishra. Memory poisoning attack and defense on memory-based LLM agents. arXiv preprint arXiv:2601.05504, 2026.
- [44] Luoxi Tang, Yuqiao Meng, Joseph Costa, Yingxue Zhang, Muchao Ye, and Zhaohan Xi. The value of variance: Mitigating debate collapse in multi-agent systems via uncertainty-driven policy optimization. arXiv preprint arXiv:2602.07186, 2026.
- [45] Luoxi Tang, Yuqiao Meng, Ankita Patra, Weicheng Ma, Muchao Ye, and Zhaohan Xi. POLAR: Automating cyber threat prioritization through LLM-powered assessment. arXiv preprint arXiv:2510.01552, 2025.
- [46] Ruixiang Tang, Dehan Kong, Longtao Huang, et al. Large language models can be lazy learners: Analyze shortcuts in in-context learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4645–4657, 2023.
- [47] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
- [48] Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298, 2022.
- [49] Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. BadAgent: Inserting and activating backdoor attacks in LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9811–9827, 2024.
- [50] Qianshan Wei, Tengchao Yang, Yaochen Wang, Xinfeng Li, Lijun Li, Zhenfei Yin, Yi Zhan, Thorsten Holz, Zhiqiang Lin, and XiaoFeng Wang. A-MemGuard: A proactive defense framework for LLM-based agent memory. arXiv preprint arXiv:2510.02373, 2025.
- [51] Daniel Westreich. Berkson's bias, selection bias, and missing data. Epidemiology, 23(1):159–164, 2012.
- [52] Junda Wu, Tong Yu, Xiang Chen, Haoliang Wang, Ryan Rossi, Sungchul Kim, Anup Rao, and Julian McAuley. DeCoT: Debiasing chain-of-thought for knowledge-intensive tasks in large language models via causal intervention. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14073–14087, 2024.
- [53] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, 2024.
- [54] Yaxiong Wu, Sheng Liang, Chen Zhang, Yichao Wang, Yongyue Zhang, Huifeng Guo, Ruiming Tang, and Yong Liu. From human memory to AI memory: A survey on memory mechanisms in the era of LLMs. arXiv preprint arXiv:2504.15965, 2025.
- [55] Zhaohan Xi. All your knowledge belongs to us: Stealing knowledge graphs via reasoning APIs. arXiv preprint arXiv:2503.09727, 2025.
- [56] Guandong Xu, Tri Dung Duong, Qian Li, Shaowu Liu, and Xianzhi Wang. Causality learning: A new perspective for interpretable machine learning. arXiv preprint arXiv:2006.16789, 2020.
- [57] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.
- [58] Shuai Yang, Qi Yang, Luoxi Tang, Yuqiao Meng, Nancy Guo, Jeremy Blackburn, and Zhaohan Xi. On the eligibility of LLMs for counterfactual reasoning: a decompositional study. arXiv preprint arXiv:2505.11839, 2025.
- [59] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.
- [60] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
- [61] Wenqian Ye, Guangtao Zheng, Xu Cao, Yunsheng Ma, and Aidong Zhang. Spurious correlations in machine learning: A survey. arXiv e-prints, pages arXiv–2402, 2024.
- [62] Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, and Shuicheng Yan. G-Memory: Tracing hierarchical memory for multi-agent systems. arXiv preprint arXiv:2506.07398, 2025.
- [63] Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025.
- [64] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024.
- [65] Yuan Zhou, Peng Zhang, Mengya Song, Alice Zheng, Yiwen Lu, Zhiheng Liu, Yong Chen, and Zhaohan Xi. Zodiac: A cardiologist-level LLM framework for multi-agent diagnostics. arXiv preprint arXiv:2410.02026, 2024.
Extracted passage: CAMEL's write-time procedure
- Obtain h_m via Appendix D.1 (an existing embedding or a fresh text encoding).
- Apply Eq. (2) to get h̃_m = h_m − μ(s), using the current mean, i.e., the mean over the n entries already written in this step.
- Update μ(s) ← μ(s) + (1/(n+1))(h_m − μ(s)), then increment n ← n + 1. The order matters: residualization uses the pre-update mean, so h̃_m measures how m departs from the context already established by earlier entries in the step; performing the update before the residualization would subtract h_m partly from itself. Counter and state: the counter n tracks entries written in the current step; implementation details (initialization, ordering, step closure) are in Appendix C.3.
- Apply the content-novelty write criterion (Appendix C.2) on h̃_m. For a graph memory, the structural form of the criterion (a node is inserted only when it would form at least one edge to a node from a different episode/session) is the natural analogue of the embedding-side cosine check, and again reads no outcome variable.
- Store h̃_m in the ANN index.
- Use h̃_m wherever the host system would have used h_m. Any cosine or dot-product score the host computes between nodes is now computed on residualized vectors; edges formed by feature similarity automatically avoid connecting nodes whose only commonality was shared step-level context. The graph topology, edge types, and node text remain untouched: only the …
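A minimal sketch of this write-time procedure, assuming a flat Python list as a stand-in for the ANN index and an illustrative cosine novelty check; the class, the threshold, and the reset logic are assumptions, not the paper's implementation:

```python
import numpy as np

class StepResidualWriter:
    """Write-time calibration sketch: subtract the running step mean from each
    embedding before storage, updating the mean only afterwards."""

    def __init__(self, dim: int):
        self.mu = np.zeros(dim)  # step mean mu(s)
        self.n = 0               # entries already written in this step

    def write(self, h_m: np.ndarray, index: list) -> np.ndarray:
        # 1. Residualize against the PRE-update mean: h~_m = h_m - mu(s).
        h_tilde = h_m - self.mu
        # 2. Update the running mean and counter AFTER residualizing;
        #    updating first would subtract h_m partly from itself.
        self.mu += (h_m - self.mu) / (self.n + 1)
        self.n += 1
        # 3. Stand-in for the content-novelty write criterion (Appendix C.2).
        if self._is_novel(h_tilde, index):
            index.append(h_tilde)  # stand-in for the ANN index insert
        return h_tilde  # used wherever the host would have used h_m

    def _is_novel(self, h_tilde: np.ndarray, index: list, thresh: float = 0.9) -> bool:
        # Illustrative cosine check; the paper's exact criterion is in C.2.
        for v in index:
            cos = float(h_tilde @ v) / (np.linalg.norm(h_tilde)
                                        * np.linalg.norm(v) + 1e-12)
            if cos > thresh:
                return False
        return True

    def close_step(self):
        # New step: reset the step-level state (Appendix C.3 covers details).
        self.mu[:] = 0.0
        self.n = 0
```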