A-MEM: Agentic Memory for LLM Agents
Pith reviewed 2026-05-11 00:40 UTC · model grok-4.3
The pith
An agentic memory system lets LLM agents dynamically index, link, and evolve interconnected knowledge networks from their experiences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an agentic memory system, by generating notes with contextual descriptions, keywords, and tags for each new memory and then analyzing historical memories to establish meaningful links and trigger updates to existing entries, produces an evolving interconnected knowledge network that improves agent performance on complex tasks.
What carries the argument
The agentic memory process that creates structured notes and performs dynamic similarity-based linking together with evolution updates to prior memories.
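The note-link-evolve loop described above can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the `Note` fields, the `add_note` helper, the threshold value, and the use of keyword Jaccard overlap are hypothetical stand-ins. In the paper, attributes, links, and updates are produced by LLM prompting, not by the rule shown here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Note:
    """Hypothetical note layout; field names are illustrative, not the paper's."""
    note_id: int
    content: str
    context: str  # contextual description (LLM-generated in A-MEM)
    keywords: set[str]
    tags: set[str] = field(default_factory=set)
    links: set[int] = field(default_factory=set)


def jaccard(a: set[str], b: set[str]) -> float:
    """Keyword overlap, a stand-in for the paper's similarity analysis."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_note(store: dict[int, Note], new: Note, threshold: float = 0.25) -> None:
    """Link the new note to similar older notes, then let it update them."""
    for old in store.values():
        if jaccard(new.keywords, old.keywords) >= threshold:
            new.links.add(old.note_id)
            old.links.add(new.note_id)
            # Evolution stand-in: the older note inherits the new note's tags.
            old.tags |= new.tags
    store[new.note_id] = new


store: dict[int, Note] = {}
add_note(store, Note(1, "Fixed flaky CI by pinning numpy", "build debugging",
                     {"ci", "numpy", "pinning"}, {"devops"}))
add_note(store, Note(2, "numpy upgrade broke CI again", "dependency regression",
                     {"ci", "numpy", "upgrade"}, {"regression"}))
print(store[1].links, store[1].tags)
```

The point of the sketch is the control flow: adding one note both creates bidirectional links and mutates an older note, which is what makes the network evolve instead of only growing.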
If this is right
- Agents gain adaptability across diverse tasks because memory organization is no longer limited to fixed operations and structures.
- Historical experiences become more usable as new memories trigger refinements to the contextual representations of older ones.
- The memory network continuously evolves rather than remaining static, supporting longer-term task sequences.
- Performance gains appear consistently across multiple foundation models when compared with prior state-of-the-art memory systems.
Where Pith is reading between the lines
- Agents using this memory approach could maintain coherence over hundreds of steps without external human intervention to correct memory errors.
- The same linking mechanism might be applied to multi-agent settings where separate agents share and evolve a joint memory network.
- Efficiency questions arise for very large memory collections, where the cost of repeated similarity analysis could become a bottleneck.
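The efficiency concern in the last bullet can be illustrated with simple counting. The sketch below assumes a worst case in which every new memory is analyzed against every earlier one, and compares it with a hypothetical variant that analyzes at most `k` preselected candidates. Both function names and the value of `k` are illustrative, and the cost of the preselection step itself (for example, a vector index lookup) is not counted.

```python
def comparisons_exhaustive(n: int) -> int:
    """Each new memory is compared with every earlier one: 0 + 1 + ... + (n - 1)."""
    return n * (n - 1) // 2


def comparisons_top_k(n: int, k: int) -> int:
    """Each new memory is compared with at most k preselected candidates."""
    return sum(min(i, k) for i in range(n))


for n in (100, 10_000):
    print(n, comparisons_exhaustive(n), comparisons_top_k(n, k=10))
```

Under these assumptions the exhaustive count grows quadratically with the number of memories, while the capped count grows roughly linearly, which is why the size of the candidate set matters for large collections.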
Load-bearing premise
The underlying LLM must reliably produce accurate contextual descriptions, keywords, tags, and meaningful links without introducing errors or hallucinations that degrade the overall memory network.
What would settle it
Measure task performance on the six foundation models when the system is used versus when fixed memory baselines are used; if no consistent improvement appears, or if incorrect links cause measurable degradation over long sequences, the central claim does not hold.
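One way to operationalize "consistent improvement" across six models is a paired test on per-model score differences. The sketch below uses an exact sign-flip permutation test, chosen here only because it needs nothing beyond the standard library. It is not a procedure taken from the paper, and the scores are hypothetical placeholders, not reported results.

```python
from itertools import product
from statistics import mean


def sign_flip_p_value(diffs: list[float]) -> float:
    """Exact two-sided paired permutation test on per-model score differences."""
    observed = abs(mean(diffs))
    flips = list(product((1, -1), repeat=len(diffs)))
    extreme = sum(
        abs(mean([s * d for s, d in zip(signs, diffs)])) >= observed - 1e-12
        for signs in flips
    )
    return extreme / len(flips)


# Hypothetical scores for six models; NOT results from the paper.
with_memory = [0.62, 0.58, 0.71, 0.66, 0.54, 0.69]
baseline = [0.55, 0.57, 0.63, 0.60, 0.50, 0.61]
diffs = [a - b for a, b in zip(with_memory, baseline)]

print(all(d > 0 for d in diffs), sign_flip_p_value(diffs))
```

With only six paired observations, the smallest two-sided p-value this test can produce is 2/64, so a per-model comparison alone gives limited statistical resolution. Repeated runs per model would be needed for a sharper answer.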
Original abstract
While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/A-mem, while the source code of the agentic memory system is available at https://github.com/WujiangXu/A-mem-sys.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes A-MEM, an agentic memory system for LLM agents inspired by Zettelkasten principles. It dynamically creates structured notes (contextual descriptions, keywords, tags) for new memories via LLM prompting, identifies links to historical memories, and enables memory evolution by updating prior entries' representations as new information integrates. This forms an evolving interconnected knowledge network. The central claim is that this yields superior performance improvements over existing SOTA baselines across experiments on six foundation models, with source code released at two GitHub repositories.
Significance. If the empirical gains are robust and the memory network remains stable, the work could meaningfully advance memory systems for LLM agents by enabling adaptive, context-aware organization beyond fixed retrieval or static graphs. The explicit release of both evaluation and system code is a clear strength that supports reproducibility and follow-on work.
major comments (2)
- [Abstract and Experiments section] The claim of 'superior improvement against existing SOTA baselines' on six models is presented without any description of the experimental setup, specific baselines, evaluation metrics, statistical significance tests, task benchmarks, or controls for variance. This is load-bearing for the central empirical claim.
- [§3 (Memory Addition and Evolution)] The system relies on the LLM to generate accurate contextual descriptions, keywords, tags, and links, then to rewrite existing memories. No quantitative fidelity check, error-rate measurement, or manual validation of generated attributes and link quality is reported. Because updates create a closed loop that can propagate errors, this directly affects whether the claimed performance gains can be sustained.
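The closed-loop concern in the second major comment can be bounded with a graph traversal. The sketch below is a worst-case illustration under a strong assumption: every evolution step copies an error from a note to all of its linked neighbors. The link graph and the `reachable_within` helper are hypothetical, and the paper does not describe propagation in these terms.

```python
from collections import deque


def reachable_within(links: dict[int, set[int]], start: int, max_hops: int) -> set[int]:
    """Notes an error in `start` could touch if each update spreads one link per hop."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # Do not expand beyond the hop budget.
        for neighbor in links.get(node, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen - {start}


# Hypothetical link graph: note id -> linked note ids.
links = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2, 5}, 5: {4}}
print(reachable_within(links, start=1, max_hops=1))
print(reachable_within(links, start=1, max_hops=3))
```

A measured error rate per update would turn this upper bound into an expected contamination estimate, which is the kind of evidence the comment asks for.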
minor comments (2)
- [Abstract] The abstract mentions 'recent attempts to incorporate graph databases' but does not cite specific prior systems; adding 1-2 concrete references would clarify the positioning.
- [§3] Figure captions and algorithm pseudocode (if present in §3) could more explicitly label the LLM prompting steps versus the graph-update steps to improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript describing A-MEM. The comments highlight important areas for strengthening the presentation of our empirical results and the reliability of the memory operations. We address each major comment below and have revised the manuscript accordingly to improve clarity, completeness, and rigor.
- Referee: [Abstract and Experiments section] The claim of 'superior improvement against existing SOTA baselines' on six models is presented without any description of the experimental setup, specific baselines, evaluation metrics, statistical significance tests, task benchmarks, or controls for variance. This is load-bearing for the central empirical claim.
  Authors: We agree that the abstract is high-level and does not enumerate experimental details. The Experiments section (Section 4) does describe the six foundation models, task benchmarks (agentic QA, tool-use, and multi-step reasoning tasks), SOTA baselines (including fixed-retrieval and graph-memory systems), metrics (success rate, latency, and memory efficiency), and variance controls via repeated runs with different seeds. However, we acknowledge that statistical significance testing (e.g., paired t-tests or Wilcoxon signed-rank tests with reported p-values) and more explicit baseline implementation details were not sufficiently highlighted. In the revised manuscript we will (1) update the abstract with a concise sentence on the evaluation framework and (2) add a dedicated “Experimental Setup” subsection that includes all requested elements plus significance tests. These changes directly support the central empirical claim.
  Revision: yes
- Referee: [§3 (Memory Addition and Evolution)] The system relies on the LLM to generate accurate contextual descriptions, keywords, tags, and links, then to rewrite existing memories. No quantitative fidelity check, error-rate measurement, or manual validation of generated attributes and link quality is reported. Because updates create a closed loop that can propagate errors, this directly affects whether the claimed performance gains can be sustained.
  Authors: We recognize this as a substantive limitation. The original submission emphasizes end-to-end task performance and does not report direct fidelity measurements on the LLM-generated notes or links. In the revised version we will insert a new subsection (under Section 3 or 4) that presents quantitative validation: human evaluation on 200 randomly sampled memories measuring (a) accuracy of contextual descriptions, (b) relevance of keywords and tags, and (c) precision/recall of generated links. We will also report an error-propagation analysis by tracking how often an erroneous update affects downstream retrieval. These additions will allow readers to assess the robustness of the closed-loop evolution process.
  Revision: yes
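The link precision/recall measurement mentioned in the rebuttal can be sketched as a set comparison between generated links and human-annotated ones. The helper name, the choice to treat links as undirected, and the sample annotations are all assumptions for illustration. They are not data or code from the paper.

```python
from typing import Iterable


def link_precision_recall(
    predicted: Iterable[tuple[int, int]],
    gold: Iterable[tuple[int, int]],
) -> tuple[float, float]:
    """Treat links as undirected edges and score predicted against annotated ones."""
    pred = {frozenset(edge) for edge in predicted}
    ref = {frozenset(edge) for edge in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical annotations; NOT data from the paper.
predicted_links = [(1, 2), (2, 3), (4, 5), (1, 5)]
gold_links = [(2, 1), (2, 3), (3, 4)]
print(link_precision_recall(predicted_links, gold_links))
```

Using `frozenset` makes `(1, 2)` and `(2, 1)` count as the same link. If link direction carries meaning in the actual system, plain tuples would be the appropriate key instead.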
Circularity Check
No circularity: empirical system proposal without derivational reductions
The paper presents a design for an agentic memory system that generates structured notes, identifies links, and evolves prior entries via LLM prompts, explicitly following Zettelkasten principles. No equations, fitted parameters, uniqueness theorems, or mathematical derivations appear in the abstract or description. All performance claims rest on external empirical experiments across six models against SOTA baselines rather than any internal self-definition, prediction-from-fit, or self-citation chain that reduces the central result to its own inputs by construction. The system is therefore self-contained as an engineering proposal evaluated externally.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLM agents require sophisticated memory organization beyond basic storage and retrieval to handle complex tasks effectively.
- Domain assumption: Dynamic indexing, linking, and evolution of memories will produce an adaptive knowledge network superior to fixed-structure systems.
invented entities (1)
- Agentic memory network with evolving links (no independent evidence)
Lean theorems connected to this paper
- Foundation.LedgerForcing · conservation_from_balance · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist."
- Foundation.LedgerForcing · add_event_balanced · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "Additionally, this process enables memory evolution – as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding."
- Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
  UNCLEAR: Relation between the paper passage and the cited Recognition theorem.
  "Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
-
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
-
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
-
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
-
AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.
-
MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts
MemConflict provides a benchmark for testing LLM long-term memory systems under dynamic, static, and conditional conflicts involving temporal validity, factual correctness, and contextual applicability.
-
MemGym: a Long-Horizon Memory Environment for LLM Agents
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
-
EXG: Self-Evolving Agents with Experience Graphs
EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
-
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
-
Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.
-
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
-
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
-
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
-
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...
-
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
-
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
-
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.
-
MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
AEL: Agent Evolving Learning for Open-Ended Environments
AEL uses a fast-timescale bandit for memory policy selection and slow-timescale LLM reflection for causal insights, achieving a Sharpe ratio of 2.13 on a 208-episode portfolio benchmark while showing that added mechan...
-
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.
-
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
-
vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BE...
-
When to Forget: A Memory Governance Primitive
Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.
-
ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
-
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
-
GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing
GRAB-ANNS is a new GPU graph index that achieves up to 240x higher hybrid search throughput via bucket layouts and hybrid intra/inter-bucket edges.
-
PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
-
SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams
SensorPersona uses LLMs for hierarchical reasoning on longitudinal mobile sensor streams to continually extract stable personas, showing up to 31.4% higher recall and 85.7% win rate over baselines on a 20-user dataset.
-
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
-
$How^{2}$: How to learn from procedural How-to questions
$How^{2}$ is a memory agent framework enabling agents to ask, store, and reuse answers to how-to questions at varying abstraction levels for better lifelong planning in environments like Plancraft.
-
MIRIX: Multi-Agent Memory System for LLM-Based Agents
MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.
-
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods ...
-
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...
-
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggrega...
-
Self-Evolving Multi-Agent Systems via Decentralized Memory
DecentMem is a decentralized dual-pool memory framework for self-evolving multi-agent systems that provides O(log T) regret guarantees and yields up to 23.8% accuracy gains over centralized baselines.
-
EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning
EvoIR-Agent formulates experience components into a hierarchical pool with a self-evolving update mechanism to improve performance and efficiency of training-free MLLM image restoration agents over prior paradigms.
-
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate
Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
-
MedExpMem: Adapting Experience Memory for Differential Diagnosis
MedExpMem lets VLM diagnostic agents store and retrieve experience from past failures as pairwise differential notes, producing up to 7% accuracy gains on a multi-subspecialty radiology benchmark.
-
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
-
State Contamination in Memory-Augmented LLM Agents
Toxic context can be laundered into memory summaries that stay below toxicity thresholds while still driving higher downstream toxicity in LLM agents compared to neutral baselines.
-
DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory
DimMem introduces a dimensional memory framework that structures memories as typed atomic units to improve retrieval efficiency and accuracy for long-term LLM agent tasks.
-
H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure
H-Mem introduces a hybrid tree-plus-graph memory mechanism that evolves short-term agent memories into long-term summaries and enables efficient retrieval, reporting state-of-the-art QA results on three benchmarks.
-
Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory
SeqMem-Eval reveals that high final accuracy in sequential LLM memory tasks often coexists with substantial forgetting and negative transfer, exposing stability-adaptability trade-offs hidden by standard aggregate metrics.
Reference graph
Works this paper leans on
-
[1]
Sönke Ahrens.How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking. Amazon, 2017. Second Edition
work page 2017
-
[2]
The claude 3 model family: Opus, sonnet, haiku
Anthropic. The claude 3 model family: Opus, sonnet, haiku. Anthropic, Mar 2024. Accessed May 2025
work page 2024
-
[3]
Claude 3.5 sonnet model card addendum
Anthropic. Claude 3.5 sonnet model card addendum. Technical report, Anthropic, 2025. Accessed May 2025
work page 2025
-
[4]
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection.arXiv preprint arXiv:2310.11511, 2023
work page internal anchor Pith review arXiv 2023
-
[5]
Meteor: An automatic metric for mt evaluation with improved correlation with human judgments
Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005
work page 2005
-
[6]
Improving language models by retrieving from trillions of tokens
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. InInternational conference on machine learning, pages 2206–2240. PMLR, 2022
work page 2022
-
[7]
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023
-
[8]
mem0: The memory layer for ai agents
Khant Dev and Singh Taranjeet. mem0: The memory layer for ai agents. https://github.com/mem0ai/mem0, 2024
-
[9]
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024
-
[10]
Retrieval-Augmented Generation for Large Language Models: A Survey
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023
-
[11]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
-
[12]
I. Ilin. Advanced rag techniques: An illustrated overview, 2023
-
[13]
Jihyoung Jang, Minseong Boo, and Hyounghun Kim. Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations. arXiv preprint arXiv:2310.13420, 2023
-
[14]
Active retrieval augmented generation
Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983, 2023
-
[15]
David Kadavy. Digital Zettelkasten: Principles, Methods, & Examples. Google Books, May 2021
-
[16]
Dialsim: A real-time simulator for evaluating long-term multi-party dialogue understanding of conversational agents
Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, and Edward Choi. Dialsim: A real-time simulator for evaluating long-term multi-party dialogue understanding of conversational agents. arXiv preprint arXiv:2406.13144, 2024
-
[17]
A human-inspired reading agent with gist memory of very long contexts
Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727, 2024
-
[18]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020
-
[19]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004
-
[20]
Ra-dit: Retrieval-augmented dual instruction tuning
Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352, 2023
-
[21]
Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K Choubey, Tian Lan, Jason Wu, Huan Wang, et al. Agentlite: A lightweight library for building and advancing task-oriented llm agent system. arXiv preprint arXiv:2402.15538, 2024
-
[22]
Evaluating Very Long-Term Conversational Memory of LLM Agents
Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753, 2024
-
[23]
Aios: Llm agent operating system
Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. Aios: Llm agent operating system. arXiv e-prints, pp. arXiv–2403, 2024
-
[24]
Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze. Ret-llm: Towards a general read-write memory for large language models. arXiv preprint arXiv:2305.14322, 2023
-
[25]
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023
-
[26]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002
-
[27]
Sentence-bert: Sentence embeddings using siamese bert-networks
Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019
-
[28]
`smolagents`: a smol library to build great agentic systems
Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. `smolagents`: a smol library to build great agentic systems. https://github.com/huggingface/smolagents, 2025
-
[29]
Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294, 2023
-
[30]
Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, et al. From commands to prompts: Llm-based semantic file system for aios. arXiv preprint arXiv:2410.11843, 2024
-
[31]
Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022
-
[32]
Enhancing large language model with self-controlled memory framework
Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, and Zhoujun Li. Enhancing large language model with self-controlled memory framework. arXiv preprint arXiv:2304.13343, 2023
-
[33]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024
-
[34]
Learning to filter context for retrieval-augmented generation
Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377, 2023
-
[35]
Llm-powered autonomous agents
Lilian Weng. Llm-powered autonomous agents. lilianweng.github.io, Jun 2023
-
[36]
Beyond Goldfish Memory: Long-Term Open-Domain Conversation
J Xu. Beyond goldfish memory: Long-term open-domain conversation. arXiv preprint arXiv:2107.07567, 2021
-
[37]
Chain-of-note: Enhancing robustness in retrieval-augmented language models
Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210, 2023
-
[38]
Augmentation-adapted retriever improves generalization of language models as generic plug-in
Zichun Yu, Chenyan Xiong, Shi Yu, and Zhiyuan Liu. Augmentation-adapted retriever improves generalization of language models as generic plug-in. arXiv preprint arXiv:2305.17331, 2023
-
[39]
Memorybank: Enhancing large language models with long-term memory
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024
-
[40]
Identifying the most salient keywords (focus on nouns, verbs, and key concepts)
-
[41]
Extracting core themes and contextual elements
-
[42]
Creating relevant categorical tags
Format the response as a JSON object:
{
  "keywords": [
    // several specific, distinct keywords that capture key concepts and terminology
    // Order from most to least important
    // Don’t include keywords that are the name of the speaker or time
    // At least three keywords, but don’t be too redundant.
  ],
  "context": // one sente...
-
[43]
What specific actions should be taken (strengthen, update_neighbor)?
1.1 If choose to strengthen the connection, which memory should it be connected to? Can you give the updated tags of this memory?
1.2 If choose to update neighbor, you can update the context and tags of these memories based on the understanding of these memories. Tags should be determine...