Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.
hub Canonical reference
Zep: A Temporal Knowledge Graph Architecture for Agent Memory
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
We introduce Zep, a novel memory layer service for AI agents that outperforms the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR) benchmark. Additionally, Zep excels in more comprehensive and challenging evaluations than DMR that better reflect real-world enterprise use cases. While existing retrieval-augmented generation (RAG) frameworks for large language model (LLM)-based agents are limited to static document retrieval, enterprise applications demand dynamic knowledge integration from diverse sources including ongoing conversations and business data. Zep addresses this fundamental limitation through its core component Graphiti -- a temporally-aware knowledge graph engine that dynamically synthesizes both unstructured conversational data and structured business data while maintaining historical relationships. In the DMR benchmark, which the MemGPT team established as their primary evaluation metric, Zep demonstrates superior performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further validated through the more challenging LongMemEval benchmark, which better reflects enterprise use cases through complex temporal reasoning tasks. In this evaluation, Zep achieves substantial results with accuracy improvements of up to 18.5% while simultaneously reducing response latency by 90% compared to baseline implementations. These results are particularly pronounced in enterprise-critical tasks such as cross-session information synthesis and long-term context maintenance, demonstrating Zep's effectiveness for deployment in real-world applications.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We introduce Zep, a novel memory layer service for AI agents that outperforms the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR) benchmark. Additionally, Zep excels in more comprehensive and challenging evaluations than DMR that better reflect real-world enterprise use cases. While existing retrieval-augmented generation (RAG) frameworks for large language model (LLM)-based agents are limited to static document retrieval, enterprise applications demand dynamic knowledge integration from diverse sources including ongoing conversations and business data. Zep addresse
co-cited works
representative citing papers
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
MemSyco-Bench is a benchmark covering five tasks to evaluate memory-induced sycophancy in LLM agents, testing rejection of invalid memory, scope respect, conflict resolution, update tracking, and valid personalization.
TRACE models conversations as hierarchical graphs with temporal, causal, update, and contradiction relations plus validity annotations to enable bounded, state-aware query processing over long conversational histories.
An empirical comparison of thirteen control-plane placements in agent memory pipelines identifies three regimes with complementary forgetting recovery on a new 385-case adversarial benchmark, with mutation-time placement achieving 91.7-93.2% overall.
DCPM reorganizes LLM agent memory into a cognitive hierarchy driven by a synchronous daytime belief writer and an asynchronous nighttime schema engine, reporting gains on cross-session inference benchmarks.
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
TOKI types four common contradiction-resolution heuristics as bitemporal operators on a dual-row schema, supplies soundness theorems, and shows via a verdict matrix that it alone avoids three write-time anomalies while retaining a language-model judge.
PersonaTree is a new hierarchical memory framework for persistent LLM agents that structures evidence into persona claims via support paths and outperforms baselines on six person-understanding benchmarks.
eMEM is a multi-index memory architecture with tiered consolidation and ten recall tools for embodied agents, scoring 80.8 weighted mean on eMEM-Bench covering eight cognitive psychology paradigms and outperforming a flat RAG baseline on context and lure rejection tasks.
MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.
A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
MemMark enables snapshot-only attribution for agent long-term memory by embedding signals via keyed distribution-preserving sampling at memory-write decisions, recovering 40-bit payloads with near-baseline utility.
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
Introduces PrecisionMemBench, an 89-case benchmark for isolated retrieval precision in LLM memory systems, and Tenure, a structured store achieving 1.0 mean precision on all cases.
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipping as plugins and servers with an audit log.
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.
BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines on LoCoMo and ALFWorld.
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
citing papers explorer
-
Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models
Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.
-
MemSyco-Bench: Benchmarking Sycophancy in Agent Memory
MemSyco-Bench is a benchmark covering five tasks to evaluate memory-induced sycophancy in LLM agents, testing rejection of invalid memory, scope respect, conflict resolution, update tracking, and valid personalization.
-
TRACE: State-Aware Query Processing over Temporal Evidence Graphs for Conversational Data
TRACE models conversations as hierarchical graphs with temporal, causal, update, and contradiction relations plus validity annotations to enable bounded, state-aware query processing over long conversational histories.
-
Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations
An empirical comparison of thirteen control-plane placements in agent memory pipelines identifies three regimes with complementary forgetting recovery on a new 385-case adversarial benchmark, with mutation-time placement achieving 91.7-93.2% overall.
-
Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents
DCPM reorganizes LLM agent memory into a cognitive hierarchy driven by a synchronous daytime belief writer and an asynchronous nighttime schema engine, reporting gains on cross-session inference benchmarks.
-
Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
-
TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory
TOKI types four common contradiction-resolution heuristics as bitemporal operators on a dual-row schema, supplies soundness theorems, and shows via a verdict matrix that it alone avoids three write-time anomalies while retaining a language-model judge.
-
PersonaTree: Structured Lifecycle Memory for Person Understanding in LLM Agents
PersonaTree is a new hierarchical memory framework for persistent LLM agents that structures evidence into persona claims via support paths and outperforms baselines on six person-understanding benchmarks.
-
eMEM: A Hybrid Spatio-Temporal Memory System For Embodied Agents
eMEM is a multi-index memory architecture with tiered consolidation and ten recall tools for embodied agents, scoring 80.8 weighted mean on eMEM-Bench covering eight cognitive psychology paradigms and outperforming a flat RAG baseline on context and lure rejection tasks.
-
Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction
MemPoison enables stealthy memory poisoning in LLM agents via dialogue by using semantic relational bridges, entity masquerading, and joint embedding optimization to bypass selective extraction and rewriting, achieving up to 0.95 attack success rate.
-
Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization
A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.
-
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
-
MemMark: State-Evolution Attribution Watermarking for Agent Long-Term Memory Systems
MemMark enables snapshot-only attribution for agent long-term memory by embedding signals via keyed distribution-preserving sampling at memory-write decisions, recovering 40-bit payloads with near-baseline utility.
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
-
MEME: Multi-entity & Evolving Memory Evaluation
All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
-
Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval
Introduces PrecisionMemBench, an 89-case benchmark for isolated retrieval precision in LLM memory systems, and Tenure, a structured store achieving 1.0 mean precision on all cases.
-
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
-
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
-
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipping as plugins and servers with an audit log.
-
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
-
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines on LoCoMo and ALFWorld.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BEIR datasets.
-
SAGER: Self-Evolving User Policy Skills for Recommendation Agent
SAGER equips LLM recommendation agents with per-user evolving policy skills via two-representation architecture, contrastive CoT diagnosis, and skill-augmented listwise reasoning, yielding SOTA gains orthogonal to memory accumulation.
-
AgenticSZZ: Temporal Knowledge Graph-Guided Agentic Bug-Inducing Commit Identification
AgenticSZZ reframes bug-inducing commit identification as temporal knowledge graph search navigated by an LLM agent, reporting F1 scores of 0.47-0.79 and up to 34% improvement over prior SZZ methods on three datasets.
-
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
-
MIRIX: Multi-Agent Memory System for LLM-Based Agents
MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.
-
Experience Graphs: The Data Foundation for Self-Improving Agents
Trellis treats agent experience graphs as first-class database state so that search patterns become queries, enabling crash recovery, scaling, and closed-loop training as architectural byproducts.
-
MemLeak: Diagnosing Information Leaks in Multimodal Agent Memory
MemLeak benchmark shows retained images enable 12% recovery of deleted facts in multimodal agents (reduced to 2% with content-aware deletion), with 47% of image leaks not text-recoverable.
-
Mandol: An Agglomerative Agent Memory System for Long-Term Conversations
Mandol unifies memory storage and retrieval into an agglomerative semantic graph architecture with quantitative query mechanisms, reporting best accuracy on LoCoMo and LongMemEval plus 5.4x retrieval and 4.8x insertion speedups.
-
Selection Integrity for LLM Graph Memory: An Accumulability Criterion for Information-Flow-Blind Retrieval
Provenance checks in graph memory are blind to structural attacks that reallocate top-k membership; authselect prevents this by enforcing selection on the authenticated subgraph only.
-
Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering
DocTrace uses a document structural tree index, on-demand hypergraph working memory, and graph-structured experience memory to outperform ComoRAG by up to 8.85% F1 and 4.40% EM on three of four long-document QA datasets while cutting compute by 53.32%.
-
FLOWREADER: Min-Cost Flow Optimization for Multi-Modal Long Document Q&A
FLOWREADER reframes multimodal long-document evidence assembly as min-cost flow optimization and reports competitive results on VisDoMBench subsets with fragmented evidence.
-
Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
Engram's hybrid bi-temporal retrieval from a knowledge graph with provenance yields 83.6% accuracy on LongMemEval_S using 9.6k tokens versus 73.2% with full history.
-
Beyond Similarity: Trustworthy Memory Search for Personal AI Agents
MemGate is a 9M-parameter neural gate inserted between vector memory and LLM that converts similarity search into task-conditioned admission, reducing memory-induced threats across agent frameworks while preserving utility.
-
Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents
MRAgent combines a Cue-Tag-Content associative graph with active reconstruction to enable dynamic memory access in LLM agents, reporting up to 23% gains on long-memory benchmarks with lower token costs.
-
Memory Shot for Long-Term Dialogue
MemShot renders local dialogue spans as structured visual memory units to improve long-term dialogue modeling in LLMs, achieving competitive benchmark performance with 70x faster memory construction.
-
Rethinking Memory as Continuously Evolving Connectivity
FluxMem evolves memory as a heterogeneous graph via three refinement stages and reports consistent state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.
-
MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models
MemGuard assigns functional roles to memories at write time and selectively retrieves only compatible types, reducing heterogeneous contamination and improving reliability by up to 28.27% with 5.8x fewer tokens.
-
Simulating Human Memory with Language Models
Language models show superior memory to humans on psych experiments but can be adjusted via prompting and compaction to forget more human-like, yielding better user simulators.
-
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggregate differences.
-
The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems
ActiveGraph inverts traditional agent frameworks by treating the append-only event log as the primary source of truth, from which the reactive graph is projected, yielding deterministic replay, forking, and lineage tracking.
-
H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure
H-Mem introduces a hybrid tree-plus-graph memory mechanism that evolves short-term agent memories into long-term summaries and enables efficient retrieval, reporting state-of-the-art QA results on three benchmarks.
-
Agentic Recommender System with Hierarchical Belief-State Memory
MARS uses hierarchical event-preference-profile memory with an LLM-scheduled lifecycle of six operations to achieve state-of-the-art results on InstructRec benchmarks.
-
Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations
A hybrid LLM-symbolic verifier maintains a dependency graph over conversation turns classified into eight formal update operations, enabling linear-time groundedness checks and precise retraction propagation with a conflict-free guarantee.
-
PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents
PAI-2 improves factual correctness in LLM answers by 4% on average across benchmarks using adaptive graph traversal and planning, with 6% gains from traversal algorithms and 18% from enabled planning.
-
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
PRISM is a new inference-time retrieval system that achieves higher accuracy than baselines on long-horizon agent tasks while using an order of magnitude less context by combining hierarchical graph search, intent-based costing, compression, and adaptive routing over structured memory.