Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Daniel Chalef; Jack Ryan; Pavlo Paliychuk; Preston Rasmussen; Travis Beauvais

arxiv: 2501.13956 · v1 · submitted 2025-01-20 · 💻 cs.CL · cs.AI · cs.IR

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef

Pith reviewed 2026-05-11 10:57 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR

keywords temporal knowledge graph · AI agent memory · knowledge graph engine · retrieval augmented generation · LLM agents · dynamic data integration · temporal reasoning

The pith

A temporal knowledge graph lets AI agents dynamically integrate conversational and business data for better memory performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Zep as a memory service for AI agents that uses a temporal knowledge graph to handle ongoing data from conversations and structured sources. Existing systems are limited to static retrieval, but enterprise applications need to track how information evolves over time across sessions. By maintaining historical relationships in the graph, Zep aims to support complex reasoning about past events while keeping response times low. This matters because it could make reliable long-term agent behavior possible in business settings where data is constantly updated.

Core claim

Zep employs Graphiti, a temporally-aware knowledge graph engine, to dynamically synthesize unstructured conversational data and structured business data while preserving their temporal relationships, resulting in improved accuracy on temporal reasoning benchmarks compared to previous approaches.

What carries the argument

Graphiti is a temporally-aware knowledge graph engine that builds and queries a structure containing time-stamped facts and relationships from both text conversations and database records.
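To make "time-stamped facts" concrete, here is a minimal sketch of a temporal fact store in which a new fact closes the validity window of an older one instead of deleting it. It is an illustration only and not Graphiti's data model or API: the names `TemporalFact`, `ToyTemporalGraph`, `add_fact`, and `as_of` are hypothetical, and the sketch assumes each (subject, relation) pair holds a single value at a time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class TemporalFact:
    """One edge of a toy temporal graph (hypothetical, not Graphiti's schema)."""
    subject: str
    relation: str
    obj: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means the fact is still valid


class ToyTemporalGraph:
    """Keeps superseded facts so that past states remain queryable."""

    def __init__(self) -> None:
        self.facts: List[TemporalFact] = []

    def add_fact(self, subject: str, relation: str, obj: str, at: datetime) -> None:
        # Assumption: one value per (subject, relation). Close the open fact
        # rather than deleting it, so history is preserved.
        for fact in self.facts:
            if (fact.subject == subject and fact.relation == relation
                    and fact.valid_to is None):
                fact.valid_to = at
        self.facts.append(TemporalFact(subject, relation, obj, at))

    def as_of(self, subject: str, relation: str, when: datetime) -> Optional[str]:
        # Return the value whose validity window [valid_from, valid_to) contains `when`.
        for fact in self.facts:
            if (fact.subject == subject and fact.relation == relation
                    and fact.valid_from <= when
                    and (fact.valid_to is None or when < fact.valid_to)):
                return fact.obj
        return None
```

Adding "alice works_at Acme" in 2023 and "alice works_at Globex" in mid-2024 leaves both facts in the store; an `as_of` query dated between the two returns the older employer, which is the kind of historical lookup a static retrieval index cannot answer.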

If this is right

Agents can synthesize information across multiple sessions with higher accuracy.
Latency for responses involving long-term context drops by about 90% relative to baseline implementations.
Dynamic updates to knowledge from new conversations and business data are handled without full recomputation.
Enterprise tasks requiring cross-source temporal reasoning become more reliable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar temporal graph techniques might apply to other AI systems needing to track evolving knowledge, such as in scientific data analysis.
Combining this with other memory architectures could lead to hybrid systems for even more complex agent behaviors.
Real-world testing in production environments would be needed to confirm if the benchmark gains persist under variable data conditions.

Load-bearing premise

Benchmarks focused on memory retrieval and temporal reasoning accurately predict performance in actual enterprise agent applications with diverse and changing data sources.

What would settle it

A direct comparison on a held-out enterprise dataset with frequent data updates and long conversation histories where Zep fails to show accuracy or latency improvements over baselines.

Original abstract

We introduce Zep, a novel memory layer service for AI agents that outperforms the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR) benchmark. Additionally, Zep excels in more comprehensive and challenging evaluations than DMR that better reflect real-world enterprise use cases. While existing retrieval-augmented generation (RAG) frameworks for large language model (LLM)-based agents are limited to static document retrieval, enterprise applications demand dynamic knowledge integration from diverse sources including ongoing conversations and business data. Zep addresses this fundamental limitation through its core component Graphiti -- a temporally-aware knowledge graph engine that dynamically synthesizes both unstructured conversational data and structured business data while maintaining historical relationships. In the DMR benchmark, which the MemGPT team established as their primary evaluation metric, Zep demonstrates superior performance (94.8% vs 93.4%). Beyond DMR, Zep's capabilities are further validated through the more challenging LongMemEval benchmark, which better reflects enterprise use cases through complex temporal reasoning tasks. In this evaluation, Zep achieves substantial results with accuracy improvements of up to 18.5% while simultaneously reducing response latency by 90% compared to baseline implementations. These results are particularly pronounced in enterprise-critical tasks such as cross-session information synthesis and long-term context maintenance, demonstrating Zep's effectiveness for deployment in real-world applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Zep adds a temporal knowledge graph layer for agent memory and reports modest benchmark gains over MemGPT, but the numbers are point estimates without controls or variance.

The main thing here is that Zep builds a memory service around Graphiti, a temporal knowledge graph that pulls in both ongoing conversations and structured business data while tracking history and relationships. It claims to beat MemGPT on the DMR benchmark and to do better on LongMemEval with lower latency. That core idea addresses a practical gap: most current agent memory is either static retrieval or simple buffers that lose temporal structure across sessions. The architecture that synthesizes unstructured and structured inputs over time is a straightforward extension of existing graph and RAG work, and it is presented clearly enough to understand what they tried. The choice of LongMemEval as a harder test that includes cross-session reasoning is also reasonable and more relevant to enterprise use than basic retrieval scores.

The soft spots sit in the results. The DMR improvement is 1.4 points with no error bars, run counts, or significance tests, and the LongMemEval gains are given as “up to 18.5%” accuracy and “90%” latency reduction without task breakdowns or variance. Small margins on these benchmarks are sensitive to prompt wording and retrieval settings, so it is not yet clear that the temporal synthesis is the causal driver rather than other implementation choices. The abstract also gives no detail on baseline configurations or how the comparisons were run. If the full paper supplies code, repeated trials, and ablations, those gaps could close; otherwise the evidence remains thin.

This work is aimed at engineers building production agents who need better long-term context than current RAG or MemGPT setups. A practitioner could extract useful design patterns from the Graphiti description even if the numbers need independent checking. I would send it to peer review because the problem is timely and the approach is a concrete attempt to move past static memory, though any referee would rightly ask for tighter experimental reporting.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Zep, a memory layer service for LLM-based agents featuring Graphiti, a temporally-aware knowledge graph engine that dynamically integrates unstructured conversational data with structured business data while preserving historical relationships. It claims to outperform the prior state-of-the-art MemGPT on the Deep Memory Retrieval (DMR) benchmark (94.8% vs. 93.4% accuracy) and to deliver up to 18.5% higher accuracy together with 90% lower response latency on the more challenging LongMemEval benchmark, with particular gains in cross-session synthesis and long-term context maintenance.

Significance. If the reported gains are robust, the work would constitute a practical advance in agent memory architectures for enterprise settings that require ongoing temporal reasoning over mixed conversational and structured sources, moving beyond static RAG limitations.

major comments (2)

Abstract: The headline performance claims consist solely of point estimates (94.8% vs 93.4% on DMR; up to 18.5% accuracy and 90% latency on LongMemEval) with no accompanying information on the number of runs, standard deviations, confidence intervals, baseline hyperparameter settings, or statistical significance tests. Without these controls, it is impossible to determine whether the observed margins exceed experimental noise, which is known to be high on retrieval benchmarks sensitive to prompt phrasing and retrieval parameters.
Abstract / Evaluation section: The manuscript asserts that LongMemEval 'better reflects enterprise use cases' and that the reported gains are 'particularly pronounced in enterprise-critical tasks,' yet provides no explicit justification, task breakdown, or ablation showing that the temporal synthesis performed by Graphiti is the causal factor rather than other implementation differences.
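The first major comment can be illustrated with a short calculation. The sketch below computes a Wilson score interval for each DMR accuracy and a pooled two-proportion z statistic for their difference. The sample size is an assumption: the text above does not state how many questions were evaluated, so `n = 500` is a placeholder. The z-test also treats the two systems as independent samples; since both are presumably scored on the same questions, a paired test such as McNemar's would be the better choice once per-question results are available.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (95% by default)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """Pooled two-proportion z statistic, assuming independent samples."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Reported DMR accuracies; n is a hypothetical placeholder, not from the paper.
n = 500
zep_acc, memgpt_acc = 0.948, 0.934

zep_ci = wilson_interval(zep_acc, n)
memgpt_ci = wilson_interval(memgpt_acc, n)
z_stat = two_proportion_z(zep_acc, memgpt_acc, n, n)

print(f"Zep    95% CI: {zep_ci[0]:.3f} to {zep_ci[1]:.3f}")
print(f"MemGPT 95% CI: {memgpt_ci[0]:.3f} to {memgpt_ci[1]:.3f}")
print(f"z statistic:   {z_stat:.2f}")
```

Under this assumed sample size the two intervals overlap and the z statistic falls below the conventional 1.96 threshold, which is why the referee asks for run counts and variance before reading the 1.4-point margin as a real difference. A larger evaluation set would change that conclusion, so the actual n matters.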

minor comments (1)

The abstract introduces Graphiti without a concise one-sentence definition of its core data model or update mechanism before stating its performance advantages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the evaluation presentation and benchmark justification. We address each major comment below and will revise the manuscript to strengthen these aspects.

Referee: Abstract: The headline performance claims consist solely of point estimates (94.8% vs 93.4% on DMR; up to 18.5% accuracy and 90% latency on LongMemEval) with no accompanying information on the number of runs, standard deviations, confidence intervals, baseline hyperparameter settings, or statistical significance tests. Without these controls, it is impossible to determine whether the observed margins exceed experimental noise, which is known to be high on retrieval benchmarks sensitive to prompt phrasing and retrieval parameters.

Authors: We agree that additional statistical context would improve interpretability of the results. In the revised manuscript, we will report the number of evaluation runs, standard deviations for accuracy and latency metrics, baseline hyperparameter settings, and a brief discussion of result stability across runs. Formal statistical significance testing was not performed in the original experiments due to the focus on practical deployment metrics, but the observed margins remained consistent; we will note this limitation explicitly.
revision: yes

Referee: Abstract / Evaluation section: The manuscript asserts that LongMemEval 'better reflects enterprise use cases' and that the reported gains are 'particularly pronounced in enterprise-critical tasks,' yet provides no explicit justification, task breakdown, or ablation showing that the temporal synthesis performed by Graphiti is the causal factor rather than other implementation differences.

Authors: We acknowledge the need for clearer justification. The revised manuscript will include a task breakdown of LongMemEval, grouping tasks by temporal reasoning requirements such as cross-session synthesis and long-term context maintenance. We will also add an ablation comparing Graphiti with its temporal components disabled, which isolates the contribution of temporal knowledge graph synthesis to the accuracy gains on these tasks and supports the claim that LongMemEval better captures enterprise temporal reasoning needs.
revision: yes
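The task breakdown and ablation promised in this response could be tabulated along the following lines. This is a sketch under assumptions: the record layout (task label, correct with the full system, correct with temporal components disabled), the function name `accuracy_by_task`, and the task labels are all hypothetical, and the four records are made-up placeholders rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Group per-question outcomes by task type and compare full vs ablated.

    Each record is (task, full_correct, ablated_correct), where the booleans
    say whether the full system and the ablated system answered correctly.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [n, full correct, ablated correct]
    for task, full_ok, ablated_ok in records:
        counts = totals[task]
        counts[0] += 1
        counts[1] += int(full_ok)
        counts[2] += int(ablated_ok)
    return {
        task: {
            "n": n,
            "full": full / n,
            "ablated": ablated / n,
            "delta": (full - ablated) / n,  # gain attributable to the ablated part
        }
        for task, (n, full, ablated) in totals.items()
    }


# Made-up placeholder outcomes, for illustration only.
toy_records = [
    ("temporal-reasoning", True, False),
    ("temporal-reasoning", True, True),
    ("single-session", True, True),
    ("single-session", False, False),
]

for task, row in accuracy_by_task(toy_records).items():
    print(task, row)
```

A per-task `delta` that is large on temporal tasks and near zero elsewhere would support the causal claim; a uniform delta would suggest the gains come from something other than temporal synthesis.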

Circularity Check

0 steps flagged

No derivation chain present; claims are purely empirical benchmark comparisons.

The paper describes an architecture (Zep/Graphiti) for temporal knowledge graph memory in agents and reports direct performance numbers on DMR (94.8% vs 93.4%) and LongMemEval (up to 18.5% accuracy gain, 90% latency reduction). No equations, first-principles derivations, fitted parameters, uniqueness theorems, or self-citation load-bearing steps appear in the provided text or abstract. All central claims reduce to external benchmark runs against independent baselines (MemGPT), with no internal reduction to the paper's own inputs or prior self-work. This is the expected non-circular outcome for a systems/engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the performance of a newly introduced system component with no free parameters, axioms, or formal derivations specified in the abstract.

invented entities (1)

Graphiti · no independent evidence
purpose: temporally-aware knowledge graph engine that dynamically synthesizes unstructured conversational data and structured business data while maintaining historical relationships
New system component introduced as the core of Zep without external validation or formal definition provided in the abstract.

pith-pipeline@v0.9.0 · 5547 in / 1083 out tokens · 67375 ms · 2026-05-11T10:57:19.413779+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
cs.AI 2026-05 conditional novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
cs.CR 2026-05 unverdicted novelty 8.0

Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
Agentic Recommender System with Hierarchical Belief-State Memory
cs.CL 2026-05 unverdicted novelty 7.0

MARS uses hierarchical memory and LLM planning to achieve 26.4% higher HR@1 on InstructRec benchmarks compared to prior methods.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 unverdicted novelty 7.0

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
MEME: Multi-entity & Evolving Memory Evaluation
cs.LG 2026-05 unverdicted novelty 7.0

All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
cs.AI 2026-05 unverdicted novelty 7.0

Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
cs.CL 2026-05 unverdicted novelty 7.0

DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
cs.AI 2026-05 unverdicted novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
cs.AI 2026-05 unverdicted novelty 7.0

MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of basel...
Belief Memory: Agent Memory Under Partial Observability
cs.AI 2026-05 unverdicted novelty 7.0

BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...
Belief Memory: Agent Memory Under Partial Observability
cs.AI 2026-05 unverdicted novelty 7.0

BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
cs.MA 2026-05 unverdicted novelty 7.0

MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
cs.CL 2026-05 unverdicted novelty 7.0

MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
cs.IR 2026-04 conditional novelty 7.0

vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BE...
SAGER: Self-Evolving User Policy Skills for Recommendation Agent
cs.IR 2026-04 unverdicted novelty 7.0

SAGER equips LLM recommendation agents with per-user evolving policy skills via two-representation architecture, contrastive CoT diagnosis, and skill-augmented listwise reasoning, yielding SOTA gains orthogonal to mem...
AgenticSZZ: Temporal Knowledge Graph-Guided Agentic Bug-Inducing Commit Identification
cs.SE 2026-02 conditional novelty 7.0

AgenticSZZ reframes bug-inducing commit identification as temporal knowledge graph search navigated by an LLM agent, reporting F1 scores of 0.47-0.79 and up to 34% improvement over prior SZZ methods on three datasets.
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
cs.CL 2025-11 unverdicted novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
MIRIX: Multi-Agent Memory System for LLM-Based Agents
cs.CL 2025-07 unverdicted novelty 7.0

MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
cs.CL 2025-07 unverdicted novelty 7.0

MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods ...
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
cs.CL 2026-05 unverdicted novelty 6.0

Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggrega...
The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems
cs.AI 2026-05 unverdicted novelty 6.0

ActiveGraph inverts traditional agent frameworks by treating the append-only event log as the primary source of truth, from which the reactive graph is projected, yielding deterministic replay, forking, and lineage tracking.
H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure
cs.CL 2026-05 unverdicted novelty 6.0

H-Mem introduces a hybrid tree-plus-graph memory mechanism that evolves short-term agent memories into long-term summaries and enables efficient retrieval, reporting state-of-the-art QA results on three benchmarks.
Agentic Recommender System with Hierarchical Belief-State Memory
cs.CL 2026-05 unverdicted novelty 6.0

MARS uses hierarchical event-preference-profile memory with an LLM-scheduled lifecycle of six operations to achieve state-of-the-art results on InstructRec benchmarks.
Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations
cs.AI 2026-05 conditional novelty 6.0

A hybrid LLM-symbolic verifier maintains a dependency graph over conversation turns classified into eight formal update operations, enabling linear-time groundedness checks and precise retraction propagation with a co...
PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents
cs.CL 2026-05 unverdicted novelty 6.0

PAI-2 improves factual correctness in LLM answers by 4% on average across benchmarks using adaptive graph traversal and planning, with 6% gains from traversal algorithms and 18% from enabled planning.
CogniFold: Always-On Proactive Memory via Cognitive Folding
cs.AI 2026-05 unverdicted novelty 6.0

Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
cs.CL 2026-05 unverdicted novelty 6.0

PRISM is a new inference-time retrieval system that achieves higher accuracy than baselines on long-horizon agent tasks while using an order of magnitude less context by combining hierarchical graph search, intent-bas...
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
cs.CL 2026-05 unverdicted novelty 6.0

PRISM achieves higher accuracy than baselines on long-horizon agent tasks at an order-of-magnitude smaller context budget by combining hierarchical bundle search, query-sensitive costing, evidence compression, and ada...
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
cs.AI 2026-05 unverdicted novelty 6.0

SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
cs.CL 2026-05 conditional novelty 6.0

Intent-aware retrieval over assertion-labeled knowledge graphs improves clinical QA accuracy by 22 percentage points on a new MIMIC-IV benchmark that stresses negation, temporality, and attribution.
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
cs.AI 2026-05 unverdicted novelty 6.0

HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
GASim: A Graph-Accelerated Hybrid Framework for Social Simulation
cs.AI 2026-05 unverdicted novelty 6.0

GASim accelerates hybrid LLM-ABM social simulations via graph-optimized memory, graph message passing, and entropy-driven agent grouping, delivering 9.94x speedup and under 20% token use while aligning with real-world trends.
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
cs.CR 2026-05 unverdicted novelty 6.0

The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.
Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
cs.AI 2026-04 unverdicted novelty 6.0

Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
cs.IR 2026-04 unverdicted novelty 6.0

MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
cs.IR 2026-04 unverdicted novelty 6.0

MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
GAM: Hierarchical Graph-based Agentic Memory for LLM Agents
cs.AI 2026-04 unverdicted novelty 6.0

GAM decouples event-level memory encoding from topic-level consolidation in LLM agents using hierarchical graphs to reduce interference and improve long-term coherence and retrieval.
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
cs.CL 2026-04 unverdicted novelty 6.0

TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing t...
MemReader: From Passive to Active Extraction for Long-Term Agent Memory
cs.CL 2026-04 unverdicted novelty 6.0

MemReader uses distilled passive and GRPO-trained active extractors to selectively write low-noise long-term memories, outperforming passive baselines on knowledge updating, temporal reasoning, and hallucination tasks.
Task-Adaptive Retrieval over Agentic Multi-Modal Web Histories via Learned Graph Memory
cs.IR 2026-04 unverdicted novelty 6.0

ACGM learns task-adaptive sparse graphs over multi-modal agent histories via policy-gradient optimization, reaching 82.7 nDCG@10 and 89.2% Precision@10 on WebShop, VisualWebArena, and Mind2Web while outperforming 19 b...
HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues
cs.CL 2026-04 unverdicted novelty 6.0

HingeMem segments dialogue memory via boundary-triggered hyperedges over four elements and applies query-adaptive retrieval, yielding ~20% relative gains and 68% lower QA token cost versus baselines on LOCOMO.
FileGram: Grounding Agent Personalization in File-System Behavioral Traces
cs.CV 2026-04 unverdicted novelty 6.0

FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.
Opal: Private Memory for Personal AI
cs.CR 2026-04 unverdicted novelty 6.0

Opal enables private long-term memory for personal AI by decoupling reasoning to a trusted enclave with a lightweight knowledge graph and piggybacking reindexing on ORAM accesses.
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
cs.CL 2026-04 unverdicted novelty 6.0

A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling
cs.AI 2026-02 unverdicted novelty 6.0

HyMem introduces dual-granular memory storage with a lightweight summary module for fast responses and selective activation of a deep LLM module for complex queries, outperforming full-context baselines by 92.6% lower...
Memory in the Age of AI Agents
cs.CL 2025-12 unverdicted novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents
cs.LG 2026-05 unverdicted novelty 5.0

Memory-R2 proposes LoGo-GRPO to fix unfair trajectory comparisons in RL training of memory-augmented LLM agents by combining global end-to-end rewards with local rerollouts from identical memory states.
PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning
cs.MA 2026-05 unverdicted novelty 5.0

PyraVid is a hierarchical multimodal memory system that structures long videos into pyramids to improve long-horizon reasoning and evidence aggregation.
GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory
cs.CL 2026-05 unverdicted novelty 5.0

GRAVITY adds structured relational, temporal, and thematic memory anchors to conversational LLMs at generation time, delivering 7.5-10.1% average gains in LLM-judge accuracy across five host systems on LongMemEval and LoCoMo.
EgoSelf: From Memory to Personalized Egocentric Assistant
cs.CV 2026-04 unverdicted novelty 5.0

EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer
cs.SE 2026-04 unverdicted novelty 5.0

Agentic Consensus replaces code as the main artifact with a typed property graph world model that maintains commitments and evidence through synchronization operators, shifting evaluation to alignment fidelity and con...
The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward
cs.AI 2026-04 unverdicted novelty 5.0

AI intelligence is limited by the lack of an architecture that carries forward understanding across sessions, and the proposed continuity layer with Decomposed Trace Convergence Memory addresses this by enabling persi...
Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve
cs.AI 2026-04 unverdicted novelty 5.0

Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.
Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure
cs.AI 2026-04 unverdicted novelty 5.0

OIDA is a proposed framework that represents organizational knowledge as epistemic Knowledge Objects with class-specific importance decay and signed contradictions, plus a QUESTION mechanism that surfaces modeled igno...
Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure
cs.AI 2026-04 unverdicted novelty 5.0

OIDA adds typed knowledge objects, decay-based importance scores, contradiction edges, and an inverse-decay QUESTION primitive for ignorance to raise epistemic fidelity beyond retrieval.
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
cs.MA 2026-04 unverdicted novelty 5.0

MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
cs.MA 2026-04 unverdicted novelty 5.0

MemCoT transforms long-context LLM reasoning into an iterative stateful search using multi-view memory for evidence localization and dual short-term memory for guiding decisions, achieving SOTA on LoCoMo and LongMemEv...
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
cs.AI 2026-04 unverdicted novelty 5.0

MemMachine stores entire conversational episodes and applies contextualized retrieval plus adaptive query routing to achieve 0.9169 accuracy on LoCoMo and 93 percent on LongMemEvalS while using 80 percent fewer tokens...

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 58 Pith papers

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
[2] K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[3] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems, 2024.
[4] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization, 2024.
[5] Zep. Zep: Long-term memory for ai agents. https://www.getzep.com, 2024. Commercial memory layer for AI applications.
[6] Zep. Graphiti: Temporal knowledge graphs for agentic applications. https://github.com/getzep/graphiti, 2024. Graphiti builds dynamic, temporally aware Knowledge Graphs that represent complex, evolving relationships between entities over time.
[7] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory, 2024.
[8] Daniela Wong Gonzalez. The relationship between semantic and episodic memory: Exploring the effect of semantic neighbourhood density on episodic memory. PhD thesis, University of Windsor, 2018.
[9] Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Mikhail Burtsev, and Evgeny Burnaev. Arigraph: Learning knowledge graph world models with episodic memory for llm agents, 2024.
[10] Xinyue Chen, Pengyu Gao, Jiangjiang Song, and Xiaoyang Tan. Hiqa: A hierarchical contextual augmentation rag for multi-documents qa, 2024.
[11] Krish Goel and Mahek Chandak. Hiro: Hierarchical information retrieval optimization, 2024.
[12] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.
[13] Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. 2002.
[14] V. A. Traag, L. Waltman, and N. J. van Eck. From louvain to leiden: guaranteeing well-connected communities. Sci Rep 9, 5233, 2019.
[15] Neo4j. Neo4j - the world’s leading graph database, 2012.
[16] Apache Software Foundation. Apache lucene - scoring, 2011. Last accessed: 20 October 2011.
[17] Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. Lightrag: Simple and fast retrieval-augmented generation, 2024.
[18] Jimmy Lin, Ronak Pradeep, Tommaso Teofili, and Jasper Xian. Vector search with openai embeddings: Lucene is all you need, 2023.
[19] Prafulla Kumar Choubey, Xin Su, Man Luo, Xiangyu Peng, Caiming Xiong, Tiep Le, Shachar Rosenman, Vasudev Lal, Phil Mui, Ricky Ho, Phillip Howard, and Chien-Sheng Wu. Distill-synthkg: Distilling knowledge graph synthesis workflow for improved coverage and efficiency, 2024.
[20] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’09, pages 758–759. ACM, 2009.
[21] Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, pages 335–336, New York, NY, USA, 1998. Association for Computing Machinery.
[22] Jing Xu, Arthur Szlam, and Jason Weston. Beyond goldfish memory: Long-term open-domain conversation, 2021.
[23] Chaofan Li, Zheng Liu, Shitao Xiao, and Yingxia Shao. Making large language models a better foundation for dense retrieval, 2023.
[24] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
[25] Shreyas Pimpalgaonkar, Nolan Tremelling, and Owen Colegrove. Triplex: a sota llm for knowledge graph construction, 2024.
[26] Shilong Li, Yancheng He, Hangyu Guo, Xingyuan Bu, Ge Bai, Jie Liu, Jiaheng Liu, Xingwei Qu, Yangguang Li, Wanli Ouyang, Wenbo Su, and Bo Zheng. Graphreader: Building graph-based agent to enhance long-context abilities of large language models, 2024.
[27] Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. Financebench: A new benchmark for financial question answering, 2023.
[28] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models, 2021.
