Recognition: 3 theorem links · Lean Theorem
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Pith reviewed 2026-05-10 23:07 UTC · model grok-4.3
The pith
Mem0 dynamically extracts and consolidates key facts from conversations to give LLMs reliable long-term memory without processing full histories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mem0 is a scalable memory-centric architecture that dynamically extracts, consolidates, and retrieves salient information from ongoing conversations. An enhanced variant uses graph-based representations to capture complex relational structures among conversational elements. On the LOCOMO benchmark it outperforms established memory systems, RAG setups, full-context processing, open-source solutions, proprietary systems, and dedicated memory platforms across single-hop, temporal, multi-hop, and open-domain questions. Mem0 achieves a 26% relative improvement in the LLM-as-a-Judge metric over OpenAI, the graph version scores about 2% higher overall, and Mem0 attains 91% lower p95 latency with more than 90% token cost savings compared with full-context processing.
What carries the argument
Mem0's dynamic extraction, consolidation, and retrieval pipeline for salient conversational information, together with its optional graph-based memory representation for relational structures.
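The extract, consolidate, retrieve loop named above can be made concrete with a minimal sketch. It is not Mem0's implementation: the `ToyMemory` class, the first-person-sentence "extractor", the word-overlap similarity, and the 0.5 merge threshold are all illustrative assumptions standing in for the model-driven components the paper describes.

```python
import re
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercased word set used as a crude similarity signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


@dataclass
class ToyMemory:
    """Hypothetical in-memory store; every name here is illustrative."""
    facts: list[str] = field(default_factory=list)
    merge_threshold: float = 0.5  # assumed cutoff for "same fact, newer wording"

    def extract(self, message: str) -> list[str]:
        # Stand-in for salient-fact extraction: keep first-person sentences.
        sentences = re.split(r"(?<=[.!?])\s+", message.strip())
        return [s for s in sentences if re.search(r"\b(I|my)\b", s, re.IGNORECASE)]

    def consolidate(self, candidate: str) -> None:
        # Overwrite a near-duplicate with the newer statement, else append.
        cand = _tokens(candidate)
        for i, old in enumerate(self.facts):
            if _jaccard(cand, _tokens(old)) >= self.merge_threshold:
                self.facts[i] = candidate
                return
        self.facts.append(candidate)

    def add(self, message: str) -> None:
        for fact in self.extract(message):
            self.consolidate(fact)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Rank stored facts by word overlap with the query.
        q = _tokens(query)
        ranked = sorted(self.facts, key=lambda f: _jaccard(q, _tokens(f)), reverse=True)
        return ranked[:k]


memory = ToyMemory()
memory.add("I live in Berlin. The weather is nice today.")
memory.add("I live in Munich.")
memory.add("My dog is called Rex.")
top = memory.retrieve("Where do I live?", k=1)
```

In this toy run the weather remark is never stored, the Munich statement replaces the Berlin one, and only the stored facts (not the raw dialogue) are searched at query time. That is the property the efficiency claims depend on, and also the property the load-bearing premise questions.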
If this is right
- Outperforms all tested baselines on single-hop, temporal, multi-hop, and open-domain questions.
- Delivers 26% relative gain in LLM-as-a-Judge score over OpenAI memory.
- Graph memory variant adds roughly 2% overall score improvement over the base Mem0.
- Reduces p95 latency by 91% and token cost by more than 90% versus full-context processing.
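The percentages in this list are relative quantities, which is easy to misread as absolute score points. The sketch below shows the arithmetic only; the input numbers are hypothetical and chosen for round results, not taken from the paper, and the nearest-rank percentile is one common convention that the paper may or may not use.

```python
import math


def relative_improvement(new: float, baseline: float) -> float:
    """Gain of `new` over `baseline`, expressed as a fraction of the baseline."""
    return (new - baseline) / baseline


def p95(samples: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def relative_reduction(new: float, baseline: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return 1.0 - new / baseline


# Hypothetical judge scores: 13 absolute points on a baseline of 50.
judge_baseline, judge_system = 50.0, 63.0
gain = relative_improvement(judge_system, judge_baseline)  # about 0.26

# Hypothetical per-query latencies in seconds.
full_context_latencies = [float(s) for s in range(1, 21)]
memory_latencies = [s / 10 for s in full_context_latencies]
cut = relative_reduction(p95(memory_latencies), p95(full_context_latencies))  # about 0.9
```

A "26% relative gain" here corresponds to 13 absolute points, which is why the referee's request for absolute values alongside relative ones matters when comparing systems.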
Where Pith is reading between the lines
- If extraction remains reliable at scale, the approach could support agents that maintain coherence across weeks of interaction rather than single sessions.
- The relational graph may prove especially useful for tasks that track how facts evolve or connect over time, suggesting targeted tests on longer dependency chains.
- Combining this memory layer with other agent components such as planning or tool use could further improve production deployment without proportional cost increases.
- The efficiency gains open the possibility of running multiple parallel agents on the same hardware while each retains its own long-term context.
Load-bearing premise
Extracting and consolidating only the most salient facts from conversations preserves every piece of context required for correct answers to complex multi-hop and temporal questions.
What would settle it
A new evaluation set of long multi-session dialogues containing explicit temporal chains and multi-hop dependencies where full-context processing scores measurably higher than Mem0 on accuracy metrics.
Original abstract
Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves 26% relative improvements in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves around 2% higher overall score than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Mem0, a scalable memory-centric architecture for LLMs that dynamically extracts, consolidates, and retrieves salient information from multi-session conversations, along with a graph-based variant for capturing relational structures. It evaluates both variants on the LOCOMO benchmark against six categories of baselines (memory-augmented systems, RAG variants, full-context, open-source, proprietary, and dedicated platforms), claiming consistent outperformance across single-hop, temporal, multi-hop, and open-domain questions, including a 26% relative gain in LLM-as-Judge over OpenAI, ~2% additional gain from the graph variant, 91% lower p95 latency, and >90% token cost savings versus full-context.
Significance. If the results hold after addressing the gaps below, this would represent a practical contribution to production-ready long-term memory for AI agents, with notable efficiency advantages over full-context baselines that could enable scalable deployment. The breadth of baseline comparisons across question categories is a strength, though the absence of targeted ablations and error analysis limits the ability to attribute gains specifically to the proposed extraction and graph mechanisms.
major comments (2)
- [Experimental evaluation (Section 4)] LOCOMO results: Aggregate scores are reported for the four question categories and the LLM-as-a-Judge metric, but no per-question error analysis, extraction-precision audit against gold facts, or ablation isolating dynamic extraction/consolidation failures from retrieval/graph issues is provided. This is load-bearing for the central claim, as omissions in temporal anchors or cross-turn entities could explain gains without the memory mechanism itself being superior.
- [Methodology] Implementation details: The manuscript does not specify data splits for LOCOMO, exact extraction prompts/models, graph construction algorithm, or precise configurations for all six baseline categories (e.g., chunk sizes and k for RAG). Without these, the 26% relative improvement and efficiency metrics cannot be independently verified or reproduced.
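One way to approximate the extraction-precision audit raised in the first major comment is to label a random sample of extracted memories by hand and report precision with a confidence interval. The sketch below uses the Wilson score interval; the sample size and count are hypothetical, and nothing here reflects numbers from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical audit: 200 sampled memories, 184 judged correct by annotators.
low, high = wilson_interval(184, 200)
```

With these made-up counts the point estimate is 0.92 and the interval runs from roughly 0.87 to 0.95, which gives a reader a sense of how much a 200-item sample can and cannot establish.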
minor comments (1)
- [Abstract] The abstract states 'around 2% higher overall score' for the graph variant; the main text should report the exact metric, absolute values, and statistical significance for this comparison.
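A difference of about 2% can be checked for significance with a paired bootstrap over per-question scores, since both variants answer the same questions. The following is a sketch under stated assumptions: the function name is hypothetical, the scores are synthetic, and the percentile interval is only one of several reasonable choices.

```python
import random


def paired_bootstrap_ci(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean difference, CI low, CI high) for b - a on paired items."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


# Synthetic per-question judge scores (1 = correct, 0 = incorrect).
base = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 20
graph = [1, 1, 1, 1, 0, 1, 0, 1, 1, 0] * 20
mean_diff, ci_low, ci_high = paired_bootstrap_ci(base, graph)
```

If the resulting interval excludes zero, the gap is unlikely to be resampling noise alone; an interval that straddles zero would suggest the reported uplift needs more questions or more runs to support it.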
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical contributions of Mem0 to scalable long-term memory for AI agents. The comments highlight important areas for improving the strength of our claims and reproducibility. We address each major comment below and have revised the manuscript to incorporate additional analysis and details where feasible.
Point-by-point responses
-
Referee: [Experimental evaluation (Section 4)] LOCOMO results: Aggregate scores are reported for the four question categories and the LLM-as-a-Judge metric, but no per-question error analysis, extraction-precision audit against gold facts, or ablation isolating dynamic extraction/consolidation failures from retrieval/graph issues is provided. This is load-bearing for the central claim, as omissions in temporal anchors or cross-turn entities could explain gains without the memory mechanism itself being superior.
Authors: We agree that aggregate metrics alone make it harder to isolate the contributions of dynamic extraction, consolidation, and graph-based retrieval. In the revised manuscript we will add a dedicated error analysis subsection in Section 4 that provides per-category breakdowns (single-hop, temporal, multi-hop, open-domain) with representative success and failure examples, focusing on cases involving temporal anchors and cross-turn entities. We will also include targeted ablations: (i) Mem0 without dynamic extraction/consolidation, (ii) base Mem0 versus graph variant, and (iii) retrieval-only versus full memory pipeline. These will help attribute gains more precisely to the proposed mechanisms. A full extraction-precision audit against gold facts is not possible because LOCOMO does not provide such annotations; we will instead report precision estimates from manual inspection of a sampled subset of extracted memories and note this as a limitation. revision: partial
-
Referee: [Methodology] Implementation details: The manuscript does not specify data splits for LOCOMO, exact extraction prompts/models, graph construction algorithm, or precise configurations for all six baseline categories (e.g., chunk sizes and k for RAG). Without these, the 26% relative improvement and efficiency metrics cannot be independently verified or reproduced.
Authors: We acknowledge that the original manuscript omitted several implementation details necessary for full reproducibility. The revised version will expand the Experimental Setup section with: (1) LOCOMO data usage and any train/test splits applied; (2) the exact extraction and consolidation prompts together with the underlying models (gpt-4o for extraction, gpt-4o-mini for retrieval); (3) the graph construction algorithm, which uses LLM-based entity-relation extraction followed by incremental graph updates; and (4) complete baseline configurations, including chunk sizes (256/512/1024 tokens) and k values (3/5/10) for all RAG variants, as well as the exact settings for the other five baseline categories. These additions will allow independent verification of the reported accuracy gains, 91% p95 latency reduction, and >90% token cost savings. revision: yes
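The rebuttal describes graph construction as LLM-based entity-relation extraction followed by incremental graph updates. The sketch below illustrates only the second half. The `ToyMemoryGraph` class is hypothetical, the triples are hand-written stand-ins for extractor output, and the rule that a newer value replaces an older one for single-valued relations is an assumption, not a documented Mem0 policy.

```python
from collections import defaultdict


class ToyMemoryGraph:
    """Hypothetical relation store keyed by (subject, relation)."""

    def __init__(self, single_valued: set[str]):
        # Relations assumed to hold one object at a time (e.g. "lives_in").
        self.single_valued = single_valued
        self.edges: dict[tuple[str, str], list[tuple[str, int]]] = defaultdict(list)

    def update(self, subject: str, relation: str, obj: str, turn: int) -> None:
        key = (subject, relation)
        if relation in self.single_valued:
            # Assumed policy: a newer statement supersedes older ones.
            self.edges[key] = [(obj, turn)]
        elif all(existing != obj for existing, _ in self.edges[key]):
            # Multi-valued relation: add the object unless already present.
            self.edges[key].append((obj, turn))

    def objects(self, subject: str, relation: str) -> list[str]:
        return [obj for obj, _ in self.edges.get((subject, relation), [])]


graph = ToyMemoryGraph(single_valued={"lives_in"})
# Stand-in extractor output: (subject, relation, object, conversation turn).
for triple in [
    ("alice", "lives_in", "Berlin", 1),
    ("alice", "likes", "hiking", 2),
    ("alice", "lives_in", "Munich", 7),
    ("alice", "likes", "chess", 9),
    ("alice", "likes", "hiking", 12),
]:
    graph.update(*triple)
```

After these updates the graph holds a single residence (Munich) and two distinct likes. Whether an update policy like this one drops information that a temporal question later needs is the kind of failure the requested ablations would have to surface.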
Circularity Check
Empirical benchmark evaluation with no derivation chain
The paper proposes the Mem0 architecture for dynamic memory extraction/consolidation/retrieval in LLMs and evaluates it empirically on the external LOCOMO benchmark against six categories of baselines. All reported results (26% relative LLM-as-Judge gain, 2% graph variant uplift, 91% p95 latency reduction, >90% token savings) are direct performance comparisons to independent systems rather than any first-principles derivation, fitted-parameter prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes appear in the provided text; the central claims rest on aggregate benchmark scores without reduction to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Dynamic extraction of salient information from conversations can be done reliably enough to support multi-hop and temporal reasoning.
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
"We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements."
-
IndisputableMonolith.Foundation.LedgerForcing · conservation_from_balance · unclear
"Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves 26% relative improvements in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves around 2% higher overall score than the base configuration."
-
IndisputableMonolith.Foundation.DiscretenessForcing · discreteness_forced · unclear
"Mem0 attains a 91% lower p95 latency and saves more than 90% token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints."
Forward citations
Cited by 60 Pith papers
-
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
-
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
-
MEME: Multi-entity & Evolving Memory Evaluation
All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.
-
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
-
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
-
DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.
-
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
-
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
-
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
-
Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection
CiteTracer detects citation hallucinations at 97.1% accuracy on synthetic and real-world benchmarks by combining structured extraction, multi-source retrieval, deterministic matching, and class-specialist agents.
-
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of basel...
-
Stateful Agent Backdoor
A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.
-
Belief Memory: Agent Memory Under Partial Observability
BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.
-
Latent Preference Modeling for Cross-Session Personalized Tool Calling
Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.
-
vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BE...
-
The Missing Knowledge Layer in Cognitive Architectures for AI Agents
Cognitive architectures for AI agents require a distinct Knowledge layer with indefinite supersession persistence, separate from Memory decay, Wisdom evidence-gating, and Intelligence ephemerality.
-
GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing
GRAB-ANNS is a new GPU graph index that achieves up to 240x higher hybrid search throughput via bucket layouts and hybrid intra/inter-bucket edges.
-
$\delta$-mem: Efficient Online Memory for Large Language Models
δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...
-
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
PRISM achieves higher accuracy than baselines on long-horizon agent tasks at an order-of-magnitude smaller context budget by combining hierarchical bundle search, query-sensitive costing, evidence compression, and ada...
-
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
-
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
-
Beyond Similarity Search: Tenure and the Case for Structured Belief State in LLM Memory
Tenure replaces similarity search with a structured belief store using scope isolation and alias-weighted BM25 retrieval, achieving 1.0 precision on 72 cases where cosine similarity scores 0.12.
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
-
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
-
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
MemPrivacy replaces privacy-sensitive spans with structured placeholders on edge devices to enable effective cloud memory management while limiting utility loss to 1.6% and outperforming general models on privacy extraction.
-
MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
MemPrivacy uses edge detection of sensitive spans and type-aware placeholders to enable cloud-side memory management for LLM agents without exposing private data, achieving under 1.6% utility loss.
-
The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory
Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.
-
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
SkillMaster enables LLM agents to autonomously develop skills via trajectory review, counterfactual evaluation, and DualAdv-GRPO training, boosting success rates by 8.8% on ALFWorld and 9.3% on WebShop.
-
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
SkillMaster is a training framework that lets LLM agents autonomously propose, update, and apply skills, yielding 8.8% and 9.3% higher success rates on ALFWorld and WebShop than prior methods.
-
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and ra...
-
GASim: A Graph-Accelerated Hybrid Framework for Social Simulation
GASim accelerates hybrid LLM-ABM social simulations via graph-optimized memory, graph message passing, and entropy-driven agent grouping, delivering 9.94x speedup and under 20% token use while aligning with real-world trends.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
-
Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall
True Memory is a verbatim-event retrieval pipeline running on a single SQLite file that reaches 93% accuracy on LoCoMo multi-session questions, outperforming Mem0, Supermemory, Zep, and matching or exceeding EverMemOS...
-
Tree-based Credit Assignment for Multi-Agent Memory System
TreeMem assigns credit to agents in multi-agent memory systems by expanding outputs into a tree and using Monte Carlo averaging of final rewards to optimize each agent's policy.
-
ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 7...
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
Circuit analysis reveals that routing circuits for agent memory emerge at 0.6B parameters while content circuits emerge at 4B, with a shared grounding hub and an unsupervised diagnostic achieving 76.2% accuracy for lo...
-
MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents
A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
-
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
-
AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments
AgentEconomist is an end-to-end agentic system with idea development, experimental design, and execution stages that uses a large economics paper database to produce research ideas with better literature grounding, no...
-
Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...
-
Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents
ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.
-
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
-
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
-
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
-
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
-
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.
-
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
-
Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents
Dual-trace encoding improves LLM agent cross-session recall from 53.5% to 73.7% accuracy by storing facts alongside concrete scene reconstructions, with largest gains in temporal reasoning and multi-session aggregation.
-
GAM: Hierarchical Graph-based Agentic Memory for LLM Agents
GAM decouples event-level memory encoding from topic-level consolidation in LLM agents using hierarchical graphs to reduce interference and improve long-term coherence and retrieval.
-
Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo
Synthius-Mem achieves 94.37% accuracy and 99.55% adversarial robustness on LoCoMo by extracting and consolidating structured persona facts across six domains rather than retrieving dialogue segments.
-
Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards
Introduces MemHome benchmark and RL with multi-dimensional rewards for memory-driven smart home device control.
-
ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying
ADAM extracts data from LLM agent memory with up to 100% attack success rate by estimating data distribution and selecting queries via entropy guidance.