A-MEM: Agentic Memory for LLM Agents

Hang Gao; Juntao Tan; Kai Mei; Wujiang Xu; Yongfeng Zhang; Zujie Liang

arxiv: 2502.12110 · v11 · submitted 2025-02-17 · 💻 cs.CL · cs.HC

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang

Pith reviewed 2026-05-11 00:40 UTC · model grok-4.3

classification 💻 cs.CL cs.HC

keywords agentic memory · LLM agents · dynamic memory organization · Zettelkasten method · memory linking · memory evolution · knowledge networks · agent memory systems

The pith

An agentic memory system lets LLM agents dynamically index, link, and evolve interconnected knowledge networks from their experiences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a memory system for LLM agents that creates structured notes for each new experience and automatically links them to relevant past memories while updating older entries. It draws on Zettelkasten principles of dynamic indexing and connection-making to replace the fixed storage and retrieval used in prior agent memory designs. This setup allows the memory network to grow and refine itself as new tasks arrive. A reader would care because agents that maintain adaptable, linked histories can handle longer and more varied real-world sequences without forgetting or repeating mistakes. Experiments across six foundation models report better results than existing state-of-the-art memory baselines.

Core claim

The central claim is that an agentic memory system, by generating notes with contextual descriptions, keywords, and tags for each new memory and then analyzing historical memories to establish meaningful links and trigger updates to existing entries, produces an evolving interconnected knowledge network that improves agent performance on complex tasks.

What carries the argument

The agentic memory process that creates structured notes and performs dynamic similarity-based linking together with evolution updates to prior memories.
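As a rough illustration of the note-plus-linking mechanism, here is a minimal Python sketch. It is not the authors' implementation: the field names, the `add_note` helper, and the threshold are hypothetical, and plain keyword overlap (Jaccard) is swapped in for the paper's LLM-driven similarity analysis.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNote:
    """One memory entry; field names are illustrative, not taken from the paper."""
    note_id: int
    content: str
    context: str  # contextual description (written by the LLM in A-MEM)
    keywords: set[str] = field(default_factory=set)
    tags: set[str] = field(default_factory=set)
    links: set[int] = field(default_factory=set)


def jaccard(a: set[str], b: set[str]) -> float:
    """Keyword overlap, a stand-in for the paper's similarity analysis."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_note(store: dict[int, MemoryNote], note: MemoryNote,
             threshold: float = 0.25) -> None:
    """Insert a note and link it, in both directions, to similar older notes."""
    for old in store.values():
        if jaccard(note.keywords, old.keywords) >= threshold:
            note.links.add(old.note_id)
            old.links.add(note.note_id)
    store[note.note_id] = note
```

In the paper the link decision is made by the LLM rather than by a fixed threshold; the fixed rule here only keeps the sketch deterministic.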

If this is right

Agents gain adaptability across diverse tasks because memory organization is no longer limited to fixed operations and structures.
Historical experiences become more usable as new memories trigger refinements to the contextual representations of older ones.
The memory network continuously evolves rather than remaining static, supporting longer-term task sequences.
Performance gains appear consistently across multiple foundation models when compared with prior state-of-the-art memory systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Agents using this memory approach could maintain coherence over hundreds of steps without external human intervention to correct memory errors.
The same linking mechanism might be applied to multi-agent settings where separate agents share and evolve a joint memory network.
Efficiency questions arise for very large memory collections, where the cost of repeated similarity analysis could become a bottleneck.
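One common way to keep the per-insert cost in check is to shortlist a few candidates before any expensive link analysis runs. The sketch below assumes each note already has an embedding vector; the function names and the value of `k` are hypothetical and not drawn from the paper or its code.

```python
import heapq
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_candidates(query: list[float],
                     embeddings: dict[int, list[float]],
                     k: int = 3) -> list[int]:
    """Return ids of the k stored vectors most similar to the query.

    This is still a linear scan (one cosine per stored note); it only bounds
    how many notes are handed to the costlier linking step.
    """
    scored = ((cosine(query, vec), note_id) for note_id, vec in embeddings.items())
    return [note_id for _, note_id in heapq.nlargest(k, scored)]
```

The scan itself remains linear in the number of stored notes, so for very large collections an approximate nearest-neighbour index would be the natural next step.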

Load-bearing premise

The underlying LLM must reliably produce accurate contextual descriptions, keywords, tags, and meaningful links without introducing errors or hallucinations that degrade the overall memory network.

What would settle it

Measure task performance on the six foundation models when the system is used versus when fixed memory baselines are used; if no consistent improvement appears, or if incorrect links cause measurable degradation over long sequences, the central claim does not hold.
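A check for "consistent improvement" over paired runs could look like the sketch below. It uses an exact two-sided sign test, chosen here only because it needs nothing beyond the standard library; it is not a procedure the paper reports, and the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(system: list[float], baseline: list[float]) -> float:
    """Two-sided exact sign test on paired scores; ties are dropped."""
    wins = sum(s > b for s, b in zip(system, baseline))
    losses = sum(s < b for s, b in zip(system, baseline))
    n = wins + losses
    if n == 0:
        return 1.0  # no informative pairs
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Each pair would be one (model, task) score with and without the memory system. With only six pairs, the smallest attainable p-value is 2/64, so more than one benchmark per model is needed for a sharper verdict.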

Original abstract

While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/A-mem, while the source code of the agentic memory system is available at https://github.com/WujiangXu/A-mem-sys.
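The memory-evolution step described in the abstract can be sketched as follows. This is a toy under stated assumptions: notes are plain dicts with hypothetical keys (`id`, `context`, `tags`), and the `rewrite` callable stands in for the LLM call that decides whether and how an older note is revised.

```python
from typing import Callable, Optional

Note = dict  # hypothetical shape: {"id": int, "context": str, "tags": set[str]}


def evolve(new_note: Note, neighbours: list[Note],
           rewrite: Callable[[Note, Note], Optional[Note]]) -> list[int]:
    """Let a new note trigger updates to the existing notes it is linked to.

    `rewrite(new, old)` returns a revised copy of `old`, or None to leave it
    unchanged. Returns the ids of the notes that were actually modified.
    """
    changed = []
    for old in neighbours:
        revised = rewrite(new_note, old)
        if revised is not None and revised != old:
            old.update(revised)
            changed.append(old["id"])
    return changed


def merge_tags(new: Note, old: Note) -> Optional[Note]:
    """Toy rule in place of an LLM: adopt the new note's tags if any are shared."""
    if not new["tags"] & old["tags"]:
        return None
    return {**old, "tags": old["tags"] | new["tags"]}
```

Returning the list of changed ids makes it easy to log which older memories each insertion touched, which is the kind of trace an error-propagation analysis would need.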

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A-MEM gives LLM agents a Zettelkasten-style memory that creates notes, links them, and evolves old entries when new ones arrive, with public code and reported gains on six models.

The main contribution here is a memory architecture where the LLM itself handles note creation with contextual descriptions, keywords, and tags, then decides on links to historical entries and can revise those older notes as the network grows. This agent-driven evolution is the piece that feels new compared to more static graph or retrieval systems in prior agent work.

They back it with experiments across six models that beat the baselines they chose, and both the system code and the evaluation scripts are on GitHub, which makes the claims easier to inspect and try out directly. That openness is a real plus for anyone who wants to see the implementation details or build on it. The approach combines structured organization with on-the-fly decisions, which addresses the fixed-structure problem they point out in existing memory systems.

The soft spot is the closed loop the stress test highlights: every new note depends on the LLM producing accurate attributes and sensible links, and those outputs can then alter prior memories. If a generation step introduces errors or weak connections, they get indexed and potentially retrieved or amplified later. The abstract reports the performance lift but gives no numbers on note fidelity, link accuracy, or any validation steps for the generated content. Without that, it is difficult to separate real memory improvement from cases where the model simply handles the downstream tasks well.

This is aimed at researchers and engineers working on long-horizon LLM agents who need more adaptive memory handling. The concrete design and released code make it worth serious refereeing even with the current gaps in the validation details, because reviewers can check the implementation and ask for the missing checks on generation quality.

Referee Report

2 major / 2 minor

Summary. The paper proposes A-MEM, an agentic memory system for LLM agents inspired by Zettelkasten principles. It dynamically creates structured notes (contextual descriptions, keywords, tags) for new memories via LLM prompting, identifies links to historical memories, and enables memory evolution by updating prior entries' representations as new information integrates. This forms an evolving interconnected knowledge network. The central claim is that this yields superior performance improvements over existing SOTA baselines across experiments on six foundation models, with source code released at two GitHub repositories.

Significance. If the empirical gains are robust and the memory network remains stable, the work could meaningfully advance memory systems for LLM agents by enabling adaptive, context-aware organization beyond fixed retrieval or static graphs. The explicit release of both evaluation and system code is a clear strength that supports reproducibility and follow-on work.

major comments (2)

[Abstract and Experiments section] The claim of 'superior improvement against existing SOTA baselines' on six models is presented without any description of the experimental setup, specific baselines, evaluation metrics, statistical significance tests, task benchmarks, or controls for variance. This is load-bearing for the central empirical claim.
[§3 (Memory Addition and Evolution)] The system relies on the LLM to generate accurate contextual descriptions, keywords, tags, and links, then to rewrite existing memories. No quantitative fidelity check, error-rate measurement, or manual validation of generated attributes and link quality is reported. Because updates create a closed loop that can propagate errors, this directly affects whether the claimed performance gains can be sustained.

minor comments (2)

[Abstract] The abstract mentions 'recent attempts to incorporate graph databases' but does not cite specific prior systems; adding 1-2 concrete references would clarify the positioning.
[§3] Figure captions and algorithm pseudocode (if present in §3) could more explicitly label the LLM prompting steps versus the graph-update steps to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript describing A-MEM. The comments highlight important areas for strengthening the presentation of our empirical results and the reliability of the memory operations. We address each major comment below and have revised the manuscript accordingly to improve clarity, completeness, and rigor.

Referee: [Abstract and Experiments section] The claim of 'superior improvement against existing SOTA baselines' on six models is presented without any description of the experimental setup, specific baselines, evaluation metrics, statistical significance tests, task benchmarks, or controls for variance. This is load-bearing for the central empirical claim.

Authors: We agree that the abstract is high-level and does not enumerate experimental details. The Experiments section (Section 4) does describe the six foundation models, task benchmarks (agentic QA, tool-use, and multi-step reasoning tasks), SOTA baselines (including fixed-retrieval and graph-memory systems), metrics (success rate, latency, and memory efficiency), and variance controls via repeated runs with different seeds. However, we acknowledge that statistical significance testing (e.g., paired t-tests or Wilcoxon signed-rank tests with reported p-values) and more explicit baseline implementation details were not sufficiently highlighted. In the revised manuscript we will (1) update the abstract with a concise sentence on the evaluation framework and (2) add a dedicated “Experimental Setup” subsection that includes all requested elements plus significance tests. These changes directly support the central empirical claim.
Revision: yes

Referee: [§3 (Memory Addition and Evolution)] The system relies on the LLM to generate accurate contextual descriptions, keywords, tags, and links, then to rewrite existing memories. No quantitative fidelity check, error-rate measurement, or manual validation of generated attributes and link quality is reported. Because updates create a closed loop that can propagate errors, this directly affects whether the claimed performance gains can be sustained.

Authors: We recognize this as a substantive limitation. The original submission emphasizes end-to-end task performance and does not report direct fidelity measurements on the LLM-generated notes or links. In the revised version we will insert a new subsection (under Section 3 or 4) that presents quantitative validation: human evaluation on 200 randomly sampled memories measuring (a) accuracy of contextual descriptions, (b) relevance of keywords and tags, and (c) precision/recall of generated links. We will also report an error-propagation analysis by tracking how often an erroneous update affects downstream retrieval. These additions will allow readers to assess the robustness of the closed-loop evolution process.
Revision: yes
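The link precision/recall measurement proposed in the rebuttal could be computed along these lines. This is a minimal sketch: the function name and the representation of links as id pairs are assumptions for illustration, and links are treated as undirected.

```python
def link_precision_recall(predicted: set[tuple[int, int]],
                          gold: set[tuple[int, int]]) -> tuple[float, float]:
    """Precision and recall of generated links against human-labelled links.

    Links are undirected here, so (a, b) and (b, a) count as the same link.
    """
    def normalise(pairs: set[tuple[int, int]]) -> set[tuple[int, int]]:
        return {(min(a, b), max(a, b)) for a, b in pairs}

    pred, ref = normalise(predicted), normalise(gold)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall
```

Low precision would point to spurious links being indexed, while low recall would point to missed connections; the two failure modes affect the closed-loop evolution differently, so reporting both separately is more informative than a single score.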

Circularity Check

0 steps flagged

No circularity: empirical system proposal without derivational reductions

The paper presents a design for an agentic memory system that generates structured notes, identifies links, and evolves prior entries via LLM prompts, explicitly following Zettelkasten principles. No equations, fitted parameters, uniqueness theorems, or mathematical derivations appear in the abstract or description. All performance claims rest on external empirical experiments across six models against SOTA baselines rather than any internal self-definition, prediction-from-fit, or self-citation chain that reduces the central result to its own inputs by construction. The system is therefore self-contained as an engineering proposal evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that LLMs can perform reliable memory organization tasks and that the Zettelkasten-inspired structure improves agent performance; no explicit free parameters or invented physical entities are described.

axioms (2)

domain assumption: LLM agents require sophisticated memory organization beyond basic storage and retrieval to handle complex tasks effectively.
Stated in the opening of the abstract as motivation for the work.
domain assumption: Dynamic indexing, linking, and evolution of memories will produce an adaptive knowledge network superior to fixed-structure systems.
Core design principle presented as following the Zettelkasten method.

invented entity (1)

Agentic memory network with evolving links · no independent evidence
purpose: To enable continuous refinement of historical memories through new integrations.
Introduced as the core output of the system; no independent falsifiable evidence outside the proposed implementation is provided in the abstract.

pith-pipeline@v0.9.0 · 5581 in / 1433 out tokens · 39994 ms · 2026-05-11T00:40:57.572182+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.LedgerForcing conservation_from_balance echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist.
Foundation.LedgerForcing add_event_balanced echoes

Additionally, this process enables memory evolution – as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding.
Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

UNCLEAR: Relation between the paper passage and the cited Recognition theorem.

Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
cs.CL 2026-05 conditional novelty 8.0

GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
cs.AI 2026-05 unverdicted novelty 8.0

RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
cs.AI 2026-05 conditional novelty 8.0

MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
cs.CL 2026-05 unverdicted novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
cs.CL 2026-03 unverdicted novelty 8.0

AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.
MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts
cs.IR 2026-05 unverdicted novelty 7.0

MemConflict provides a benchmark for testing LLM long-term memory systems under dynamic, static, and conditional conflicts involving temporal validity, factual correctness, and contextual applicability.
MemGym: a Long-Horizon Memory Environment for LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
EXG: Self-Evolving Agents with Experience Graphs
cs.AI 2026-05 unverdicted novelty 7.0

EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
cs.CL 2026-05 unverdicted novelty 7.0

GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 unverdicted novelty 7.0

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
cs.SE 2026-05 unverdicted novelty 7.0

SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
cs.CL 2026-05 unverdicted novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 7.0

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
cs.AI 2026-05 unverdicted novelty 7.0

LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
cs.AI 2026-05 unverdicted novelty 7.0

Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
cs.AI 2026-05 unverdicted novelty 7.0

Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
cs.AI 2026-05 unverdicted novelty 7.0

MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
cs.CV 2026-05 unverdicted novelty 7.0

Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...
EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
cs.AI 2026-05 unverdicted novelty 7.0

EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
cs.RO 2026-05 unverdicted novelty 7.0

MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
cs.RO 2026-05 unverdicted novelty 7.0

MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
cs.AI 2026-05 unverdicted novelty 7.0

A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
cs.CR 2026-05 unverdicted novelty 7.0

PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
Belief Memory: Agent Memory Under Partial Observability
cs.AI 2026-05 unverdicted novelty 7.0

BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines o...
Belief Memory: Agent Memory Under Partial Observability
cs.AI 2026-05 unverdicted novelty 7.0

BeliefMem stores multiple candidate conclusions with probabilities in agent memory and updates them via Noisy-OR rules to preserve uncertainty under partial observability.
MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing
cs.AI 2026-05 unverdicted novelty 7.0

MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
cs.CL 2026-05 unverdicted novelty 7.0

MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
AEL: Agent Evolving Learning for Open-Ended Environments
cs.CL 2026-04 conditional novelty 7.0

AEL uses a fast-timescale bandit for memory policy selection and slow-timescale LLM reflection for causal insights, achieving a Sharpe ratio of 2.13 on a 208-episode portfolio benchmark while showing that added mechan...
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
cs.CL 2026-04 unverdicted novelty 7.0

Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
cs.AI 2026-04 unverdicted novelty 7.0

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
cs.IR 2026-04 conditional novelty 7.0

vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BE...
When to Forget: A Memory Governance Primitive
cs.AI 2026-04 unverdicted novelty 7.0

Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.
ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
cs.AI 2026-04 unverdicted novelty 7.0

ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
cs.AI 2026-04 unverdicted novelty 7.0

PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.
GRAB-ANNS: High-Throughput Indexing and Hybrid Search via GPU-Native Bucketing
cs.DB 2026-03 unverdicted novelty 7.0

GRAB-ANNS is a new GPU graph index that achieves up to 240x higher hybrid search throughput via bucket layouts and hybrid intra/inter-bucket edges.
PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
cs.AI 2026-03 unverdicted novelty 7.0

PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams
cs.CL 2026-03 unverdicted novelty 7.0

SensorPersona uses LLMs for hierarchical reasoning on longitudinal mobile sensor streams to continually extract stable personas, showing up to 31.4% higher recall and 85.7% win rate over baselines on a 20-user dataset.
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
cs.CV 2026-03 unverdicted novelty 7.0

MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
cs.CL 2025-11 unverdicted novelty 7.0

Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
cs.CL 2025-11 unverdicted novelty 7.0

MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
$How^{2}$: How to learn from procedural How-to questions
cs.AI 2025-10 unverdicted novelty 7.0

$How^{2}$ is a memory agent framework enabling agents to ask, store, and reuse answers to how-to questions at varying abstraction levels for better lifelong planning in environments like Plancraft.
MIRIX: Multi-Agent Memory System for LLM-Based Agents
cs.CL 2025-07 unverdicted novelty 7.0

MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
cs.CL 2025-07 unverdicted novelty 7.0

MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods ...
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
cs.CR 2024-10 unverdicted novelty 7.0

ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
cs.CL 2026-05 unverdicted novelty 6.0

Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggrega...
Self-Evolving Multi-Agent Systems via Decentralized Memory
cs.MA 2026-05 unverdicted novelty 6.0

DecentMem is a decentralized dual-pool memory framework for self-evolving multi-agent systems that provides O(log T) regret guarantees and yields up to 23.8% accuracy gains over centralized baselines.
EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning
cs.CV 2026-05 unverdicted novelty 6.0

EvoIR-Agent formulates experience components into a hierarchical pool with a self-evolving update mechanism to improve performance and efficiency of training-free MLLM image restoration agents over prior paradigms.
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate
cs.CL 2026-05 unverdicted novelty 6.0

Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
MedExpMem: Adapting Experience Memory for Differential Diagnosis
cs.LG 2026-05 unverdicted novelty 6.0

MedExpMem lets VLM diagnostic agents store and retrieve experience from past failures as pairwise differential notes, producing up to 7% accuracy gains on a multi-subspecialty radiology benchmark.
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
cs.CV 2026-05 conditional novelty 6.0

MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
State Contamination in Memory-Augmented LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

Toxic context can be laundered into memory summaries that stay below toxicity thresholds while still driving higher downstream toxicity in LLM agents compared to neutral baselines.
DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory
cs.CL 2026-05 unverdicted novelty 6.0

DimMem introduces a dimensional memory framework that structures memories as typed atomic units to improve retrieval efficiency and accuracy for long-term LLM agent tasks.
H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure
cs.CL 2026-05 unverdicted novelty 6.0

H-Mem introduces a hybrid tree-plus-graph memory mechanism that evolves short-term agent memories into long-term summaries and enables efficient retrieval, reporting state-of-the-art QA results on three benchmarks.
Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory
cs.LG 2026-05 unverdicted novelty 6.0

SeqMem-Eval reveals that high final accuracy in sequential LLM memory tasks often coexists with substantial forgetting and negative transfer, exposing stability-adaptability trade-offs hidden by standard aggregate metrics.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 129 Pith papers · 8 internal anchors

[1]

How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking

Sönke Ahrens. How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking. Amazon, 2017. Second Edition

work page 2017
[2]

The claude 3 model family: Opus, sonnet, haiku

Anthropic. The claude 3 model family: Opus, sonnet, haiku. Anthropic, Mar 2024. Accessed May 2025

work page 2024
[3]

Claude 3.5 sonnet model card addendum

Anthropic. Claude 3.5 sonnet model card addendum. Technical report, Anthropic, 2025. Accessed May 2025

work page 2025
[4]

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023

work page internal anchor Pith review arXiv 2023
[5]

Meteor: An automatic metric for mt evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

work page 2005
[6]

Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022

work page 2022
[7]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023

work page 2023
[8]

mem0: The memory layer for ai agents

Khant Dev and Singh Taranjeet. mem0: The memory layer for ai agents. https://github.com/mem0ai/mem0, 2024

work page 2024
[9]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review arXiv 2024
[10]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

I. Ilin. Advanced rag techniques: An illustrated overview, 2023

work page 2023
[13]

Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations

Jihyoung Jang, Minseong Boo, and Hyounghun Kim. Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations. arXiv preprint arXiv:2310.13420, 2023

work page arXiv 2023
[14]

Active retrieval augmented generation

Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983, 2023

work page arXiv 2023
[15]

Digital Zettelkasten: Principles, Methods, & Examples

David Kadavy. Digital Zettelkasten: Principles, Methods, & Examples. Google Books, May 2021

work page 2021
[16]

Dialsim: A real-time simulator for evaluating long-term multi-party dialogue understanding of conversational agents

Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, and Edward Choi. Dialsim: A real-time simulator for evaluating long-term multi-party dialogue understanding of conversational agents. arXiv preprint arXiv:2406.13144, 2024

work page arXiv 2024
[17]

A human-inspired reading agent with gist memory of very long contexts

Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727, 2024

work page arXiv 2024
[18]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

work page 2020
[19]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

work page 2004
[20]

Ra-dit: Retrieval-augmented dual instruction tuning

Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352, 2023

work page arXiv 2023
[21]

Agentlite: A lightweight library for building and advancing task-oriented llm agent system

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K Choubey, Tian Lan, Jason Wu, Huan Wang, et al. Agentlite: A lightweight library for building and advancing task-oriented llm agent system. arXiv preprint arXiv:2402.15538, 2024

work page arXiv 2024
[22]

Evaluating Very Long-Term Conversational Memory of LLM Agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753, 2024

work page internal anchor Pith review arXiv 2024
[23]

Aios: Llm agent operating system

Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. Aios: Llm agent operating system. arXiv e-prints, pp. arXiv–2403, 2024

work page 2024
[24]

Ret-llm: Towards a general read-write memory for large language models

Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze. Ret-llm: Towards a general read-write memory for large language models. arXiv preprint arXiv:2305.14322, 2023

work page arXiv 2023
[25]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

work page 2002
[27]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

work page 2019
[28]

‘smolagents‘: a smol library to build great agentic systems

Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. ‘smolagents‘: a smol library to build great agentic systems. https://github.com/huggingface/smolagents, 2025

work page 2025
[29]

Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294, 2023

work page arXiv 2023
[30]

From commands to prompts: Llm-based semantic file system for aios

Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, et al. From commands to prompts: Llm-based semantic file system for aios. arXiv preprint arXiv:2410.11843, 2024

work page arXiv 2024
[31]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022

work page internal anchor Pith review arXiv 2022
[32]

Enhancing large language model with self-controlled memory framework

Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, and Zhoujun Li. Enhancing large language model with self-controlled memory framework. arXiv preprint arXiv:2304.13343, 2023

work page arXiv 2023
[33]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024

work page internal anchor Pith review arXiv 2024
[34]

Learning to filter context for retrieval-augmented generation

Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377, 2023

work page arXiv 2023
[35]

Llm-powered autonomous agents

Lilian Weng. Llm-powered autonomous agents. lilianweng.github.io, Jun 2023

work page 2023
[36]

Beyond Goldfish Memory: Long-Term Open-Domain Conversation

J Xu. Beyond goldfish memory: Long-term open-domain conversation. arXiv preprint arXiv:2107.07567, 2021

work page arXiv 2021
[37]

Chain-of-note: Enhancing robustness in retrieval-augmented language models

Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210, 2023

work page arXiv 2023
[38]

Augmentation-adapted retriever improves generalization of language models as generic plug-in

Zichun Yu, Chenyan Xiong, Shi Yu, and Zhiyuan Liu. Augmentation-adapted retriever improves generalization of language models as generic plug-in. arXiv preprint arXiv:2305.17331, 2023

work page arXiv 2023
[39]

Memorybank: Enhancing large language models with long-term memory

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024

work page 2024
[40]

Identifying the most salient keywords (focus on nouns, verbs, and key concepts)

work page
[41]

Extracting core themes and contextual elements

work page
[42]

keywords

Creating relevant categorical tags Format the response as a JSON object: { "keywords": [ // several specific, distinct keywords that capture key concepts and terminology // Order from most to least important // Don’t include keywords that are the name of the speaker or time // At least three keywords, but don’t be too redundant. ], "context": // one sente...

work page
[43]

should_evolve

What specific actions should be taken (strengthen, update_neighbor)? 1.1 If choose to strengthen the connection, which memory should it be connected to? Can you give the updated tags of this memory? 1.2 If choose to update neighbor, you can update the context and tags of these memories based on the understanding of these memories. Tags should be determine...

work page 2023
[44]

• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page 2025
