super hub Canonical reference

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Ion Stoica, Kevin Lin, Sarah Wooders, Shishir G. Patil, Vivian Fang · 2023 · cs.AI · arXiv 2310.08560

Canonical reference. 77% of citing Pith papers cite this work as background.

323 Pith papers citing it

Background 77% of classified citations

open full Pith review browse 323 citing papers more from Charles Packer arXiv PDF

abstract

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicaps their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 36 baseline 3 dataset 3 method 1 other 1

citation-polarity summary

background 34 baseline 3 use dataset 3 support 2 unclear 1 use method 1

claims ledger

abstract Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers i

authors

Charles Packer Ion Stoica Kevin Lin Sarah Wooders Shishir G. Patil Vivian Fang

co-cited works

representative citing papers

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents

cs.CL · 2026-04-17 · unverdicted · novelty 8.0 · 2 refs

MemEvoBench is presented as the first standardized benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback across QA and workflow tasks.

Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator

physics.acc-ph · 2025-09-21 · unverdicted · novelty 8.0

A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.

When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers

cs.DB · 2026-07-01 · unverdicted · novelty 7.0

SOLAR is a learning-augmented policy for semantic cache replacement that achieves constant competitive ratio 3 and 5-75% gains over FIFO on retrieval workloads.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

HyphaeDB: A Living Knowledge Topology for Agent-First Memory

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.

LLM agents security duality: a comprehensive survey of self-security and empowered cybersecurity

cs.CR · 2026-06-26 · unverdicted · novelty 7.0

A survey of LLM agent self-security threats and mitigations alongside their applications in the cybersecurity lifecycle, introducing a synergy concept and empowerment framework.

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

cs.SE · 2026-06-17 · unverdicted · novelty 7.0

StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

cs.AI · 2026-06-17 · unverdicted · novelty 7.0

User facts are internalized as surgical local edits to a hash-keyed Engram memory table with reasoning skill held in a shared adapter, claimed to match LoRA recall, improve indirect reasoning 5.6x on average, and compose across users with 33,000x smaller footprint than per-user adapters.

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

cs.AI · 2026-06-17 · unverdicted · novelty 7.0

RTSGameBench is a new extensible benchmark for VLMs using diverse RTS matchups, diagnostic mini-games targeting individual competencies, and a self-evolving query-to-game generator, with results showing poor VLM performance on tight coordination and large-scale tasks.

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

GateMem benchmark shows no existing memory method for LLM agents achieves strong utility, access control, and reliable forgetting simultaneously in multi-principal shared settings.

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

cs.AI · 2026-06-16 · unverdicted · novelty 7.0

PreAct compiles successful agent executions into verifiable state-machine programs for 8.5-13x faster replay on repeated tasks, with an independent evaluator check before storing each program.

MemTrace: Probing What Final Accuracy Misses in Long-Term Memory

cs.AI · 2026-06-15 · unverdicted · novelty 7.0

MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.

Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems

cs.LG · 2026-06-15 · accept · novelty 7.0

Formalizes four concurrency anomalies in multi-agent LLM systems and mechanically verifies a hierarchy of sound detectors and preventions realized in Rust runtimes using TLA+ and Verus.

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

cs.CL · 2026-06-14 · unverdicted · novelty 7.0

An empirical comparison of thirteen control-plane placements in agent memory pipelines identifies three regimes with complementary forgetting recovery on a new 385-case adversarial benchmark, with mutation-time placement achieving 91.7-93.2% overall.

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

OSL-MR is a learning-augmented framework that casts memory retention as constrained stochastic optimization under partial observability and outperforms heuristic baselines on LoCoMo and LongMemEval.

Self-Harness: Harnesses That Improve Themselves

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.

Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

DCPM reorganizes LLM agent memory into a cognitive hierarchy driven by a synchronous daytime belief writer and an asynchronous nighttime schema engine, reporting gains on cross-session inference benchmarks.

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.

citing papers explorer

Showing 50 of 130 citing papers after filters.

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments cs.AI · 2026-06-04 · unverdicted · none · ref 27 · internal anchor
CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 3 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Self-GC: Self-Governing Context for Long-Horizon LLM Agents cs.AI · 2026-07-01 · unverdicted · none · ref 52 · internal anchor
Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.
CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents cs.AI · 2026-06-29 · unverdicted · none · ref 18 · internal anchor
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
HyphaeDB: A Living Knowledge Topology for Agent-First Memory cs.AI · 2026-06-27 · unverdicted · none · ref 60 · internal anchor
HyphaeDB introduces an agent-native memory system using HNSW topology for gossip-based knowledge propagation, enabling emergent behaviors in multi-agent AI.
User as Engram: Internalizing Per-User Memory as Local Parametric Edits cs.AI · 2026-06-17 · unverdicted · none · ref 59 · internal anchor
User facts are internalized as surgical local edits to a hash-keyed Engram memory table with reasoning skill held in a shared adapter, claimed to match LoRA recall, improve indirect reasoning 5.6x on average, and compose across users with 33,000x smaller footprint than per-user adapters.
RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models cs.AI · 2026-06-17 · unverdicted · none · ref 36 · internal anchor
RTSGameBench is a new extensible benchmark for VLMs using diverse RTS matchups, diagnostic mini-games targeting individual competencies, and a self-evolving query-to-game generator, with results showing poor VLM performance on tight coordination and large-scale tasks.
PreAct: Computer-Using Agents that Get Faster on Repeated Tasks cs.AI · 2026-06-16 · unverdicted · none · ref 32 · internal anchor
PreAct compiles successful agent executions into verifiable state-machine programs for 8.5-13x faster replay on repeated tasks, with an independent evaluator check before storing each program.
MemTrace: Probing What Final Accuracy Misses in Long-Term Memory cs.AI · 2026-06-15 · unverdicted · none · ref 47 · internal anchor
MemTrace shows that evidence utilization, not retrieval, is the dominant failure mode in LLM long-term memory systems across tested configurations.
Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents cs.AI · 2026-06-09 · unverdicted · none · ref 44 · internal anchor
OSL-MR is a learning-augmented framework that casts memory retention as constrained stochastic optimization under partial observability and outperforms heuristic baselines on LoCoMo and LongMemEval.
Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads cs.AI · 2026-06-04 · unverdicted · none · ref 16 · internal anchor
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents cs.AI · 2026-06-04 · unverdicted · none · ref 5 · internal anchor
SubtleMemory benchmark with 1,522 instances over 10 histories shows current memory systems are weak at fine-grained relational discrimination in long-term AI agent interactions.
SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale cs.AI · 2026-06-02 · unverdicted · none · ref 10 · internal anchor
SkillDAG builds a self-evolving typed skill graph that LLM agents query and update at inference time, raising success on ALFWorld and SkillsBench by 12.8 and 8.6 points over graph baselines.
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions cs.AI · 2026-05-26 · unverdicted · none · ref 34 · internal anchor
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
MemFail: Stress-Testing Failure Modes of LLM Memory Systems cs.AI · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
MemFail introduces diagnostic datasets that isolate failure modes in LLM memory systems by testing summarization, storage, and retrieval operations separately.
AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents cs.AI · 2026-05-26 · unverdicted · none · ref 22 · internal anchor
AGORA is an inference-free step-level compressor for LLM agent prompts that retains at least 75% of uncompressed performance in most tested settings where token-level methods collapse due to action-grammar destruction.
EXG: Self-Evolving Agents with Experience Graphs cs.AI · 2026-05-18 · unverdicted · none · ref 19 · internal anchor
EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
MMSkills: Towards Multimodal Skills for General Visual Agents cs.AI · 2026-05-13 · unverdicted · none · ref 22 · 3 links · internal anchor
MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales cs.AI · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory cs.AI · 2026-05-11 · unverdicted · none · ref 29 · internal anchor
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning cs.AI · 2026-05-11 · unverdicted · none · ref 40 · internal anchor
TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory cs.AI · 2026-05-08 · unverdicted · none · ref 40 · internal anchor
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory cs.AI · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates cs.AI · 2026-05-04 · unverdicted · none · ref 13 · internal anchor
In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.
MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing cs.AI · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents cs.AI · 2026-04-21 · unverdicted · none · ref 6 · internal anchor
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
EMBER: Autonomous Cognitive Behaviour from Learned Spiking Neural Network Dynamics in a Hybrid LLM Architecture cs.AI · 2026-04-14 · unverdicted · none · ref 2 · internal anchor
A hybrid SNN-LLM system uses learned spiking dynamics and lateral STDP propagation to trigger LLM actions without external prompts, producing the first autonomous action after 7 exchanges from a clean start.
When to Forget: A Memory Governance Primitive cs.AI · 2026-04-13 · unverdicted · none · ref 5 · internal anchor
Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.
ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents cs.AI · 2026-04-11 · unverdicted · none · ref 31 · internal anchor
ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception cs.AI · 2026-04-06 · unverdicted · none · ref 6 · internal anchor
Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent self-diagnosed bugs and maintained cross-channel context.
SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems cs.AI · 2026-04-06 · unverdicted · none · ref 20 · internal anchor
SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.
WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking cs.AI · 2026-03-28 · unverdicted · none · ref 12 · internal anchor
WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory cs.AI · 2026-01-29 · unverdicted · none · ref 4 · internal anchor
E-mem uses a heterogeneous multi-agent setup for episodic context reconstruction in LLM agents, reaching over 54% F1 on LoCoMo while cutting token cost by over 70% compared to prior methods like GAM.
$How^{2}$: How to learn from procedural How-to questions cs.AI · 2025-10-13 · unverdicted · none · ref 1 · internal anchor
$How^{2}$ is a memory agent framework enabling agents to ask, store, and reuse answers to how-to questions at varying abstraction levels for better lifelong planning in environments like Plancraft.
Episodic-to-Semantic Consolidation Without Identity Drift cs.AI · 2026-07-02 · unverdicted · none · ref 13 · internal anchor
A deterministic episodic-to-semantic consolidation function with a structural lemma proving identity invariance, demonstrated in synthetic experiments on an embodied service agent.
AutoMem: Automated Learning of Memory as a Cognitive Skill cs.AI · 2026-07-01 · unverdicted · none · ref 8 · internal anchor
AutoMem automates memory structure revision and proficiency training in LLMs, delivering 2x-4x performance gains on long-horizon games without altering task-action behavior.
ManimAgent: Self-Evolving Multimodal Agents for Visual Education cs.AI · 2026-06-29 · unverdicted · none · ref 8 · 2 links · internal anchor
ManimAgent improves Manim animation code generation by maintaining a self-growing dual-channel episodic memory of validated successes and failures derived entirely from its own task stream.
Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents cs.AI · 2026-06-21 · unverdicted · none · ref 3 · internal anchor
Context compaction silently drops governance constraints in LLM agents, raising policy violation rates from 0% to 30% on average, with a proposed pinning mitigation restoring compliance.
Recursive Self-Evolving Agents via Held-Out Selection cs.AI · 2026-06-17 · unverdicted · none · ref 14 · internal anchor
RSEA adds a strict held-out keep-better gate to recursive self-evolution of agent artifacts, yielding monotone-safe gains or parity with the base ReAct agent on ALFWorld, GAIA, τ-bench, and WebShop.
Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So cs.AI · 2026-06-16 · unverdicted · none · ref 45 · internal anchor
Flash endurance is priced via shadow price η making placement cost-optimal for any sign of value-write correlation χ, with χ positive only in recurrent long-horizon manipulation and the budget binding only on low-endurance commodity hardware.
GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge cs.AI · 2026-06-12 · unverdicted · none · ref 12 · internal anchor
GitOfThoughts stores agent reasoning as a git repo and shows memory from past problems improves accuracy only when new problems are nearly identical (cosine similarity >0.8), with self-consistency providing the main gain on novel tasks.
Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory cs.AI · 2026-06-11 · unverdicted · none · ref 11 · internal anchor
A learned linear multi-factor value model over seven cognitive psychology factors retains 0.770 gold evidence on LongMemEval blind regime versus 0.368 for recency and 0.518 for best single factor.
Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce cs.AI · 2026-06-11 · unverdicted · none · ref 15 · internal anchor
A modular two-agent simulation framework enables controlled comparison of conversational e-commerce responders, showing rolling-window memory outperforms intent extraction and targeted fixes reduce failures by 62%.
Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents cs.AI · 2026-06-10 · unverdicted · none · ref 24 · internal anchor
HORMA builds a hierarchical memory structure from agent experiences and trains a lightweight RL navigator to retrieve minimal sufficient context, yielding better task performance with at most 22.17% of baseline token usage on ALFWorld, LoCoMo, and LongMemEval.
SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows cs.AI · 2026-06-06 · unverdicted · none · ref 30 · internal anchor
SKILL.nb uses selective formalization and gate-conditioned execution in auditable notebooks to improve durability of agent workflows, achieving 53.7% success on WebArena-Verified with 91.7% retention across re-executions.
TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management cs.AI · 2026-06-04 · unverdicted · none · ref 2 · internal anchor
TokenMizer builds a knowledge graph of LLM sessions and serializes it into 78-token resume blocks that retain more task, decision, and file information than flat-text baselines at roughly half the token cost.
Scaling Self-Evolving Agents via Parametric Memory cs.AI · 2026-06-03 · unverdicted · none · ref 11 · internal anchor
TMEM lets LLM agents evolve their policy mid-episode by absorbing distilled supervision into online LoRA updates, outperforming summary and retrieval baselines on several long-context benchmarks.
SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision cs.AI · 2026-05-31 · unverdicted · none · ref 32 · internal anchor
SkillRevise iteratively refines initial LLM-generated agent skills using execution traces to diagnose defects and apply repairs, raising success rates from 36.05% to 61.63% on SkillsBench across three benchmarks and five LLMs.
Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison cs.AI · 2026-05-28 · conditional · none · ref 27 · internal anchor
Introduces a benchmark with 34,560 instances for selective QA over conflicting multi-source personal memory and compares fusion methods against LLMs.
GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents cs.AI · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
GRASP adds a regression-aware acceptance gate to skill proposal for LLM agents, producing large gains on clinical benchmarks while preventing silent regressions on prior behavior.

MemGPT: Towards LLMs as Operating Systems

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer