pith. machine review for the scientific record.

arxiv: 2310.08560 · v2 · submitted 2023-10-12 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Ion Stoica, Joseph E. Gonzalez, Kevin Lin, Sarah Wooders, Shishir G. Patil, Vivian Fang

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 12:23 UTC · model grok-4.3

classification 💻 cs.AI
keywords large language models · memory management · context window · virtual memory · document analysis · conversational agents · operating systems

The pith

MemGPT treats an LLM like an operating system that moves data between memory tiers to work with contexts larger than its fixed window.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a technique called virtual context management that borrows from how traditional operating systems use fast and slow memory layers to create the illusion of abundant resources. MemGPT applies this idea so the model itself decides what to keep in its immediate context, what to store elsewhere, and when to pause for user input via interrupts. This setup lets the LLM analyze documents far longer than its native window allows and sustain multi-turn conversations that evolve over many sessions without dropping prior details. A reader would care because fixed context limits currently block LLMs from handling realistic workloads like full-book review or ongoing personal assistance.
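The paging analogy above can be made concrete with a minimal sketch: a fixed-capacity "main context" (the fast, in-window tier) backed by an unbounded archival store (the slow tier), with eviction on overflow and a recall path back in. This is an illustrative toy, not the paper's implementation; the names `MemoryManager`, `archive`, and `recall`, and the whitespace token count, are assumptions for the sketch.

```python
# Minimal sketch of virtual context management: a bounded fast tier
# backed by an unbounded slow tier. Illustrative only.
from collections import deque

class MemoryManager:
    def __init__(self, window_tokens: int):
        self.window_tokens = window_tokens   # the LLM's fixed context budget
        self.main_context = deque()          # fast tier: what the model sees
        self.archive = []                    # slow tier: out-of-window storage
        self.used = 0

    def insert(self, text: str) -> None:
        """Add text to main context, evicting oldest items to archive on overflow."""
        tokens = len(text.split())           # crude stand-in for a tokenizer
        self.main_context.append((text, tokens))
        self.used += tokens
        while self.used > self.window_tokens:
            old_text, old_tokens = self.main_context.popleft()
            self.archive.append(old_text)    # evicted, but not lost
            self.used -= old_tokens

    def recall(self, query: str) -> list[str]:
        """Search the slow tier; matches can be paged back into main context."""
        return [t for t in self.archive if query.lower() in t.lower()]
```

In MemGPT proper the model itself issues these insert/recall operations as function calls; here the policy is hard-coded to keep the sketch self-contained.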

Core claim

By organizing memory into hierarchical tiers and letting the LLM control movement between them, MemGPT supplies the functional equivalent of an arbitrarily large context window while still operating inside the model's actual limit. The system also uses interrupts to shift control between the model and the user, allowing dynamic, long-running interactions in two evaluated settings: document analysis of texts that exceed the base context size and multi-session chat agents that retain, reflect on, and update knowledge across separate conversations.

What carries the argument

Virtual context management, the mechanism that moves information between fast (in-window) and slow (out-of-window) tiers under the LLM's own control.

If this is right

  • Document analysis tasks become possible for texts many times longer than the underlying model's context window.
  • Conversational agents can maintain coherent state across dozens of separate sessions instead of resetting at each new window.
  • Control flow between model and user can be managed through explicit interrupts rather than one-shot prompting.
  • The same tiered-memory pattern can be applied to any task that needs more context than the base LLM supplies.
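The first implication above, document analysis beyond the window, reduces to paging chunks through a fixed window while notes persist outside it. A hedged sketch follows; `extract_notes` is a hypothetical placeholder for an LLM pass over one in-window chunk, and the line-based "window" is a simplification.

```python
# Sketch: analyze a document far longer than the window by paging
# fixed-size chunks through it and persisting notes out-of-window.
def extract_notes(chunk: str, keyword: str) -> list[str]:
    # Placeholder for an LLM call over one in-window chunk.
    return [line for line in chunk.splitlines() if keyword in line]

def analyze_long_document(document: str, window_lines: int, keyword: str) -> list[str]:
    """Process a long document by streaming chunks through a bounded window."""
    lines = document.splitlines()
    notes = []  # persists across chunks, i.e. lives outside the window
    for start in range(0, len(lines), window_lines):
        chunk = "\n".join(lines[start:start + window_lines])
        notes.extend(extract_notes(chunk, keyword))
    return notes
```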

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design opens the possibility of building persistent agents whose knowledge base grows continuously without manual context curation.
  • Similar tier-management logic could be tested on other constrained resources, such as tool-use histories or external knowledge bases.
  • If the self-managed interrupts prove stable, the approach might reduce the need for human-crafted prompts that explicitly summarize prior turns.

Load-bearing premise

The LLM can correctly decide on its own which pieces of information belong in which memory tier and when to trigger an interrupt without dropping or misinterpreting essential facts.

What would settle it

A controlled run in which a key fact from a long document is moved to a lower tier and then the model is asked a direct question that requires that fact; failure to retrieve or apply it correctly would show the memory-management logic is unreliable.
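The probe described above can be sketched as a small harness: plant a key fact, force it out of the fast tier with filler, then check whether a retrieval step recovers it. All names are illustrative, and in a real run an LLM would sit behind `answer`; here retrieval is a string search so the harness is runnable.

```python
# Sketch of the proposed falsification test: evict a planted fact,
# then probe whether it can still be retrieved and applied.
def run_eviction_probe(filler_items: int) -> bool:
    key_fact = "the launch code is 7391"
    window, archive = [], []
    window_capacity = 4

    def insert(item: str) -> None:
        window.append(item)
        while len(window) > window_capacity:
            archive.append(window.pop(0))   # evict oldest to the slow tier

    insert(key_fact)
    for i in range(filler_items):           # push the fact out of the window
        insert(f"filler note {i}")

    def answer(question: str) -> str:
        # Retrieval step: search both tiers; failure here is the probe's signal.
        for item in window + archive:
            if "launch code" in item:
                return item.split()[-1]
        return "unknown"

    evicted = key_fact not in window
    return evicted and answer("what is the launch code?") == "7391"
```

The probe only counts when the fact was actually evicted; a run with too little filler tests nothing, which is why the return value requires both conditions.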

read the original abstract

Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversations and document analysis. To enable using context beyond limited context windows, we propose virtual context management, a technique drawing inspiration from hierarchical memory systems in traditional operating systems that provide the appearance of large memory resources through data movement between fast and slow memory. Using this technique, we introduce MemGPT (Memory-GPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM's limited context window, and utilizes interrupts to manage control flow between itself and the user. We evaluate our OS-inspired design in two domains where the limited context windows of modern LLMs severely handicap their performance: document analysis, where MemGPT is able to analyze large documents that far exceed the underlying LLM's context window, and multi-session chat, where MemGPT can create conversational agents that remember, reflect, and evolve dynamically through long-term interactions with their users. We release MemGPT code and data for our experiments at https://memgpt.ai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MemGPT, a system that applies virtual context management—modeled on OS hierarchical memory—to allow LLMs to operate over contexts larger than their fixed window. The LLM itself decides data movement between a limited main-memory tier (the context window) and slower external storage, while interrupts handle control flow with the user. Evaluations are described for two settings: analysis of documents exceeding the model’s context length and multi-session chat agents that maintain, reflect on, and evolve state across interactions. Code and data are released.

Significance. If the LLM-driven tier management proves reliable, the approach supplies a practical, model-agnostic route to long-context capabilities without retraining or larger windows. The OS analogy yields a reusable design pattern for agentic systems. Releasing code and data is a clear strength that aids follow-on work.

major comments (2)
  1. [§4] §4 (Document Analysis Experiments): The manuscript reports that MemGPT can process documents far larger than the base LLM’s context window, yet supplies no quantitative metrics on retrieval precision, incorrect eviction rate, or context-loss frequency across tier movements. Without these, it is impossible to determine whether the claimed performance stems from successful virtual-context management or from the underlying LLM’s robustness to partial information.
  2. [§3, §5] §3 (System Design) and §5 (Multi-Session Chat): The central assumption that the LLM can “intelligently” decide what to move between memory tiers and when to issue interrupts is stated without accompanying error analysis, prompt templates, or ablation on decision accuracy. If these decisions systematically drop critical facts, the virtual-context guarantee does not hold even when final-task accuracy appears high.
minor comments (2)
  1. [Abstract, §1] The abstract and introduction use the phrase “intelligently manages” without defining the term; a short operational definition or reference to the decision procedure would improve clarity.
  2. [Figure 1] Figure 1 (memory hierarchy diagram) would benefit from explicit labels for each tier’s capacity and access latency relative to the LLM context window.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of evaluation and system transparency that we will address in the revision. Below we respond point-by-point to the major comments.

read point-by-point responses
  1. Referee: [§4] §4 (Document Analysis Experiments): The manuscript reports that MemGPT can process documents far larger than the base LLM’s context window, yet supplies no quantitative metrics on retrieval precision, incorrect eviction rate, or context-loss frequency across tier movements. Without these, it is impossible to determine whether the claimed performance stems from successful virtual-context management or from the underlying LLM’s robustness to partial information.

    Authors: We agree that internal metrics on the memory management layer would provide stronger evidence for the efficacy of virtual context management. The current experiments emphasize end-to-end task accuracy on document QA because that is the practical capability being demonstrated. In the revised manuscript we will add quantitative analysis of retrieval behavior: we will log all function calls during document processing, compute precision/recall against ground-truth relevant passages where available, and report statistics on eviction frequency and observed context-loss events. We will also include a short discussion of cases where the LLM’s tier decisions appear suboptimal. revision: partial

  2. Referee: [§3, §5] §3 (System Design) and §5 (Multi-Session Chat): The central assumption that the LLM can “intelligently” decide what to move between memory tiers and when to issue interrupts is stated without accompanying error analysis, prompt templates, or ablation on decision accuracy. If these decisions systematically drop critical facts, the virtual-context guarantee does not hold even when final-task accuracy appears high.

    Authors: The prompt templates governing memory management and interrupt decisions are provided in the appendix; we will move the key templates into the main text of §3 for visibility. We acknowledge the absence of explicit error analysis or ablations on decision quality. In revision we will add (1) an ablation comparing MemGPT against a baseline that performs random or FIFO eviction instead of LLM-driven decisions, and (2) a qualitative error analysis section that examines representative decision traces, notes observed failure modes (e.g., premature eviction of facts later needed), and discusses how often such errors affect final answer quality. Because no oracle exists for optimal memory contents, the analysis will necessarily be qualitative and example-driven rather than exhaustive. revision: partial
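The ablation the rebuttal promises, FIFO or random eviction versus decision-driven eviction, could be harnessed roughly as follows. The `KEY:` prefix is a stand-in for the LLM's importance judgment, and all names are illustrative rather than taken from the paper.

```python
# Sketch of the proposed eviction ablation: FIFO vs. random vs. a
# decision-driven policy that prefers to keep items marked important.
import random

def fill_memory(items, capacity, policy):
    """Insert items into a bounded window, evicting per `policy` on overflow."""
    window = []
    for item in items:
        window.append(item)
        if len(window) > capacity:
            if policy == "fifo":
                window.pop(0)                    # baseline: evict oldest
            elif policy == "random":
                window.pop(random.randrange(len(window)))
            elif policy == "llm":
                # Decision-driven stand-in: evict the oldest unimportant item.
                idx = next((i for i, it in enumerate(window)
                            if not it.startswith("KEY:")), 0)
                window.pop(idx)
    return window

items = ["KEY: user is vegetarian"] + [f"small talk {i}" for i in range(10)]
kept_fifo = fill_memory(items, capacity=3, policy="fifo")
kept_llm = fill_memory(items, capacity=3, policy="llm")
```

The point of the comparison is exactly the referee's concern: if the decision-driven policy does not retain critical facts better than FIFO or random, the "intelligent management" claim loses its support.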

Circularity Check

0 steps flagged

No circularity: engineering system with independent empirical validation

full rationale

The paper presents MemGPT as an applied system for hierarchical memory management in LLMs, inspired by OS concepts but implemented and evaluated directly through experiments on document analysis and multi-session chat. No equations, derivations, or predictions appear that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Central claims rest on observable task performance rather than tautological logic, grounding the work in external benchmarks rather than internal self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the LLM being able to perform reliable memory management decisions and on the interrupt mechanism functioning without external supervision.

axioms (1)
  • domain assumption LLMs can follow complex instructions to manage memory tiers and control flow
    Invoked when describing how MemGPT intelligently manages memory and uses interrupts.
invented entities (1)
  • virtual context management · no independent evidence
    purpose: To provide the appearance of large context by moving data between memory tiers
    New technique introduced in the paper to overcome limited context windows.

pith-pipeline@v0.9.0 · 5500 in / 1182 out tokens · 32376 ms · 2026-05-10T12:23:28.973649+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

    cs.CR 2026-05 unverdicted novelty 8.0

    ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.

  2. Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

    cs.CR 2026-05 unverdicted novelty 8.0

    Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...

  3. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  4. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

    cs.CL 2026-05 unverdicted novelty 7.0

    LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

  5. MEME: Multi-entity & Evolving Memory Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    All tested LLM memory systems fail at dependency reasoning in multi-entity evolving scenarios, with only an expensive file-based setup showing partial recovery.

  6. PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments

    cs.RO 2026-05 unverdicted novelty 7.0

    PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.

  7. EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

    cs.AI 2026-05 unverdicted novelty 7.0

    EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.

  8. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  9. TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.

  10. Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...

  11. When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

  12. MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of basel...

  13. Stateful Agent Backdoor

    cs.CR 2026-05 unverdicted novelty 7.0

    A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.

  14. Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting

    cs.CL 2026-05 unverdicted novelty 7.0

    Telegraph English compresses prompts via structured symbolic rewriting into atomic facts, achieving roughly 50% token reduction with 99.1% key-fact accuracy on LongBench-v2 and outperforming token-deletion baselines a...

  15. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  16. Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    cs.AI 2026-05 unverdicted novelty 7.0

    In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...

  17. MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing

    cs.AI 2026-05 unverdicted novelty 7.0

    MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.

  18. SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

    cs.CR 2026-05 unverdicted novelty 7.0

    SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across mul...

  19. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  20. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  21. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  22. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

  23. MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    MemEvoBench is the first benchmark for long-horizon memory safety in LLM agents, using QA tasks across 7 domains and 36 risks plus workflow tasks with noisy tools to measure behavioral drift from biased memory updates.

  24. GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

    cs.CL 2026-04 conditional novelty 7.0

    GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.

  25. vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents

    cs.IR 2026-04 conditional novelty 7.0

    vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BE...

  26. SAGER: Self-Evolving User Policy Skills for Recommendation Agent

    cs.IR 2026-04 unverdicted novelty 7.0

    SAGER equips LLM recommendation agents with per-user evolving policy skills via two-representation architecture, contrastive CoT diagnosis, and skill-augmented listwise reasoning, yielding SOTA gains orthogonal to mem...

  27. IE as Cache: Information Extraction Enhanced Agentic Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.

  28. EMBER: Autonomous Cognitive Behaviour from Learned Spiking Neural Network Dynamics in a Hybrid LLM Architecture

    cs.AI 2026-04 unverdicted novelty 7.0

    A hybrid SNN-LLM system uses learned spiking dynamics and lateral STDP propagation to trigger LLM actions without external prompts, producing the first autonomous action after 7 exchanges from a clean start.

  29. When to Forget: A Memory Governance Primitive

    cs.AI 2026-04 unverdicted novelty 7.0

    Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.

  30. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.

  31. Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception

    cs.AI 2026-04 unverdicted novelty 7.0

    Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...

  32. SuperLocalMemory V3.3: The Living Brain -- Biologically-Inspired Forgetting, Cognitive Quantization, and Multi-Channel Retrieval for Zero-LLM Agent Memory Systems

    cs.AI 2026-04 unverdicted novelty 7.0

    SuperLocalMemory V3.3 implements a cognitive memory taxonomy with mathematical forgetting and multi-channel retrieval, reaching 70.4% on LoCoMo in zero-LLM mode.

  33. MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration

    cond-mat.mtrl-sci 2026-04 conditional novelty 7.0

    MatClaw is a code-first LLM agent that autonomously executes end-to-end materials workflows by generating and running Python scripts on remote clusters, achieving reliable code generation via memory architecture and R...

  34. WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

    cs.AI 2026-03 unverdicted novelty 7.0

    WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.

  35. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  36. Cognifold: Always-On Proactive Memory via Cognitive Folding

    cs.AI 2026-05 unverdicted novelty 6.0

    Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...

  37. PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    PRISM achieves higher accuracy than baselines on long-horizon agent tasks at an order-of-magnitude smaller context budget by combining hierarchical bundle search, query-sensitive costing, evidence compression, and ada...

  38. An Annotation Scheme and Classifier for Personal Facts in Dialogue

    cs.CL 2026-05 accept novelty 6.0

    An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 ...

  39. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  40. Self-Consolidating Language Models: Continual Knowledge Incorporation from Context

    cs.CL 2026-05 unverdicted novelty 6.0

    SCoL lets LLMs self-generate sparse layer updates via meta-RL to consolidate knowledge from context, outperforming prompting and fine-tuning baselines on QA and long-context tasks while aligning updates with high-Fish...

  41. Self-Consolidating Language Models: Continual Knowledge Incorporation from Context

    cs.CL 2026-05 unverdicted novelty 6.0

    SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.

  42. Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall

    cs.CL 2026-05 conditional novelty 6.0

    True Memory is a verbatim-event retrieval pipeline running on a single SQLite file that reaches 93% accuracy on LoCoMo multi-session questions, outperforming Mem0, Supermemory, Zep, and matching or exceeding EverMemOS...

  43. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  44. ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

    cs.AI 2026-05 unverdicted novelty 6.0

    ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.

  45. MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MEMTIER delivers 38% accuracy on the 500-question LongMemEval-S benchmark with a 7B model on 6GB GPU, a 33-point gain over full-context baselines, via structured episodic memory, five-signal retrieval, and semantic co...

  46. Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly

    cs.CR 2026-05 unverdicted novelty 6.0

    Policy directives can be lost during context assembly in language model agents, leading to unprompted policy violations that SafeContext can partially prevent.

  47. MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.

  48. AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents

    cs.CR 2026-04 conditional novelty 6.0

    AgentWard organizes stage-specific security controls with cross-layer coordination to intercept threats across the full lifecycle of autonomous AI agents.

  49. Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...

  50. StructMem: Structured Memory for Long-Horizon Behavior in LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    StructMem is a structure-enriched hierarchical memory system that improves temporal reasoning and multi-hop QA on LoCoMo while cutting token usage, API calls, and runtime versus prior flat or graph-based memories.

  51. Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.

  52. Stateless Decision Memory for Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...

  53. HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoM...

  54. AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    AnchorMem decouples atomic fact anchors and associative event graphs for retrieval from preserved raw interaction contexts, outperforming prior memory methods on the LoCoMo benchmark.

  55. SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weigh...

  56. GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

    cs.CL 2026-04 unverdicted novelty 6.0

    GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.

  57. Visual Inception: Compromising Long-term Planning in Agentic Recommenders via Multimodal Memory Poisoning

    cs.CR 2026-04 unverdicted novelty 6.0

    Visual Inception poisons images to hijack long-term memory in agentic recommenders and steer planning, while CognitiveGuard reduces success to about 10% via perceptual sanitization and reasoning verification.

  58. When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.

  59. APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI

    cs.CL 2026-04 unverdicted novelty 6.0

    APEX-MEM uses property graphs with temporal events, append-only storage, and an agentic retrieval system to reach 88.88% accuracy on LOCOMO QA and 86.2% on LongMemEval, outperforming prior session-aware methods.

  60. AgentSPEX: An Agent SPecification and EXecution Language

    cs.CL 2026-04 unverdicted novelty 6.0

    AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 105 Pith papers · 14 internal anchors

  1. [1]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

  3. [3]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.

  4. [4]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

  5. [5]

    Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

  6. [6]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

  7. [7]

    A Survey on Long Text Modeling with Transformers

    Zican Dong, Tianyi Tang, Lunyi Li, and Wayne Xin Zhao. A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502, 2023.

  8. [8]

    Leveraging passage retrieval with generative models for open domain question answering

    Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282, 2020.

  9. [9]

    Unsupervised Dense Information Retrieval with Contrastive Learning

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2021.

  10. [10]

    Active Retrieval Augmented Generation

    Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983, 2023.

  11. [11]

    Dense Passage Retrieval for Open-Domain Question Answering

    Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.

  12. [12]

    Reformer: The Efficient Transformer

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

  13. [13]

    Lost in the Middle: How Language Models Use Long Contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023a. Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating ll...

  14. [14]

    Generative Agents: Interactive Simulacra of Human Behavior

    Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.

  15. [15]

    A case for redundant arrays of inexpensive disks (raid)

    David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pp. 109–116, 1988.

  16. [16]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.

  17. [17]

    In-Context Retrieval-Augmented Language Models

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, 2023.

  18. [18]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

  19. [19]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  20. [20]

    Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

    H. Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022.

  21. [21]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

  22. [22]

    Beyond Goldfish Memory: Long-Term Open-Domain Conversation

    Jing Xu, Arthur Szlam, and Jason Weston. Beyond goldfish memory: Long-term open-domain conversation. arXiv preprint arXiv:2107.07567, 2021.

  23. [23]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

  24. [24]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.