MemReader: From Passive to Active Extraction for Long-Term Agent Memory
Pith reviewed 2026-05-10 17:07 UTC · model grok-4.3
The pith
Reasoning-driven selective extraction builds cleaner long-term memory for agents than passive transcription from dialogue.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemReader shifts memory population from passive one-shot transcription to active, reasoning-based decisions. The 4B model evaluates whether incoming information is valuable, unambiguous, and complete, then selectively writes structured entries, defers incomplete inputs, retrieves historical context, or discards chatter. Optimized via Group Relative Policy Optimization and tested under ReAct-style operation, it outperforms prior extraction baselines on LOCOMO, LongMemEval, and HaluMem, particularly on tasks requiring knowledge updates, temporal consistency, and hallucination reduction. The work concludes that effective long-term agent memory depends on reasoning-driven selectivity rather than exhaustive passive transcription.
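Group Relative Policy Optimization, named above as the training method, dispenses with a learned value network by normalizing each sampled rollout's reward against the statistics of its own sampling group. A minimal sketch of that group-relative advantage (standard GRPO, not MemReader-specific code):

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages: normalize each rollout's reward by the
    mean and population standard deviation of its own sampling group, so
    no separate critic is needed to estimate a baseline."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0:  # all rollouts tied: no learning signal for this group
        return [0.0 for _ in group_rewards]
    return [(r - mean) / std for r in group_rewards]
```

In MemReader-4B's case the rewards would score a full write/defer/retrieve/discard trajectory; the advantage then weights the policy-gradient update for every token of that rollout.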
What carries the argument
The active decision loop in MemReader-4B that assesses information value, reference ambiguity, and completeness before selecting among write, defer, retrieve, or discard actions.
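The loop can be caricatured as a gate over three judgments. The action names below mirror the paper's four actions; the boolean conditions are illustrative stand-ins for what MemReader-4B implements as learned reasoning over the dialogue:

```python
from enum import Enum

class Action(Enum):
    WRITE = "write"        # commit a structured memory entry now
    RETRIEVE = "retrieve"  # pull historical context to resolve references
    DEFER = "defer"        # buffer the input until future turns complete it
    DISCARD = "discard"    # drop irrelevant chatter

def decide(valuable: bool, ambiguous: bool, complete: bool) -> Action:
    """Illustrative gate over the three judgments the model is described as
    making; the real policy reasons over raw dialogue rather than taking
    precomputed booleans."""
    if not valuable:
        return Action.DISCARD
    if ambiguous:      # unresolved pronouns or references to past sessions
        return Action.RETRIEVE
    if not complete:   # topic opened but details not yet unfolded
        return Action.DEFER
    return Action.WRITE
```

The ordering matters: value screening comes first so that small talk is never buffered, and ambiguity resolution precedes deferral so that resolvable references trigger retrieval rather than an indefinite wait.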
If this is right
- Agents maintain consistent knowledge across sessions by writing only verified, non-ambiguous facts.
- Hallucinations decrease because incomplete or cross-turn dependent information is deferred rather than stored.
- Memory remains low-noise and dynamically updatable instead of accumulating conflicting entries over time.
- Real-world systems gain from selective writes that reduce storage overhead while preserving necessary context.
- Integration into agent frameworks enables memory that evolves without manual cleanup.
Where Pith is reading between the lines
- The same active-reasoning pattern could be applied to other agent modules such as tool selection or plan revision to enforce selectivity across operations.
- Combining the decision process with retrieval-augmented generation might further strengthen temporal reasoning by pulling prior context before deciding to write.
- Latency introduced by the explicit reasoning step may require separate optimization when agents operate under strict response-time constraints.
- Extending the approach beyond dialogue to streams of documents or sensor data could test whether the value-assessment mechanism generalizes to non-conversational inputs.
Load-bearing premise
Benchmark improvements on knowledge updating and temporal tasks will carry over to lower memory pollution and better real-world agent behavior without introducing new failure modes from the added decision process.
What would settle it
A controlled deployment where agents using MemReader-4B are run for thousands of turns in open-ended dialogue and then measured for memory inconsistency rates or temporal query accuracy against passive baselines.
Original abstract
Long-term memory is fundamental for personalized and autonomous agents, yet populating it remains a bottleneck. Existing systems treat memory extraction as a one-shot, passive transcription from context to structured entries, which struggles with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. In this paper, we introduce the MemReader family for active long-term memory extraction in agent systems: MemReader-0.6B, a compact and cost-efficient passive extractor distilled for accurate and schema-consistent structured outputs, and MemReader-4B, an active extractor optimized with Group Relative Policy Optimization (GRPO) to make memory writing decisions. Under a ReAct-style paradigm, MemReader-4B explicitly evaluates information value, reference ambiguity, and completeness before acting, and can selectively write memories, defer incomplete inputs, retrieve historical context, or discard irrelevant chatter. Experiments on LOCOMO, LongMemEval, and HaluMem show that MemReader consistently outperforms existing extraction-based baselines. In particular, MemReader-4B achieves state-of-the-art performance on tasks involving knowledge updating, temporal reasoning, and hallucination reduction. These results suggest that effective agent memory requires not merely extracting more information, but performing reasoning-driven and selective memory extraction to build low-noise and dynamically evolving long-term memory. Furthermore, MemReader has been integrated into MemOS and is being deployed in real-world applications. To support future research and adoption, we release the models and provide public API access.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the MemReader family for active long-term memory extraction in agent systems. MemReader-0.6B is a compact passive extractor distilled for schema-consistent structured outputs from dialogue, while MemReader-4B is an active extractor trained with Group Relative Policy Optimization (GRPO) under a ReAct-style paradigm. The active model explicitly evaluates information value, reference ambiguity, and completeness to decide among write, defer, retrieve, or discard actions. Experiments on LOCOMO, LongMemEval, and HaluMem report that MemReader-4B achieves state-of-the-art results on knowledge updating, temporal reasoning, and hallucination reduction, leading to the conclusion that effective agent memory requires reasoning-driven selective extraction rather than passive transcription to reduce pollution and inconsistency. The models are released with public API access and integrated into MemOS for real-world use.
Significance. If validated, the distinction between passive and active extraction could meaningfully advance long-term memory architectures for autonomous agents by prioritizing low-noise, dynamically maintained memory over exhaustive transcription. The open release of models and reported deployment in MemOS provide immediate community value and reproducibility. The work supplies a concrete mechanism (GRPO-driven decisions on value/ambiguity/completeness) that could be tested in other agent frameworks.
major comments (2)
- [Experimental Results] The central claim that MemReader-4B's active decisions produce lower-pollution memory than passive extraction (and that selective extraction is required) rests on benchmark task accuracy alone. No analysis is provided of the GRPO policy's decision correctness, such as precision/recall of memory writes against human-annotated important facts or error rates on cross-turn references and incomplete inputs. This is load-bearing because final-task SOTA scores could arise from the 4B base model, better schema adherence, or other factors rather than the active mechanism avoiding new failure modes like under-extraction.
- [Experiments] The manuscript reports SOTA performance on LOCOMO, LongMemEval, and HaluMem but provides insufficient detail on experimental setup, exact baselines, statistical significance testing, ablation studies isolating the active decision component from model scale (0.6B vs 4B), and error analysis. These omissions prevent assessment of whether the active policy genuinely improves memory quality without introducing misses that degrade downstream agent performance.
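The decision-correctness analysis the first comment asks for is cheap to specify. A sketch of write-action precision and recall against human annotations (this defines the metric over hypothetical labels; it is not the paper's actual evaluation):

```python
def write_precision_recall(decisions, annotations):
    """decisions: per-utterance booleans, True if the policy chose to write.
    annotations: per-utterance booleans, True if human raters marked the
    utterance as containing an important fact worth storing."""
    tp = sum(d and a for d, a in zip(decisions, annotations))
    fp = sum(d and not a for d, a in zip(decisions, annotations))
    fn = sum(a and not d for d, a in zip(decisions, annotations))
    precision = tp / (tp + fp) if tp + fp else 0.0  # writes that were warranted
    recall = tp / (tp + fn) if tp + fn else 0.0     # important facts captured
    return precision, recall
```

Low recall here would directly quantify the under-extraction failure mode the comment warns about, which task-level SOTA scores can mask.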
minor comments (2)
- [Abstract] The abstract states that MemReader 'consistently outperforms existing extraction-based baselines' without naming the primary baselines or their key differences; adding this would aid immediate comprehension.
- [Methods] Notation for the four actions (write/defer/retrieve/discard) and the three evaluation criteria (value/ambiguity/completeness) should be introduced with explicit definitions or a table in the methods section for clarity.
Simulated Author's Rebuttal
We appreciate the referee's insightful comments on our work. We provide detailed responses to the major comments and have updated the manuscript with additional experiments, analyses, and clarifications to address the concerns raised.
Point-by-point responses
Referee: [Experimental Results] The central claim that MemReader-4B's active decisions produce lower-pollution memory than passive extraction (and that selective extraction is required) rests on benchmark task accuracy alone. No analysis is provided of the GRPO policy's decision correctness, such as precision/recall of memory writes against human-annotated important facts or error rates on cross-turn references and incomplete inputs. This is load-bearing because final-task SOTA scores could arise from the 4B base model, better schema adherence, or other factors rather than the active mechanism avoiding new failure modes like under-extraction.
Authors: We acknowledge that direct validation of the GRPO policy decisions would strengthen the evidence for the active mechanism. Our primary evaluation focuses on downstream agent performance because that is the ultimate objective for long-term memory systems. In the revised manuscript, we have added a human annotation study on a sampled set of decisions, reporting precision and recall for write actions against important facts, as well as categorized error rates for defer, retrieve, and discard actions including cross-turn reference handling. We also include an ablation that replaces the learned policy with an always-write baseline on the same 4B model, showing degraded performance and thereby isolating the contribution of selective decisions.
Revision: yes
Referee: [Experiments] The manuscript reports SOTA performance on LOCOMO, LongMemEval, and HaluMem but provides insufficient detail on experimental setup, exact baselines, statistical significance testing, ablation studies isolating the active decision component from model scale (0.6B vs 4B), and error analysis. These omissions prevent assessment of whether the active policy genuinely improves memory quality without introducing misses that degrade downstream agent performance.
Authors: We agree that greater experimental transparency is needed. The revised manuscript substantially expands the Experiments section with complete details on setups, hyperparameters, prompt formats, and exact baseline reproductions. We now include statistical significance testing via paired t-tests with reported p-values for all primary results. Ablations have been added to isolate the active decision component, including same-scale (4B) passive versus active comparisons and scale-controlled variants. Error analysis on knowledge updating and temporal reasoning tasks has been incorporated to quantify any under-extraction effects and confirm they do not degrade overall agent performance.
Revision: yes
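The paired t-tests proposed in this response operate on per-question score differences between two systems evaluated on identical benchmark items. A self-contained sketch of the statistic (p-values would then come from the t distribution with the returned degrees of freedom; assumes the differences are not all identical):

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Paired t statistic for per-item scores of two systems evaluated on
    the same benchmark questions. Returns (t, degrees of freedom)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample std dev of the paired differences
    t = mean / (sd / math.sqrt(n))
    return t, n - 1
```

Pairing on identical items is what makes the test appropriate here: it controls for per-question difficulty, which an unpaired comparison of aggregate accuracies would ignore.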
Circularity Check
No circularity: purely empirical claims with no derivations or self-referential reductions
Full rationale
The paper contains no equations, derivations, fitted parameters, or first-principles results that could reduce to their own inputs. All performance claims rest on direct comparisons against external benchmarks (LOCOMO, LongMemEval, HaluMem) using standard evaluation metrics. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises; the suggestion that active extraction is required follows from the reported SOTA scores rather than from any definitional or fitted equivalence. The work is evaluated solely against external benchmarks and exhibits none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger: empty — the work is empirical, with no axioms or fitted free parameters to record.
Forward citations
Cited by 3 Pith papers
- MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents — replaces privacy-sensitive spans with structured placeholders on edge devices to enable effective cloud memory management while limiting utility loss to 1.6% and outperforming general models on privacy extraction.
Appendix A (excerpt): prompt templates
The paper's appendix provides the practical prompt templates used by MemReader-0.6B and MemReader-4B for memory extraction. Recoverable instructions include:
- Source filtering: if a message is from the user, extract user-relevant memories; if it is from the assistant, extract only factual memories the user acknowledged or responded to.
- Reference resolution: convert relative time expressions (e.g., "yesterday," "next Friday") into absolute dates using the message timestamp; clearly distinguish event time from message time; state uncertainty explicitly (e.g., "around June 2025," "exact date unclear").
- Perspective: always write from a third-person perspective, referring to the user as "The user" or by name, e.g., "The user felt exhausted..." rather than "I felt exhausted...".
- Completeness: include all key experiences, thoughts, emotional responses, and plans, even if they seem minor; prioritize completeness and fidelity over conciseness; do not generalize or skip personally meaningful details.
- Content policy: avoid content that violates national laws and regulations or involves politically sensitive information.
- Output format: a single valid JSON object containing a "memory list" of entries, each with a "key" (a unique, concise memory title), a "memory_type" (either "LongTermMemory" or "UserMemory"), and a "value".
- Action space for the active extractor: add[] (information is complete and important; extract immediately), search[query] (review history when the dialogue contains pronouns, specific references such as "last time" or "that project", or implicit background likely resolvable from past records), buffer[reason] (wait when a new topic's details are not yet unfolded, e.g., "I plan to buy something...", or the information is too vague), ignore[reason] (no substantive content, small talk, or repetition), and finish[action] (end processing and output the final decision: add/buffer/ignore, with only the action keyword in the brackets).
The appendix also lists few-shot examples used for teacher trace generation under a Gemini-3-Flash template.
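The JSON output contract sketched in these prompt fragments can be checked mechanically. A minimal validator for the "memory list" structure (field names taken from the fragments above; the function itself is illustrative, not part of the released tooling):

```python
import json

def validate_memory_output(raw: str) -> list:
    """Parse and lightly validate an extraction output against the schema
    sketched in the prompt: a JSON object whose "memory list" holds entries
    with "key", "memory_type", and "value" fields."""
    obj = json.loads(raw)
    entries = obj["memory list"]
    for entry in entries:
        assert isinstance(entry["key"], str)
        assert entry["memory_type"] in ("LongTermMemory", "UserMemory")
        assert isinstance(entry["value"], str)
    return entries
```

Validating at write time, before entries reach the store, is one concrete way the schema-consistency goal of the distilled 0.6B extractor could be enforced downstream.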