pith. machine review for the scientific record.

arxiv: 2604.11811 · v1 · submitted 2026-04-10 · 💻 cs.PL · cs.AI · cs.CL · cs.LG

Recognition: no theorem link

M*: Every Task Deserves Its Own Memory Harness

Mirror Xu, Shiwei Zhang, Shujie Liu, Wanlu Shi, Wenbo Pan, Xiangyang Zhou, Xiaohua Jia

Pith reviewed 2026-05-10 17:34 UTC · model grok-4.3

classification 💻 cs.PL · cs.AI · cs.CL · cs.LG
keywords memory systems · LLM agents · program evolution · task optimization · reflective search · agent architecture · domain specialization · code generation

The pith

LLM agents perform better with task-specific memory programs evolved as Python code than with any fixed shared design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that memory systems for large language model agents work best when each task receives its own specialized harness rather than a single fixed architecture. It represents memory as an executable Python program that combines a data schema, storage and retrieval logic, and workflow instructions. These programs are discovered automatically by a reflective, population-based evolutionary search that generates candidates and refines them after examining failures on the target task. Across four benchmarks covering conversation, embodied planning, and expert reasoning, the evolved programs deliver higher performance than standard fixed-memory baselines while developing visibly different internal structures for each domain.

Core claim

M* models an agent memory system as a Python program that jointly defines data Schema, storage Logic, and agent workflow Instructions. It optimizes these components together through reflective code evolution that maintains a population of candidate programs and iteratively improves them by analyzing evaluation failures. On four distinct benchmarks the resulting programs outperform fixed-memory baselines and exhibit structurally distinct processing mechanisms tailored to each domain.
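
To make the search loop concrete, here is a minimal sketch of the population-based reflective evolution the claim describes. The interfaces are assumptions, not the paper's API: the caller supplies `evaluate`, `reflect_and_mutate`, and `passes_checks` callables standing in for the paper's task evaluator, LLM-guided reflection step, and compile/runtime checks.

```python
import random
from typing import Callable

def evolve_memory_program(
    seed: str,
    evaluate: Callable[[str], tuple[float, list[str]]],        # -> (score, failure traces)
    reflect_and_mutate: Callable[[str, float, list[str]], str], # LLM-guided code patch
    passes_checks: Callable[[str], bool],                       # compile/runtime checks
    iterations: int = 20,
    pool_size: int = 8,
) -> str:
    """Return the best-scoring memory program found by a reflective,
    population-based search (hypothetical interfaces, see lead-in)."""
    score, failures = evaluate(seed)
    pool = {seed: score}                 # candidate program -> validation score
    failure_log = {seed: failures}       # candidate program -> observed failures

    for _ in range(iterations):
        # Sample a small tournament and keep the highest-scoring parent.
        parent = max(random.sample(list(pool), k=min(3, len(pool))), key=pool.get)
        # Reflection: the LLM inspects the parent's score and failures, emits a child program.
        child = reflect_and_mutate(parent, pool[parent], failure_log[parent])
        if child in pool or not passes_checks(child):
            continue                     # drop duplicates and broken programs
        pool[child], failure_log[child] = evaluate(child)
        # Trim the population back to the strongest candidates.
        for weak in sorted(pool, key=pool.get)[:-pool_size]:
            del pool[weak], failure_log[weak]

    return max(pool, key=pool.get)       # winner goes on to the held-out test set
```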

What carries the argument

The memory program: a single Python script that encodes the Schema for data organization, the Logic for storage and retrieval operations, and the Instructions for how the agent interacts with the stored information, all discovered jointly through reflective population-based evolution.
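
As a rough illustration of what such a single-file harness could contain, the sketch below pairs a Schema (dataclasses), Logic (a KnowledgeBase with write/read), and Instructions (a prompt string). The field names and the keyword-overlap retrieval are illustrative assumptions, not the paper's evolved design, which is discovered automatically and differs per task.

```python
from dataclasses import dataclass, field

# Schema: what gets stored (fields are illustrative, not the paper's schema).
@dataclass
class KnowledgeItem:
    topic: str
    content: str
    source_turn: int
    tags: list[str] = field(default_factory=list)

@dataclass
class Query:
    question: str
    top_k: int = 5

# Logic: how items are written and retrieved. A naive keyword-overlap scorer
# stands in for whatever indexing an evolved program might discover.
class KnowledgeBase:
    def __init__(self) -> None:
        self.items: list[KnowledgeItem] = []

    def write(self, item: KnowledgeItem) -> None:
        self.items.append(item)

    def read(self, query: Query) -> list[KnowledgeItem]:
        words = set(query.question.lower().split())
        ranked = sorted(
            self.items,
            key=lambda it: len(words & set(it.content.lower().split())),
            reverse=True,
        )
        return ranked[: query.top_k]

# Instructions: workflow text injected into the task agent's prompt.
INSTRUCTIONS = (
    "Before answering, call read() with the user's question and ground the "
    "answer in the returned KnowledgeItems; write() any new durable fact."
)
```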

If this is right

  • Performance improves over fixed-memory baselines on conversation, embodied planning, and expert reasoning tasks.
  • Evolved programs develop structurally distinct mechanisms for each domain instead of converging to one general form.
  • Joint optimization of schema, logic, and instructions explores a broader design space than hand-crafted fixed systems.
  • Specialized per-task memory yields better results than general-purpose memory paradigms across the evaluated settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same evolution method could be applied to other reusable agent components such as planning modules or tool interfaces.
  • The domain-specific structures suggest that attempts to find one universal memory architecture for all agent tasks may be fundamentally limited.
  • Testing whether evolved programs transfer to related but unseen tasks would clarify the scope of the discovered specializations.
  • Extending the approach to longer or more open-ended interactions might expose scalability limits of the current failure-analysis loop.

Load-bearing premise

Reflective population-based code evolution applied to the chosen benchmarks will reliably produce memory programs that are both superior in performance and generalizable beyond the specific tasks and failure modes tested.

What would settle it

A new benchmark task on which no evolved memory program outperforms a carefully tuned fixed-memory baseline, or on which evolved programs from different domains converge to identical internal structures.

Figures

Figures reproduced from arXiv: 2604.11811 by Mirror Xu, Shiwei Zhang, Shujie Liu, Wanlu Shi, Wenbo Pan, Xiangyang Zhou, Xiaohua Jia.

Figure 1: Evolved memory harnesses across tasks. Starting from shared seeds (center), M* evolves structurally distinct harnesses for each task. Each node is an evolved program.

Figure 2: System overview of M*. Starting from a seed memory program (0), the system maintains a population pool (1) and iteratively improves programs through evaluation on task episodes (2), LLM-guided reflection and code mutation (3), and compile/runtime quality checks (4). The best-scoring program is evaluated on a held-out test set (5).

Figure 3: Evolution trajectory. Validation score across iterations for all benchmarks. Most benchmarks follow a common phased pattern: early iterations correct structural errors in seed programs, middle iterations produce the largest gains by discovering task-relevant indexing strategies, and later iterations refine retrieval precision with diminishing returns.

Figure 4: Program embedding landscape. Each evolved program is embedded with a code embedding model and projected to 2D via t-SNE. (a, b) Population-based search (M*) explores structurally diverse regions of the program space, while linear search concentrates in a narrow neighborhood; colored edges trace parent–child lineage. (c) All programs across five benchmarks, colored by dataset.

Figure 5: Cross-task transfer of evolved memory harnesses. Each panel evaluates memory harnesses evolved on different source benchmarks against a single target benchmark. The dashed line marks the universal seed baseline. Programs evolved on their native task (highlighted) consistently outperform those transferred from other tasks, confirming that memory structure must be co-optimized with the target task.
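
The transfer protocol in Figure 5 amounts to a small matrix computation. The sketch below assumes an `evaluate(program, episodes)` scorer plus dictionaries of evolved programs and held-out episodes keyed by task name; these interfaces are stand-ins, not the paper's code.

```python
def transfer_matrix(evolved_programs, benchmarks, evaluate):
    """Score every source-task program on every target benchmark.

    evolved_programs: dict of source task name -> evolved program
    benchmarks:       dict of target task name -> held-out episodes
    evaluate:         callable (program, episodes) -> float
    """
    return {
        (source, target): evaluate(program, episodes)
        for source, program in evolved_programs.items()
        for target, episodes in benchmarks.items()
    }

def native_vs_transferred(matrix):
    """Compare each target's natively evolved program against the best transfer."""
    report = {}
    for target in {tgt for _, tgt in matrix}:
        transferred = [score for (src, tgt), score in matrix.items()
                       if tgt == target and src != target]
        report[target] = {
            "native": matrix.get((target, target)),
            "best_transferred": max(transferred, default=None),
        }
    return report
```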
read the original abstract

Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to specific domains, such as semantic retrieval for conversations or skills reused for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method; this approach employs a population-based search strategy and analyzes evaluation failures to iteratively refine the candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ improves performance over existing fixed-memory baselines robustly across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task explores a broad design space and provides a superior solution compared to general-purpose memory paradigms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces M*, a method to automatically discover task-optimized memory harnesses for LLM agents by representing each memory system as an executable Python program (data Schema + storage Logic + agent Instructions) and optimizing it via reflective population-based code evolution that refines candidates by analyzing evaluation failures. It evaluates the approach on four benchmarks spanning conversation, embodied planning, and expert reasoning, claiming robust performance gains over fixed-memory baselines and the emergence of structurally distinct processing mechanisms tailored to each domain.

Significance. If the performance claims hold under proper controls, the work would be significant for demonstrating that automated search over memory program designs can outperform hand-crafted fixed architectures across domains, supporting the broader idea that memory mechanisms should be specialized rather than general-purpose. This could influence agent design by shifting focus toward evolutionary co-optimization of memory and task logic.

major comments (2)
  1. §4 (Evaluation): The abstract and evaluation claim 'robust' improvements over fixed-memory baselines on all four tasks, but no details are provided on the specific baselines, number of independent runs, statistical significance tests, or performance variance. This information is load-bearing for validating the central performance claim and distinguishing genuine gains from noise.
  2. Method and Evaluation sections: The optimization 'analyzes evaluation failures to iteratively refine the candidate programs,' yet the manuscript does not indicate whether these failures come from held-out data or the same benchmark distribution used for final reporting. Without this or explicit distribution-shift tests, the reported superiority risks arising from overfitting to task-specific artifacts rather than discovering generalizable memory mechanisms.
minor comments (2)
  1. The LaTeX notation M$^\star$ should be introduced with a clear definition in the introduction and used consistently.
  2. Consider including a table or figure explicitly comparing the evolved program structures (Schema/Logic/Instructions) across the four domains to substantiate the 'structurally distinct' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights key areas where our evaluation and methodological descriptions can be strengthened. We address each major comment below and commit to revisions that improve clarity and rigor without altering the core contributions.

read point-by-point responses
  1. Referee: §4 (Evaluation): The abstract and evaluation claim 'robust' improvements over fixed-memory baselines on all four tasks, but no details are provided on the specific baselines, number of independent runs, statistical significance tests, or performance variance. This information is load-bearing for validating the central performance claim and distinguishing genuine gains from noise.

    Authors: We agree that these details are necessary to substantiate the robustness claims. In the revised manuscript, we will expand Section 4 (and add an experimental details appendix) to explicitly list the fixed-memory baselines with citations, report the number of independent runs conducted, include statistical significance testing (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values), and present performance variance via standard deviations or confidence intervals across runs (see the sketch after these responses). These additions will directly address the concern and allow readers to assess the reliability of the reported gains. revision: yes

  2. Referee: Method and Evaluation sections: The optimization 'analyzes evaluation failures to iteratively refine the candidate programs,' yet the manuscript does not indicate whether these failures come from held-out data or the same benchmark distribution used for final reporting. Without this or explicit distribution-shift tests, the reported superiority risks arising from overfitting to task-specific artifacts rather than discovering generalizable memory mechanisms.

    Authors: This is a fair and important point about potential overfitting. The current manuscript does not explicitly describe the data partitioning used during the reflective evolution process. In the revision, we will clarify in the Method section whether optimization failures are drawn from development splits (where available) versus the full benchmark distribution, and we will add explicit statements on how final results are computed on held-out portions. We will also incorporate additional experiments or ablations that test for distribution shift (e.g., cross-benchmark generalization or leave-one-task-out evaluations) to demonstrate that the evolved memory programs capture generalizable mechanisms rather than benchmark-specific artifacts. If certain benchmarks lack predefined splits, we will acknowledge this limitation and discuss its implications. revision: yes
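
To make the commitment in point 1 concrete, here is a minimal sketch of the kind of paired comparison the response describes, using SciPy over matched per-run scores supplied by the experimenter; no numbers from the paper are embedded here, and the test choice is an assumption rather than the authors' protocol.

```python
import numpy as np
from scipy import stats

def paired_comparison(evolved_scores, baseline_scores, alpha=0.05):
    """Paired t-test and Wilcoxon signed-rank test over matched per-run scores
    for an evolved memory program vs. a fixed-memory baseline on one task."""
    evolved = np.asarray(evolved_scores, dtype=float)
    baseline = np.asarray(baseline_scores, dtype=float)
    diff = evolved - baseline

    t_stat, t_p = stats.ttest_rel(evolved, baseline)   # parametric paired test
    w_stat, w_p = stats.wilcoxon(evolved, baseline)    # nonparametric check

    return {
        "mean_gain": float(diff.mean()),
        "std_gain": float(diff.std(ddof=1)),
        "paired_t_p": float(t_p),
        "wilcoxon_p": float(w_p),
        "significant_at_alpha": bool(t_p < alpha and w_p < alpha),
    }
```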

Circularity Check

0 steps flagged

No circularity: purely empirical search with no derivations or load-bearing self-references

full rationale

The paper describes an empirical procedure—population-based reflective code evolution that jointly optimizes Schema, Logic, and Instructions in Python memory programs—then reports measured performance gains on four benchmarks. No equations, first-principles derivations, or mathematical predictions exist that could reduce to fitted inputs or self-definitions by construction. The evolution process is presented as a search heuristic that inspects evaluation failures; the reported outcomes are direct experimental measurements rather than analytic claims. No self-citations are invoked to justify uniqueness or forbid alternatives, and no known empirical patterns are renamed as novel unification. The central claim therefore remains an independent empirical finding whose validity can be checked against the stated benchmarks and baselines without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the unproven premise that memory systems are best represented as jointly optimizable Python programs and that evolutionary search over them yields transferable gains. No specific numerical free parameters are named in the abstract.

axioms (2)
  • domain assumption A memory system optimized for one purpose frequently fails to transfer to others
    Stated in the abstract as the motivation for per-task optimization.
  • domain assumption Reflective code evolution with population-based search and failure analysis can iteratively improve candidate memory programs
    Core mechanism described but not justified in abstract.
invented entities (1)
  • memory program no independent evidence
    purpose: Encapsulates data Schema, storage Logic, and agent workflow Instructions as an executable Python artifact
    New modeling choice introduced to enable joint optimization; no independent evidence provided beyond the method itself.

pith-pipeline@v0.9.0 · 5539 in / 1460 out tokens · 108492 ms · 2026-05-10T17:34:40.321222+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. In ICLR, 2026. Oral.

  2. [2]

    PRBench: Large-scale expert rubrics for evaluating high-stakes professional reasoning

    Afra Feyza Akyurek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, et al. PRBench: Large-scale expert rubrics for evaluating high-stakes professional reasoning. arXiv preprint arXiv:2511.11562, 2025

  3. [3]

    Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. HealthBench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025

  4. [4]

    FLEX: Continuous agent evolution via forward learning from experience

    Zhicheng Cai, Xinyuan Guo, Yu Pei, Jiangtao Feng, Jiangjie Chen, Ya-Qin Zhang, Wei-Ying Ma, Mingxuan Wang, and Hao Zhou. FLEX: Continuous agent evolution via forward learning from experience. arXiv preprint arXiv:2511.06449, 2025

  5. [5]

    EvoPrompting: Language models for code-level neural architecture search

    Angelica Chen, David Dohan, and David So. EvoPrompting: Language models for code-level neural architecture search. In NeurIPS, 2023

  6. [6]

    A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 2025

  7. [7]

    ADAS: Automated design of agentic systems

    Shengran Hu, Cong Lu, and Jeff Clune. ADAS: Automated design of agentic systems. In ICLR, 2025

  8. [8]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...

  9. [9]

    ReasoningBank: Scaling agent self-evolving with reasoning memory

    Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, et al. ReasoningBank: Scaling agent self-evolving with reasoning memory. In ICLR, 2026

  10. [10]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In UIST, 2023

  11. [11]

    SQuAD: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016

  12. [12]

    Mathematical discoveries from program search with large language models

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 625:468–475, 2024

  13. [13]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, 2023

  14. [14]

    Evaluating memory structure in LLM agents

    Alina Shutova, Alexandra Olenina, Ivan Vinogradov, and Anton Sinitsin. Evaluating memory structure in LLM agents. arXiv preprint arXiv:2602.11243, 2026

  15. [15]

    Voyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024

  16. [16]

    Augmenting language models with long-term memory

    Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In NeurIPS, 2023

  17. [17]

    Agent workflow memory

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. In ICML, 2025

  18. [18]

    Evo-Memory: Benchmarking LLM agent test-time learning with self-evolving memory

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, and Derek Zhiyuan Cheng. Evo-Memory: Benchmarking LLM agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857, 2025

  19. [19]

    Learning to continually learn via meta-learning agentic memory designs

    Yiming Xiong, Shengran Hu, and Jeff Clune. Learning to continually learn via meta-learning agentic memory designs. arXiv preprint arXiv:2602.07755, 2026

  20. [20]

    MemEvolve: Meta-evolution of agent memory systems

    Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. MemEvolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025

  21. [21]

    MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

    Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. MemSkill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026

  22. [22]

    A survey on the memory mechanism of large language model based agents

    Zeyu Zhang, Xiaohe Zhang, Yuanpei Wang, Shengjie Yan, and Rui Sun. A survey on the memory mechanism of large language model based agents. ACM Transactions on Information Systems, 2025

  23. [23]

    ExpeL: LLM agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. In AAAI, 2024

  24. [24]

    ""Vanilla␣RAG:␣store␣text␣chunks␣in␣ChromaDB,␣retrieve␣by␣semantic␣similarity

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. InAAAI, 2024. A Algorithm B Dataset Details Table 5 summarizes the data splits for each benchmark configuration. For LoCoMo, we exclude category 5 (adversarial and unanswerable questions), which tests refusal capability rather ...
