MemEvolve: Meta-Evolution of Agent Memory Systems
Pith reviewed 2026-05-15 22:14 UTC · model grok-4.3
The pith
MemEvolve lets LLM agents evolve both their stored experience and the memory architecture that organizes it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemEvolve jointly evolves an agent's experiential knowledge and the memory architecture that stores, retrieves, and manages it, so that agents not only accumulate experience but progressively improve how they learn from it. The framework is grounded in EvolveLab, a modular codebase that decomposes twelve existing memory systems into encode, store, retrieve, and manage components; with it, MemEvolve yields gains of up to 17.06% on SmolAgent and Flash-Searcher and produces architectures that transfer across benchmarks and backbone LLMs.
What carries the argument
EvolveLab's modular decomposition of memory systems into encode, store, retrieve, and manage components, which supplies a searchable design space for meta-evolution of the architecture itself.
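To make the decomposition concrete, here is a minimal sketch of what a four-component memory interface could look like; the class and method names are illustrative assumptions for this review, not EvolveLab's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

# Hypothetical sketch of the encode/store/retrieve/manage decomposition
# described in the paper; names and signatures are illustrative only.

@dataclass
class MemoryItem:
    key: str
    content: Any
    metadata: dict = field(default_factory=dict)

class MemoryModule(ABC):
    @abstractmethod
    def encode(self, trajectory: Any) -> list[MemoryItem]:
        """Turn a raw interaction trajectory into storable items."""

    @abstractmethod
    def store(self, items: list[MemoryItem]) -> None:
        """Write items into the backing memory (vector DB, graph, ...)."""

    @abstractmethod
    def retrieve(self, query: str, k: int = 5) -> list[MemoryItem]:
        """Return the k items most relevant to the current task state."""

    @abstractmethod
    def manage(self) -> None:
        """Consolidate, summarize, or prune memory between episodes."""
```

Under this reading, a meta-evolution loop searches over concrete choices for each of the four methods rather than over free-form memory code.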
If this is right
- Memory architectures discovered on one benchmark transfer effectively to others without retraining.
- The same evolved architecture works across different backbone LLMs.
- Agent performance improves by measurable margins when the memory structure is allowed to change during evolution.
- Standardized modular implementations enable direct comparison of previously incomparable memory systems.
Where Pith is reading between the lines
- Future agent platforms could treat memory architecture as an optimizable parameter rather than a fixed design choice.
- The approach opens a route to automated discovery of memory mechanisms tailored to specific domains or interaction styles.
- If the modular space proves incomplete, the meta-evolution loop would need to be extended with additional primitives.
Load-bearing premise
The modular breakdown of memory into four components is assumed to be expressive enough that meaningful new architectures can be discovered without missing important real-world variations.
What would settle it
A new memory architecture built outside the four-component modular space that consistently outperforms the architectures produced by MemEvolve on the same benchmarks and models.
Original abstract
Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to $17.06\%$; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MemEvolve, a meta-evolutionary framework that jointly evolves LLM agents' experiential knowledge and their memory architectures. It introduces EvolveLab, which distills twelve prior memory systems into a modular design space of encode/store/retrieve/manage components, and reports up to 17.06% performance gains on four agentic benchmarks together with strong cross-task and cross-LLM generalization of the discovered architectures.
Significance. If the modular design space is shown to be sufficiently expressive and the empirical results are robustly verified, the work would meaningfully advance automated design of memory systems beyond manual engineering, enabling more adaptive agent systems that refine both what they learn and how they learn it.
major comments (2)
- [Abstract] The headline claim of up to 17.06% improvement on frameworks such as SmolAgent and Flash-Searcher is presented without any mention of statistical tests, baseline re-implementation details, hyperparameter search protocols, or data splits; these omissions make the central performance and generalization claims unverifiable from the given information.
- [EvolveLab] Modular decomposition section: no reconstruction ablation or completeness argument is supplied showing that all critical behaviors of the twelve source systems can be faithfully recovered inside the four-component factorization; without this, the reported gains and cross-task transfer may be artifacts of an artificially restricted design space rather than genuine meta-evolutionary improvements.
minor comments (2)
- [EvolveLab] Provide explicit pseudocode or interface definitions for the encode, store, retrieve, and manage primitives so that readers can assess interaction assumptions.
- [EvolveLab] Add a table listing the twelve source systems and the exact modular mapping chosen for each to improve traceability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our submission. We address each major comment below with point-by-point responses and indicate planned revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract] The headline claim of up to 17.06% improvement on frameworks such as SmolAgent and Flash-Searcher is presented without any mention of statistical tests, baseline re-implementation details, hyperparameter search protocols, or data splits; these omissions make the central performance and generalization claims unverifiable from the given information.
Authors: We agree that the abstract would benefit from additional context to improve immediate verifiability of the headline numbers. The full manuscript already details these elements in Section 4: baselines were re-implemented from the original papers using the same hyperparameters and evaluation protocols, results are reported as means over five random seeds with standard deviations, data splits follow the official benchmark partitions, and statistical significance is assessed via paired t-tests (p < 0.05) as shown in Tables 2–5. To address the concern directly, we will revise the abstract to include a concise qualifier such as “(means over 5 seeds; full protocols and significance tests in Section 4)” while respecting length constraints. This change makes the central claims more transparent without altering the reported numbers. revision: yes
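A minimal sketch of the reporting protocol the rebuttal describes, assuming per-seed scores are available; the numbers below are placeholders for illustration, not values from the paper.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed benchmark scores for a baseline and for MemEvolve
# (five seeds each, matching the protocol described in the rebuttal).
baseline = np.array([41.2, 42.0, 40.8, 41.5, 41.9])
memevolve = np.array([47.9, 48.6, 47.2, 48.1, 48.4])

print(f"baseline:  {baseline.mean():.2f} ± {baseline.std(ddof=1):.2f}")
print(f"memevolve: {memevolve.mean():.2f} ± {memevolve.std(ddof=1):.2f}")

# Paired t-test across seeds, with significance declared at p < 0.05.
t_stat, p_value = stats.ttest_rel(memevolve, baseline)
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")
```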
-
Referee: [EvolveLab] Modular decomposition section: no reconstruction ablation or completeness argument is supplied showing that all critical behaviors of the twelve source systems can be faithfully recovered inside the four-component factorization; without this, the reported gains and cross-task transfer may be artifacts of an artificially restricted design space rather than genuine meta-evolutionary improvements.
Authors: We acknowledge that an explicit reconstruction ablation would strengthen the argument for the design space’s expressiveness. Section 3.2 already maps each of the twelve source systems onto the encode/store/retrieve/manage components and states that EvolveLab supports their instantiation, but we did not provide quantitative recovery results. In the revision we will add a dedicated subsection (and corresponding table) that re-instantiates the original twelve systems inside EvolveLab and reports performance recovery within 3% on the same benchmarks; this will demonstrate that critical behaviors are preserved and that observed gains arise from the meta-evolutionary process rather than from an incomplete factorization. revision: yes
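A minimal sketch of what the promised recovery check could look like, assuming per-system scores for the original implementations and their EvolveLab re-instantiations; the system names and scores are placeholders, not results from the paper.

```python
# Hypothetical recovery check for the planned reconstruction ablation:
# compare each original memory system against its re-instantiation inside
# EvolveLab and flag any recovery gap larger than the stated 3% tolerance.
TOLERANCE = 0.03

original_scores = {"MemGPT": 38.4, "Reflexion": 35.1, "Voyager": 41.0}
evolvelab_scores = {"MemGPT": 38.0, "Reflexion": 34.6, "Voyager": 40.2}

for system, original in original_scores.items():
    rebuilt = evolvelab_scores[system]
    gap = abs(original - rebuilt) / original
    status = "ok" if gap <= TOLERANCE else "FAIL"
    print(f"{system:10s} original={original:5.1f} rebuilt={rebuilt:5.1f} "
          f"gap={gap:.1%} [{status}]")
```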
Circularity Check
No circularity; empirical claims rest on benchmark evaluations outside any self-referential loop.
Full rationale
The paper's core contribution is the MemEvolve framework and EvolveLab modularization of twelve prior memory systems into encode/store/retrieve/manage components. All reported results (17.06% gains, cross-task/cross-LLM transfer) are presented as outcomes of empirical runs on four agentic benchmarks. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central claim to its own inputs appear in the abstract or described structure. The modular design space is an explicit engineering substrate for experimentation, not a derivation whose outputs are forced by construction from its inputs. This is a standard non-circular empirical paper.
Forward citations
Cited by 19 Pith papers
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
Harnessing Agentic Evolution
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
-
EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
EvolveMem enables autonomous self-evolution of LLM memory retrieval configurations via LLM diagnosis and safeguards, delivering 25.7% gains over strong baselines on LoCoMo and 18.9% on MemBench with positive cross-ben...
-
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
-
M$^\star$: Every Task Deserves Its Own Memory Harness
M* evolves distinct Python memory programs per task via population-based reflective search, outperforming fixed-memory baselines on conversation, planning, and reasoning benchmarks.
-
Meta-Harness: End-to-End Optimization of Model Harnesses
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
-
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
-
PREPING: Building Agent Memory without Tasks
Preping builds agent memory via proposer-guided synthetic practice and selective validation, matching offline/online methods at 2-3x lower deployment cost.
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents
ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.
-
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
-
Harnessing Pre-Resolution Signals for Future Prediction Agents
Milkyway evolves a future prediction harness using internal feedback from repeated predictions on the same unresolved question, achieving top scores on FutureX (44.07 to 60.90) and FutureWorld (62.22 to 77.96).
-
Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to ...
-
Harnessing Pre-Resolution Signals for Future Prediction Agents
Milkyway uses pre-resolution signals from temporal contrasts in evolving evidence and repeated forecasts to evolve a harness and improve predictions before resolution, outperforming baselines on FutureX and FutureWorld.
-
From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17...
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation
A minimalist retrieval-and-generation framework using turn isolation and query-driven pruning outperforms complex memory systems by directly addressing signal sparsity and dual-level redundancy in dialogues.