arxiv: 2511.20857 · v1 · submitted 2025-11-25 · 💻 cs.CL · cs.AI


Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Benjamin Coleman, Chi Wang, Derek Zhiyuan Cheng, Ed H. Chi, Fernando Pereira, Jingrui He, Mengting Ai, Noveen Sachdeva, Shuo Chen, Tianxin Wei, Wang-Cheng Kang, Xuying Ning, Yuanchen Bei, Yunzhe Li, Zhankui He


Pith reviewed 2026-05-14 23:08 UTC · model grok-4.3


keywords: LLM agents · self-evolving memory · test-time learning · memory modules · streaming benchmark · experience reuse · ReMem pipeline


The pith

LLM agents achieve continual improvement on streaming tasks by using the ReMem pipeline to integrate reasoning, actions, and memory updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Evo-Memory, a benchmark that structures tasks into sequential streams to test how LLM agents can evolve their memory during deployment. It evaluates over ten memory modules across multiple datasets and proposes ReMem as a baseline method that refines memory through an action-think-memory process. A sympathetic reader would care because this addresses the gap in handling continuous interactions where agents must accumulate experience rather than relying on static memory retrieval.

Core claim

ReMem, an action-think-memory refine pipeline, tightly integrates reasoning, task actions, and memory updates to achieve continual improvement in LLM agents on streaming tasks.

What carries the argument

ReMem, the action-think-memory refine pipeline that couples reasoning and actions with memory updates for experience accumulation.
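
As a rough illustration of what an action-think-memory refine loop could look like, the sketch below lets a policy pick one of three operations per step. The operation names (`think`, `refine`, `act`), the `AgentState` fields, and the scripted policy are all assumptions made for this example; the summary above does not specify ReMem's actual interface, and a real agent would call an LLM where the stub policy sits.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AgentState:
    task: str
    memory: List[str]  # shared across episodes and mutated in place
    thoughts: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)


# Hypothetical policy type: maps the current state to (operation, payload).
Policy = Callable[[AgentState], Tuple[str, str]]


def run_episode(task: str, memory: List[str], policy: Policy,
                max_steps: int = 8) -> AgentState:
    """Interleave reasoning, memory refinement, and task actions for one task."""
    state = AgentState(task=task, memory=memory)
    for _ in range(max_steps):
        op, payload = policy(state)
        if op == "think":
            state.thoughts.append(payload)
        elif op == "refine":
            # Simplest possible memory update: add the note if it is new.
            if payload not in state.memory:
                state.memory.append(payload)
        elif op == "act":
            state.actions.append(payload)
            break  # assumption: one task action ends the episode
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


def scripted_policy(state: AgentState) -> Tuple[str, str]:
    """Stand-in for an LLM call: think once, write a note, then act."""
    if not state.thoughts:
        return "think", f"plan for: {state.task}"
    if not any(state.task in note for note in state.memory):
        return "refine", f"note: {state.task} solved via plan"
    return "act", f"answer({state.task})"


shared_memory: List[str] = []
first = run_episode("t1", shared_memory, scripted_policy)
second = run_episode("t2", shared_memory, scripted_policy)
print(shared_memory)
```

Because the same list is passed to every episode, notes written while solving one task are visible to the next, which is the property the core claim relies on.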

If this is right

LLM agents can maintain and improve performance across evolving task streams without retraining.
Memory modules can be unified and compared in a streaming setting for better experience reuse.
The ExpRAG baseline provides a way to retrieve and utilize prior experience in agent interactions.
Performance gains appear on both multi-turn goal-oriented tasks and single-turn reasoning datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such integrated memory evolution might apply to embodied agents operating in dynamic physical environments.
Future work could test if ReMem reduces hallucinations by maintaining updated contextual insights.
Scaling this to longer task sequences could reveal limits in memory capacity for LLM agents.

Load-bearing premise

That the chosen sequential task streams and the implemented memory modules faithfully capture the dynamics of real-world continuous interactions where memory evolution is required, without hidden implementation biases affecting the comparisons.

What would settle it

If ReMem fails to show improvement over standard memory retrieval methods on an independent streaming benchmark with new task sequences, the claim of continual improvement through tight integration would be falsified.

Original abstract

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.
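
The search, adapt, and evolve cycle described in the abstract can be sketched as a small streaming loop. This is a minimal illustration under stated assumptions, not the benchmark's code: word-overlap ranking stands in for a real retriever, the environment is assumed to reveal the correct outcome after each attempt, and `retrieve`, `run_stream`, and `reuse_solver` are hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

Experience = Tuple[str, str]  # (task text, outcome revealed after the attempt)
Solver = Callable[[str, List[Experience]], str]


def retrieve(memory: List[Experience], query: str, k: int = 2) -> List[Experience]:
    """Rank stored experiences by word overlap with the query (stand-in retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        memory,
        key=lambda exp: len(query_words & set(exp[0].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def run_stream(stream: List[Tuple[str, str]], solve: Solver,
               k: int = 2) -> Tuple[List[bool], List[Experience]]:
    """Process tasks in order; memory grows after every interaction."""
    memory: List[Experience] = []
    hits: List[bool] = []
    for task, gold in stream:
        context = retrieve(memory, task, k)   # search
        answer = solve(task, context)         # adapt using prior experience
        hits.append(answer == gold)
        memory.append((task, gold))           # evolve (assumes outcome feedback)
    return hits, memory


def reuse_solver(task: str, context: List[Experience]) -> str:
    """Toy solver: reuse the outcome of an identical past task, else give up."""
    for past_task, outcome in context:
        if past_task == task:
            return outcome
    return "unknown"


stream = [
    ("capital of France", "Paris"),
    ("capital of Japan", "Tokyo"),
    ("capital of France", "Paris"),
]
hits, memory = run_stream(stream, reuse_solver)
print(hits)
```

The per-task hit list, rather than a single aggregate score, is what makes improvement over the course of a stream visible.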

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Evo-Memory adds a streaming benchmark for agent memory evolution and ReMem as an integrated pipeline, but the reported gains may come from extra reasoning steps rather than memory changes themselves.


The main thing here is that the paper turns standard datasets into sequential task streams so agents must accumulate and update memory across interactions instead of just retrieving from static history. They also propose ReMem, which loops action, thinking, and memory refinement together, and they unify more than ten existing memory modules for head-to-head testing on ten datasets plus the ExpRAG baseline for experience reuse. That unification and the shift to streaming evaluation are the concrete steps forward; they make direct comparisons easier and move closer to the kind of continual learning needed in interactive or embodied settings. The benchmark construction itself looks practical and reusable for anyone working on long-horizon agents.

The soft spot is the missing isolation for ReMem. Without an ablation that keeps the extra reasoning passes but freezes the memory content, it is hard to tell whether the measured lift comes from actual memory evolution or simply from spending more tokens on thinking and acting at each step. The paper reports results across the datasets, but that control would strengthen the causal claim. If it is present in the full text it should be highlighted; if not, the interpretation stays loose.

This work is aimed at researchers building or evaluating memory systems for agents that face ongoing task streams. A reader who needs a ready testbed or a set of baseline numbers to compare against would find it useful.

I would send it to peer review. The benchmark fills a real gap and the evaluation scope is broad enough to justify referee time, even if the method needs tighter controls on where the gains originate.

Referee Report

2 major / 1 minor

Summary. The paper introduces Evo-Memory, a streaming benchmark and framework for evaluating self-evolving memory in LLM agents on sequential task streams. It unifies and implements over ten memory modules, evaluates them on 10 multi-turn and single-turn datasets, provides ExpRAG as a baseline for experience retrieval and utilization, and proposes ReMem—an action-think-memory refine pipeline claimed to integrate reasoning, actions, and memory updates for continual improvement.

Significance. If the empirical comparisons hold after addressing controls, the work would offer a useful standardized benchmark for assessing dynamic memory evolution in LLM agents, filling a gap between static retrieval settings and real-world continuous interaction scenarios where agents must accumulate and reuse experience across evolving tasks.

major comments (2)

[Results/Evaluation] Results/Evaluation section: The manuscript reports gains for ReMem over ExpRAG and the other modules but does not include a controlled ablation that disables memory updates while preserving the same number of reasoning and action steps (and thus inference budget). Without this isolation, it remains possible that observed improvements arise from extra tokens or passes rather than the self-evolution mechanism itself.
[Benchmark construction] Benchmark construction: The description of how the 10 datasets are restructured into sequential task streams lacks explicit controls or statistics on task ordering, difficulty progression, and potential selection biases that could inadvertently favor modules with particular update heuristics.
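
The control requested in the first major comment can be sketched as two arms that differ only in whether memory is written. This is an illustrative harness under loud assumptions: solver calls stand in for the inference budget (a real study would also match tokens), outcome feedback is assumed to be available, and `run_arm` and `toy_solve` are hypothetical names, not part of the paper.

```python
from typing import Callable, Dict, List, Tuple

Solver = Callable[[str, Dict[str, str]], str]


def run_arm(stream: List[Tuple[str, str]], solve: Solver,
            update_memory: bool) -> Dict[str, float]:
    """Run one arm of the ablation; only the memory write is toggled."""
    memory: Dict[str, str] = {}
    solver_calls = 0
    correct = 0
    for task, gold in stream:
        answer = solve(task, memory)  # same call pattern in both arms
        solver_calls += 1
        correct += int(answer == gold)
        if update_memory:
            memory[task] = gold       # assumption: feedback reveals the outcome
    return {"accuracy": correct / len(stream), "solver_calls": solver_calls}


def toy_solve(task: str, memory: Dict[str, str]) -> str:
    """Toy solver that can only answer tasks it has already seen."""
    return memory.get(task, "unknown")


stream = [("q1", "a"), ("q2", "b"), ("q1", "a"), ("q2", "b")]
full = run_arm(stream, toy_solve, update_memory=True)
frozen = run_arm(stream, toy_solve, update_memory=False)
print(full, frozen)
```

With the call count held equal, any accuracy gap between the two arms can be attributed to the memory writes rather than to extra passes.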

minor comments (1)

[Abstract] The abstract states that results are obtained across 10 datasets but provides no quantitative highlights, error bars, or dataset names; adding one or two key performance numbers would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We agree that the two major points raised require clarification and additional experiments to strengthen the manuscript. We will revise the paper accordingly and provide point-by-point responses below.


Referee: Results/Evaluation section: The manuscript reports gains for ReMem over ExpRAG and the other modules but does not include a controlled ablation that disables memory updates while preserving the same number of reasoning and action steps (and thus inference budget). Without this isolation, it remains possible that observed improvements arise from extra tokens or passes rather than the self-evolution mechanism itself.

Authors: We agree that isolating the contribution of memory updates is essential. In the revised manuscript, we will add a controlled ablation study in which memory updates are disabled while keeping the exact same number of reasoning and action steps (and thus the same inference budget) as the full ReMem pipeline. This will allow direct comparison to confirm that performance gains arise from the self-evolution mechanism rather than additional computational steps.
Revision: yes

Referee: Benchmark construction: The description of how the 10 datasets are restructured into sequential task streams lacks explicit controls or statistics on task ordering, difficulty progression, and potential selection biases that could inadvertently favor modules with particular update heuristics.

Authors: We acknowledge the need for greater transparency in benchmark construction. We will expand the relevant section to include: (1) explicit statistics on task ordering and difficulty progression across the sequential streams, (2) details on how datasets were selected and restructured, and (3) an analysis of potential selection biases together with controls to ensure that no memory module is inadvertently advantaged by the stream construction.
Revision: yes
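
One way the task-ordering statistics discussed in this exchange might be gathered is to replay the same task set under several seeded shuffles and report the spread in accuracy. The sketch below is hypothetical throughout: the easy/hard task model, `stream_accuracy`, and `order_sensitivity` are invented for illustration and say nothing about how the benchmark's streams are actually built.

```python
import random
import statistics
from typing import Dict, Iterable, List, Set, Tuple

Task = Tuple[str, str]  # (kind, topic) with kind in {"easy", "hard"}


def stream_accuracy(stream: List[Task]) -> float:
    """Toy agent: easy tasks always succeed and store their topic;
    hard tasks succeed only if the topic is already in memory."""
    seen_topics: Set[str] = set()
    correct = 0
    for kind, topic in stream:
        if kind == "easy":
            correct += 1
            seen_topics.add(topic)  # memory update
        elif topic in seen_topics:
            correct += 1
    return correct / len(stream)


def order_sensitivity(tasks: List[Task], seeds: Iterable[int]) -> Dict[str, float]:
    """Accuracy statistics over seeded random orderings of the same tasks."""
    accuracies = []
    for seed in seeds:
        order = list(tasks)
        random.Random(seed).shuffle(order)  # reproducible per-seed ordering
        accuracies.append(stream_accuracy(order))
    return {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.pstdev(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
    }


tasks = [("easy", "a"), ("easy", "b"), ("hard", "a"), ("hard", "b")]
report = order_sensitivity(tasks, range(20))
print(report)
```

In this toy setting an easy-first ordering scores 1.0 and a hard-first ordering scores 0.5 on identical tasks, which is the kind of ordering effect the referee asks the authors to quantify.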

Circularity Check

0 steps flagged

No circularity: empirical benchmark and pipeline on external datasets


The paper introduces Evo-Memory as a streaming benchmark constructed from standard multi-turn datasets, unifies existing memory modules for comparison, and proposes ReMem as an empirical action-think-memory pipeline. No equations, fitted parameters, or self-citations reduce the reported gains to quantities defined by the same inputs or prior author work by construction. All evaluations rely on externally defined task streams and baselines like ExpRAG, with the central claims resting on comparative performance rather than any self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that memory evolution is necessary for long-term agent performance; no free parameters or new entities are introduced in the abstract description.

axioms (1)

Domain assumption: statefulness via memory is essential for LLM agents performing long-term planning and problem-solving.
Explicitly stated as the motivation in the first sentence of the abstract.



Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
cs.AI 2026-05 unverdicted novelty 8.0

RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
cs.AI 2026-05 unverdicted novelty 7.0

MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
cs.AI 2026-05 unverdicted novelty 7.0

AgentPSO evolves reusable multi-agent reasoning skills via PSO-inspired natural-language updates, outperforming static agents and test-time multi-agent baselines on math and general reasoning tasks with cross-benchmar...
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
cs.RO 2026-05 unverdicted novelty 7.0

MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
cs.AI 2026-05 unverdicted novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
cs.AI 2026-04 unverdicted novelty 7.0

SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
M$^\star$: Every Task Deserves Its Own Memory Harness
cs.PL 2026-04 unverdicted novelty 7.0

M* evolves distinct Python memory programs per task via population-based reflective search, outperforming fixed-memory baselines on conversation, planning, and reasoning benchmarks.
Context Training with Active Information Seeking
cs.CL 2026-05 unverdicted novelty 6.0

Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...
Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation
cs.AI 2026-05 unverdicted novelty 6.0

Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution und...
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
cs.CL 2026-04 unverdicted novelty 6.0

TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing t...
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
cs.AI 2026-05 unverdicted novelty 5.0 partial

Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
cs.CL 2026-05 unverdicted novelty 5.0

MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI 2026-05 conditional novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
cs.AI 2026-04 unverdicted novelty 5.0

Bian Que is an agentic framework using a unified operational paradigm, flexible Skill Arrangement, and self-evolving mechanism to automate O&M tasks, achieving 75% alert reduction and over 50% MTTR cut in production d...
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
cs.AI 2026-04 unverdicted novelty 5.0

Bian Que deploys an agentic system with flexible skills and self-evolution on a major e-commerce search engine, cutting alerts by 75%, reaching 80% root-cause accuracy, and halving resolution time.
From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
cs.SE 2026-04 unverdicted novelty 5.0

Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17...
Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve
cs.AI 2026-04 unverdicted novelty 5.0

Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
cs.MA 2026-04 unverdicted novelty 5.0

MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.
Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems
cs.MA 2026-03 unverdicted novelty 5.0

LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 4.0

The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
ActionNex: A Virtual Outage Manager for Cloud Computing
cs.AI 2026-04 unverdicted novelty 4.0

ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.
