Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Pith reviewed 2026-05-14 23:08 UTC · model grok-4.3
The pith
LLM agents achieve continual improvement on streaming tasks by using the ReMem pipeline to integrate reasoning, actions, and memory updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReMem, an action-think-memory refine pipeline, tightly integrates reasoning, task actions, and memory updates to achieve continual improvement in LLM agents on streaming tasks.
What carries the argument
ReMem, the action-think-memory refine pipeline that couples reasoning and actions with memory updates for experience accumulation.
If this is right
- LLM agents can maintain and improve performance across evolving task streams without retraining.
- Memory modules can be unified and compared in a streaming setting for better experience reuse.
- The ExpRAG baseline provides a way to retrieve and reuse prior experience in agent interactions.
- Performance gains appear on both multi-turn goal-oriented tasks and single-turn reasoning datasets.
Where Pith is reading between the lines
- Such integrated memory evolution might apply to embodied agents operating in dynamic physical environments.
- Future work could test if ReMem reduces hallucinations by maintaining updated contextual insights.
- Scaling this to longer task sequences could reveal limits in memory capacity for LLM agents.
Load-bearing premise
That the chosen sequential task streams and the implemented memory modules faithfully capture the dynamics of real-world continuous interactions where memory evolution is required, without hidden implementation biases affecting the comparisons.
What would settle it
If ReMem fails to show improvement over standard memory retrieval methods on an independent streaming benchmark with new task sequences, the claim of continual improvement through tight integration would be falsified.
Original abstract
Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.
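The search, adapt, and evolve cycle described in the abstract can be pictured with a small sketch. This is not the Evo-Memory code: `Experience`, `ListMemory`, `run_stream`, and `toy_solver` are hypothetical names, retrieval is plain word overlap, and the sketch assumes the environment reveals the reference answer after each attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Experience:
    task: str
    answer: str
    feedback: str  # assumption: the reference answer is revealed after the attempt


@dataclass
class ListMemory:
    """Hypothetical memory module: a flat list with word-overlap retrieval."""
    items: List[Experience] = field(default_factory=list)

    def search(self, task: str, k: int = 2) -> List[Experience]:
        words = set(task.lower().split())
        scored = []
        for index, exp in enumerate(self.items):
            overlap = len(words & set(exp.task.lower().split()))
            if overlap > 0:
                scored.append((overlap, index, exp))
        # Highest overlap first; newer entries win ties.
        scored.sort(key=lambda s: (-s[0], -s[1]))
        return [exp for _, _, exp in scored[:k]]

    def update(self, exp: Experience) -> None:
        self.items.append(exp)


Solver = Callable[[str, List[Experience]], str]


def run_stream(stream: Sequence[Tuple[str, str]], solve: Solver,
               memory: ListMemory) -> List[bool]:
    """Process tasks strictly in order and record per-step correctness."""
    outcomes = []
    for task, gold in stream:
        retrieved = memory.search(task)                         # search
        answer = solve(task, retrieved)                         # adapt
        memory.update(Experience(task, answer, feedback=gold))  # evolve
        outcomes.append(answer == gold)
    return outcomes


def toy_solver(task: str, retrieved: List[Experience]) -> str:
    """Toy agent: reuses feedback from an identical past task, else gives up."""
    for exp in retrieved:
        if exp.task == task:
            return exp.feedback
    return "unknown"


stream = [("color of sky", "blue"), ("color of grass", "green"), ("color of sky", "blue")]
memory = ListMemory()
outcomes = run_stream(stream, toy_solver, memory)
print(outcomes)
```

In this toy stream only the third task succeeds, because it repeats a task whose feedback was stored earlier. A real memory module would replace both the retrieval rule and the update rule.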
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Evo-Memory, a streaming benchmark and framework for evaluating self-evolving memory in LLM agents on sequential task streams. It unifies and implements over ten memory modules, evaluates them on 10 multi-turn and single-turn datasets, provides ExpRAG as a baseline for experience retrieval and utilization, and proposes ReMem—an action-think-memory refine pipeline claimed to integrate reasoning, actions, and memory updates for continual improvement.
Significance. If the empirical comparisons hold after addressing controls, the work would offer a useful standardized benchmark for assessing dynamic memory evolution in LLM agents, filling a gap between static retrieval settings and real-world continuous interaction scenarios where agents must accumulate and reuse experience across evolving tasks.
major comments (2)
- [Results/Evaluation] Results/Evaluation section: The manuscript reports gains for ReMem over ExpRAG and the other modules but does not include a controlled ablation that disables memory updates while preserving the same number of reasoning and action steps (and thus inference budget). Without this isolation, it remains possible that observed improvements arise from extra tokens or passes rather than the self-evolution mechanism itself.
- [Benchmark construction] Benchmark construction: The description of how the 10 datasets are restructured into sequential task streams lacks explicit controls or statistics on task ordering, difficulty progression, and potential selection biases that could inadvertently favor modules with particular update heuristics.
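One way to produce the ordering statistics requested in the second major comment is to evaluate the same pipeline under several seeded permutations of a stream and report the spread. The sketch below is an illustration only: the `(family, difficulty)` task encoding, `toy_accuracy`, and `order_sensitivity` are hypothetical and are not taken from the paper.

```python
import random
import statistics
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Task = Tuple[str, int]  # hypothetical encoding: (family, difficulty)


def toy_accuracy(ordered: Sequence[Task]) -> float:
    """Toy stand-in for an agent with memory: a task counts as solved if it has
    difficulty 1, or if an easier task of the same family appeared earlier."""
    easiest_seen: Dict[str, int] = {}
    solved = 0
    for family, difficulty in ordered:
        if difficulty == 1 or easiest_seen.get(family, difficulty) < difficulty:
            solved += 1
        easiest_seen[family] = min(difficulty, easiest_seen.get(family, difficulty))
    return solved / len(ordered)


def order_sensitivity(tasks: Sequence[Task],
                      evaluate: Callable[[Sequence[Task]], float],
                      seeds: Iterable[int]) -> Dict[str, float]:
    """Score the same task set under several seeded shuffles and summarise."""
    scores: List[float] = []
    for seed in seeds:
        shuffled = list(tasks)
        random.Random(seed).shuffle(shuffled)  # seeded, so the order is reproducible
        scores.append(evaluate(shuffled))
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


tasks = [("math", 1), ("math", 2), ("math", 3), ("qa", 1), ("qa", 2)]
curriculum = toy_accuracy(tasks)                  # easy-to-hard order
reversed_order = toy_accuracy(list(reversed(tasks)))
report = order_sensitivity(tasks, toy_accuracy, seeds=range(5))
print(curriculum, reversed_order, report)
```

A large gap between the easy-to-hard score and the shuffled mean would suggest that the stream order itself, rather than the memory module, is driving part of the result.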
minor comments (1)
- [Abstract] The abstract states that results are obtained across 10 datasets but provides no quantitative highlights, error bars, or dataset names; adding one or two key performance numbers would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We agree that the two major points raised require clarification and additional experiments to strengthen the manuscript. We will revise the paper accordingly and provide point-by-point responses below.
Point-by-point responses
-
Referee: Results/Evaluation section: The manuscript reports gains for ReMem over ExpRAG and the other modules but does not include a controlled ablation that disables memory updates while preserving the same number of reasoning and action steps (and thus inference budget). Without this isolation, it remains possible that observed improvements arise from extra tokens or passes rather than the self-evolution mechanism itself.
Authors: We agree that isolating the contribution of memory updates is essential. In the revised manuscript, we will add a controlled ablation study in which memory updates are disabled while keeping the exact same number of reasoning and action steps (and thus the same inference budget) as the full ReMem pipeline. This will allow direct comparison to confirm that performance gains arise from the self-evolution mechanism rather than additional computational steps.
revision: yes
-
Referee: Benchmark construction: The description of how the 10 datasets are restructured into sequential task streams lacks explicit controls or statistics on task ordering, difficulty progression, and potential selection biases that could inadvertently favor modules with particular update heuristics.
Authors: We acknowledge the need for greater transparency in benchmark construction. We will expand the relevant section to include: (1) explicit statistics on task ordering and difficulty progression across the sequential streams, (2) details on how datasets were selected and restructured, and (3) an analysis of potential selection biases together with controls to ensure that no memory module is inadvertently advantaged by the stream construction.
revision: yes
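The budget-matched ablation discussed in the first response can be sketched as a single loop with a switch that skips only the memory write. This is a hedged illustration, not the authors' protocol: `CountingModel`, its `think` and `refine` methods, and `run` are hypothetical, and the budget is measured in model calls rather than tokens.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class CountingModel:
    """Hypothetical stand-in for an LLM; it only counts how often it is called."""
    calls: int = 0

    def think(self, task: str, memory: Dict[str, str]) -> str:
        self.calls += 1
        return memory.get(task, "unknown")  # answer from memory if a note exists

    def refine(self, task: str, feedback: str) -> str:
        self.calls += 1
        return feedback  # the note that would be written to memory


def run(stream: Sequence[Tuple[str, str]], update_memory: bool) -> Tuple[float, int]:
    """Run the same think-then-refine loop; only the memory write is switchable."""
    model = CountingModel()
    memory: Dict[str, str] = {}
    correct = 0
    for task, gold in stream:
        answer = model.think(task, memory)
        note = model.refine(task, gold)  # always executed, so the budget stays equal
        if update_memory:
            memory[task] = note          # the only step the ablation removes
        correct += answer == gold
    return correct / len(stream), model.calls


stream = [("a", "1"), ("b", "2"), ("a", "1"), ("b", "2")]
full_acc, full_calls = run(stream, update_memory=True)
ablated_acc, ablated_calls = run(stream, update_memory=False)
print(full_acc, full_calls, ablated_acc, ablated_calls)
```

Because both runs make the same number of model calls, any accuracy gap in this toy setting comes from the memory write alone. A real ablation would also need to match prompt length and token counts.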
Circularity Check
No circularity: empirical benchmark and pipeline on external datasets
Full rationale
The paper introduces Evo-Memory as a streaming benchmark constructed from standard multi-turn and single-turn datasets, unifies existing memory modules for comparison, and proposes ReMem as an empirical action-think-memory pipeline. No equations, fitted parameters, or self-citations reduce the reported gains to quantities defined by the same inputs or by the authors' prior work. All evaluations rely on externally defined task streams and on baselines such as ExpRAG, and the central claims rest on comparative performance rather than on any self-referential derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Statefulness via memory is essential for LLM agents performing long-term planning and problem-solving
Forward citations
Cited by 24 Pith papers
-
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
-
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
-
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
AgentPSO evolves reusable multi-agent reasoning skills via PSO-inspired natural-language updates, outperforming static agents and test-time multi-agent baselines on math and general reasoning tasks with cross-benchmar...
-
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
-
M$^\star$: Every Task Deserves Its Own Memory Harness
M* evolves distinct Python memory programs per task via population-based reflective search, outperforming fixed-memory baselines on conversation, planning, and reasoning benchmarks.
-
Context Training with Active Information Seeking
Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...
-
Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation
Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution und...
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
-
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing t...
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...
-
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
Bian Que is an agentic framework using a unified operational paradigm, flexible Skill Arrangement, and self-evolving mechanism to automate O&M tasks, achieving 75% alert reduction and over 50% MTTR cut in production d...
-
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
Bian Que deploys an agentic system with flexible skills and self-evolution on a major e-commerce search engine, cutting alerts by 75%, reaching 80% root-cause accuracy, and halving resolution time.
-
From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17...
-
Evo-MedAgent: Beyond One-Shot Diagnosis with Agents That Remember, Reflect, and Improve
Evo-MedAgent adds three evolving memory stores to LLM agents for chest X-ray diagnosis, raising MCQ accuracy from 0.68 to 0.79 on GPT-5-mini and 0.76 to 0.87 on Gemini-3 Flash without any training.
-
MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought
MemCoT redefines long-context reasoning as iterative stateful search with zoom-in/zoom-out memory perception and dual short-term memories, claiming SOTA results on LoCoMo and LongMemEval-S benchmarks.
-
Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems
LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
ActionNex: A Virtual Outage Manager for Cloud Computing
ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.