pith. machine review for the scientific record.

arxiv: 2509.25140 · v2 · submitted 2025-09-29 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links


ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory


Pith reviewed 2026-05-15 05:38 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords ReasoningBank · agent memory · reasoning strategies · test-time scaling · self-evolving agents · experience scaling · LLM agents · web browsing benchmarks

The pith

ReasoningBank lets LLM agents distill generalizable strategies from both successes and failures to improve on new tasks over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ReasoningBank as a memory system that extracts reusable reasoning strategies from an agent's own self-judged successes and failures rather than storing raw interaction logs. At test time the agent retrieves these memories to guide its behavior and then folds fresh experiences back into the bank, creating cumulative improvement across streams of tasks. The work pairs this memory with memory-aware test-time scaling (MaTTS), which spends extra compute per task to generate diverse experience sets; these in turn supply stronger contrastive signals for synthesizing better memories, closing a self-reinforcing loop.
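The retrieve-act-judge-distill loop described above can be sketched minimally. The title/description/content memory schema and the word-overlap retriever below are illustrative stand-ins, not the paper's implementation (which likely uses embedding-based retrieval and LLM-driven distillation):

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    title: str        # short name of the distilled strategy
    description: str  # one-line summary used for retrieval matching
    content: str      # the actionable strategy itself

@dataclass
class ReasoningBank:
    items: list = field(default_factory=list)

    def retrieve(self, task: str, k: int = 2) -> list:
        # Toy relevance score: word overlap between the task and each
        # item's description. A stand-in for embedding similarity that
        # keeps the sketch dependency-free.
        def score(item):
            return len(set(task.lower().split())
                       & set(item.description.lower().split()))
        return sorted(self.items, key=score, reverse=True)[:k]

    def integrate(self, new_items: list) -> None:
        self.items.extend(new_items)

def run_task(bank, task, act, judge, distill):
    """One cycle: retrieve -> act -> self-judge -> distill -> integrate."""
    memories = bank.retrieve(task)     # guide behavior with past strategies
    trajectory = act(task, memories)   # agent rollout
    success = judge(task, trajectory)  # self-judged label, no oracle
    bank.integrate(distill(task, trajectory, success))
    return trajectory, success
```

The key structural point the sketch preserves is that both successful and failed trajectories pass through `distill` and enter the bank, which is what distinguishes the approach from success-only routine storage.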

Core claim

ReasoningBank distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences; at test time the agent retrieves relevant memories to shape its next actions and integrates the resulting learnings back into the bank. Memory-aware test-time scaling amplifies the process by allocating additional compute to each task, producing abundant diverse experiences that yield higher-quality memory entries through contrastive synthesis. The resulting memory in turn guides more effective scaling, establishing memory-driven experience scaling as a new dimension that lets agents self-evolve with emergent behaviors.

What carries the argument

ReasoningBank, a memory store of distilled reasoning strategies drawn from both successes and failures, retrieved at test time to inform actions and updated with new learnings, together with memory-aware test-time scaling that generates diverse contrastive experiences to improve memory quality.
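A minimal sketch of the parallel MaTTS variant as summarized above, assuming k independent rollouts per task and the self-judged success/failure split as the contrastive signal; every function hook here is hypothetical:

```python
def matts(task, rollout, judge, synthesize, k=4):
    """Memory-aware test-time scaling, parallel variant (sketched):
    allocate extra compute as k independent rollouts, then use the
    self-judged success/failure split as the contrastive signal."""
    experiences = [rollout(task) for _ in range(k)]
    labeled = [(e, judge(task, e)) for e in experiences]
    successes = [e for e, ok in labeled if ok]
    failures = [e for e, ok in labeled if not ok]
    # Contrastive synthesis: patterns present in successes but absent
    # from failures are the strongest candidates for new memory items.
    return synthesize(successes, failures)
```

The claimed synergy lives in the coupling: a better bank improves each `rollout`, and more rollouts give `synthesize` a richer contrast set.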

If this is right

  • Agents using ReasoningBank outperform those that store raw trajectories or only successful routines on web-browsing and software-engineering benchmarks.
  • Allocating extra compute via MaTTS produces richer experience sets that synthesize higher-quality memories and accelerate capability growth.
  • Memory-driven experience scaling emerges as a distinct scaling axis that compounds with existing test-time compute scaling.
  • Accumulated memories enable agents to avoid repeating past errors and exhibit emergent self-improvement behaviors across sequential tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrastive-memory loop could be applied to domains with long task sequences where forgetting prior constraints is costly, such as multi-step scientific workflows.
  • If self-judgment noise is high, the framework may require an external verifier step before memory ingestion to prevent drift.
  • Memory retrieval could be extended with explicit uncertainty estimates so the agent knows when to trust stored strategies versus falling back to base reasoning.
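The last extension could be sketched as a gate in front of retrieval; the similarity threshold, support count, and item schema here are entirely hypothetical, not anything the paper proposes:

```python
def guarded_retrieve(bank_items, task, similarity, min_sim=0.6, min_support=2):
    """Hypothetical uncertainty gate: trust a stored strategy only when
    retrieval similarity is high and the strategy has prior support;
    otherwise return None to signal fallback to base reasoning."""
    scored = [(similarity(task, it["description"]), it) for it in bank_items]
    scored.sort(key=lambda p: p[0], reverse=True)
    if (not scored
            or scored[0][0] < min_sim
            or scored[0][1].get("support", 0) < min_support):
        return None  # agent falls back to base reasoning
    return scored[0][1]
```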

Load-bearing premise

An agent's own judgment of whether an outcome counts as success or failure supplies reliable signals that can be turned into strategies that transfer usefully to new tasks.

What would settle it

An experiment in which agents equipped with ReasoningBank show no gain or outright worse performance than raw-trajectory or success-only baselines on a held-out task distribution after several cycles of memory use and update.

read the original abstract

With the growing adoption of large language model agents in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from the accumulated interaction history, forcing them to discard valuable insights and repeat past errors. We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences. At test time, an agent retrieves relevant memories from ReasoningBank to inform its interaction and then integrates new learnings back, enabling it to become more capable over time. Building on this powerful experience learner, we further introduce memory-aware test-time scaling (MaTTS), which accelerates and diversifies this learning process by scaling up the agent's interaction experience. By allocating more compute to each task, the agent generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. The better memory in turn guides more effective scaling, establishing a powerful synergy between memory and test-time scaling. Across web browsing and software engineering benchmarks, ReasoningBank consistently outperforms existing memory mechanisms that store raw trajectories or only successful task routines, improving both effectiveness and efficiency; MaTTS further amplifies these gains. These findings establish memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve with emergent behaviors naturally arise. Our code can be found at https://github.com/google-research/reasoning-bank.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ReasoningBank, a memory framework for LLM agents that distills generalizable reasoning strategies from self-judged successful and failed task experiences. Agents retrieve relevant memories at test time to inform interactions and integrate new learnings back into the bank. It further introduces memory-aware test-time scaling (MaTTS) to generate abundant diverse experiences via increased compute, creating a claimed synergy where better memory enables more effective scaling. Evaluations on web-browsing and software-engineering benchmarks report consistent outperformance over baselines storing raw trajectories or only successful routines, with MaTTS amplifying gains, establishing memory-driven experience scaling as a new dimension for agent self-evolution.

Significance. If the results hold under rigorous validation, the work is significant for introducing structured reasoning memory as a scalable mechanism for persistent agent improvement, distinct from raw trajectory storage. The MaTTS synergy and open-sourced code at the provided GitHub link are notable strengths that support reproducibility and further research on memory as a scaling axis.

major comments (2)
  1. [Experimental Evaluation] The central claim that self-judged success/failure labels produce reliable, transferable reasoning strategies (rather than noisy or biased signals) is load-bearing for the outperformance over raw-trajectory baselines, yet the manuscript reports no direct measurement of judgment accuracy, such as agreement with oracle success labels or human ratings, particularly in partially observable domains like web browsing and software engineering.
  2. [Results] The results section lacks ablations isolating the contribution of failure experiences (versus successes only) and does not report statistical significance, variance across runs, or full experimental protocol details, weakening support for the consistent benchmark gains and the claimed synergy with MaTTS.
minor comments (2)
  1. [Abstract] The abstract's final sentence has a grammatical issue ('enabling agents to self-evolve with emergent behaviors naturally arise') that reduces clarity; rephrase to 'enabling agents to self-evolve via emergent behaviors that naturally arise.'
  2. [Method] The distillation process in the method description would benefit from explicit pseudocode or example prompts showing how reasoning strategies are extracted from experiences to aid reproducibility.
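For the second minor comment, the kind of extraction prompt being requested might look like the following; it is purely illustrative and not taken from the paper:

```python
DISTILL_PROMPT = """\
You are given an agent trajectory for the task below, together with a
self-judged outcome label. Extract 1-3 reusable reasoning strategies.
Each strategy must generalize beyond this task: state WHEN to apply it,
WHAT to do, and WHY it helped (or, for a failure, what to avoid).

Task: {task}
Outcome: {outcome}
Trajectory:
{trajectory}

Return one strategy per line as: title | description | content
"""

def build_distill_prompt(task, outcome, trajectory_steps):
    # Purely illustrative formatting helper; the paper's actual prompt
    # and memory schema may differ.
    return DISTILL_PROMPT.format(task=task, outcome=outcome,
                                 trajectory="\n".join(trajectory_steps))
```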

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of experimental rigor. We address each major point below and will revise the manuscript to incorporate the suggested analyses and details.

read point-by-point responses
  1. Referee: [Experimental Evaluation] The central claim that self-judged success/failure labels produce reliable, transferable reasoning strategies (rather than noisy or biased signals) is load-bearing for the outperformance over raw-trajectory baselines, yet the manuscript reports no direct measurement of judgment accuracy, such as agreement with oracle success labels or human ratings, particularly in partially observable domains like web browsing and software engineering.

    Authors: We agree that direct measurement of self-judgment reliability would strengthen the central claim. In the revised version we will add a dedicated analysis section that (i) compares agent self-judged success labels against oracle ground-truth labels on the software-engineering tasks where verifiable outcomes exist, reporting agreement rates and confusion matrices, and (ii) presents human ratings on a random sample of web-browsing judgments (approximately 100 instances) to quantify reliability under partial observability. These additions will provide quantitative evidence on the quality of the distilled reasoning strategies. revision: yes

  2. Referee: [Results] The results section lacks ablations isolating the contribution of failure experiences (versus successes only) and does not report statistical significance, variance across runs, or full experimental protocol details, weakening support for the consistent benchmark gains and the claimed synergy with MaTTS.

    Authors: We acknowledge these gaps in the current presentation. The revised manuscript will include: (1) an explicit ablation comparing ReasoningBank (success + failure) against a success-only variant to isolate the value of failure-derived strategies; (2) mean and standard deviation across at least three independent runs for all main tables, together with paired t-test p-values against the strongest baseline; and (3) an expanded experimental-protocol appendix detailing retrieval hyperparameters, memory-update rules, MaTTS compute budgets, and random seeds. These changes will make the reported gains and the memory-scaling synergy more statistically robust. revision: yes
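The reliability analysis promised in response (i) reduces to an agreement rate plus a 2x2 confusion matrix over paired labels, which can be sketched as:

```python
def judgment_reliability(self_labels, oracle_labels):
    """Agreement rate and 2x2 confusion matrix between an agent's
    self-judged success labels and oracle ground truth. Labels are
    booleans (True = judged/actual success)."""
    assert len(self_labels) == len(oracle_labels)
    cm = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for s, o in zip(self_labels, oracle_labels):
        # tp: judged success, truly success; fp: judged success, truly
        # failure; fn: judged failure, truly success; tn: both failure.
        key = ("t" if s == o else "f") + ("p" if s else "n")
        cm[key] += 1
    agreement = (cm["tp"] + cm["tn"]) / len(self_labels)
    return agreement, cm
```

A high false-positive count here would be the direct signature of the drift risk the referee raises: failures mislabeled as successes would be distilled into the bank as positive strategies.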

Circularity Check

0 steps flagged

No circularity: empirical framework with external benchmarks

full rationale

The paper defines ReasoningBank through retrieval and distillation operating on external task outcomes and self-judged experiences, then reports benchmark gains over raw-trajectory baselines. No equations, fitted parameters, or self-citation chains reduce the claimed improvements to their own inputs by construction; the argument stands or falls on the stated web-browsing and software-engineering evaluations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that self-judged experiences yield extractable general strategies and on standard retrieval mechanisms whose hyperparameters are tuned to the reported benchmarks.

free parameters (1)
  • retrieval and distillation hyperparameters
    Parameters controlling memory selection and strategy extraction are tuned to achieve the reported benchmark improvements.
axioms (1)
  • domain assumption: Agent self-judgment of task success and failure supplies sufficiently accurate signals for distilling reusable strategies
    The framework description relies on this judgment step to separate useful from non-useful experiences.
invented entities (1)
  • ReasoningBank · no independent evidence
    purpose: Repository for distilled reasoning strategies
    New memory abstraction introduced to store and retrieve generalized strategies rather than raw trajectories.

pith-pipeline@v0.9.0 · 5612 in / 1387 out tokens · 44454 ms · 2026-05-15T05:38:32.096885+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  2. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

    cs.CL 2026-05 unverdicted novelty 7.0

    LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

  3. Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

    cs.AI 2026-05 unverdicted novelty 7.0

    DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data,...

  4. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.

  5. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...

  6. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  7. Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    Metacognitive Consolidation lets LLMs accumulate reusable meta-reasoning skills from past episodes to improve future performance across benchmarks.

  8. SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

    cs.CL 2026-05 unverdicted novelty 6.0

    SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.

  9. SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution

    cs.CL 2026-05 unverdicted novelty 6.0

    SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.

  10. Workspace Optimization: How to Train Your Agent

    cs.AI 2026-05 unverdicted novelty 6.0

    Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.

  11. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

  12. Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees...

  13. ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...

  14. Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...

  15. ReflectCAP: Detailed Image Captioning with Reflective Memory

    cs.AI 2026-04 unverdicted novelty 6.0

    ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-cov...

  16. Procedural Knowledge at Scale Improves Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...

  17. WorkflowGen: an adaptive workflow generation mechanism driven by trajectory experience

    cs.LG 2026-03 unverdicted novelty 6.0

    WorkflowGen reuses trajectory experiences via node-level and workflow-level extraction plus three-tier semantic routing to cut token use over 40% and raise success 20% on medium-similarity queries versus real-time pla...

  18. SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

    cs.CR 2026-05 unverdicted novelty 5.0

    SafeHarbor uses hierarchical memory with adversarial rule extraction and entropy-driven self-evolution to achieve over 93% refusal on harmful requests while reaching 63.6% benign utility on GPT-4o.

  19. Training-Free Test-Time Contrastive Learning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    TF-TTCL lets frozen LLMs adapt online by distilling textual rules from contrastive reasoning trajectories generated via multi-agent augmentation and applying them through retrieval-based steering.

  20. Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems

    cs.MA 2026-03 unverdicted novelty 5.0

    LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.

  21. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  22. ActionNex: A Virtual Outage Manager for Cloud Computing

    cs.AI 2026-04 unverdicted novelty 4.0

    ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · cited by 21 Pith papers

  1. [1]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  2. [2]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  3. [3]

    Rulin Shao and Rui Qiao and Varsha Kishore and Niklas Muennighoff and Xi Victoria Lin and Daniela Rus and Bryan Kian Hsiang Low and Sewon Min and Wen-tau Yih and Pang Wei Koh and Luke Zettlemoyer , booktitle =. Reason

  4. [4]

    MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations , url =

    Lumer, Elias and Gulati, Anmol and Subbiah, Vamse Kumar and Basavaraju, Pradeep Honaganahalli and Burke, James A , journal =. MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations , url =

  5. [5]

    Human-inspired Episodic Memory for Infinite Context

    Zafeirios Fountas and Martin Benfeghoul and Adnan Oomerjee and Fenia Christopoulou and Gerasimos Lampouras and Haitham Bou Ammar and Jun Wang , booktitle =. Human-inspired Episodic Memory for Infinite Context

  6. [6]

    Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press , bibsource =

    John Yang and Carlos E. Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik Narasimhan and Ofir Press , bibsource =. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering , url =. Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, V...

  7. [7]

    Autonomous Evaluation and Refinement of Digital Agents , url =

    Jiayi Pan and Yichi Zhang and Nicholas Tomlin and Yifei Zhou and Sergey Levine and Alane Suhr , booktitle =. Autonomous Evaluation and Refinement of Digital Agents , url =

  8. [8]

    Learning Wisdom from Errors: Promoting LLM's Continual Relation Learning through Exploiting Error Cases , url =

    Yin, Shaozhe and Guo, Jinyu and Shuang, Kai and Liu, Xia and Ou, Ruize , journal =. Learning Wisdom from Errors: Promoting LLM's Continual Relation Learning through Exploiting Error Cases , url =

  9. [9]

    Contextual Experience Replay for Self-Improvement of Language Agents , url =

    Liu, Yitao and Si, Chenglei and Narasimhan, Karthik R and Yao, Shunyu , booktitle =. Contextual Experience Replay for Self-Improvement of Language Agents , url =. doi:10.18653/v1/2025.acl-long.694 , editor =

  10. [10]

    Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning , url =

    Wang, Haozhe and Xu, Qixin and Liu, Che and Wu, Junhong and Lin, Fangzhen and Chen, Wenhu , journal =. Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning , url =

  11. [11]

    ArXiv preprint , title =

    Lee, Jinhyuk and Chen, Feiyang and Dua, Sahil and Cer, Daniel and Shanbhogue, Madhuri and Naim, Iftekhar and. ArXiv preprint , title =

  12. [12]

    SciAgents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning , url =

    Ghafarollahi, Alireza and Buehler, Markus J , journal =. SciAgents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning , url =

  13. [13]

    A survey on llm-as-a-judge , url =

    Gu, Jiawei and Jiang, Xuhui and Shi, Zhichao and Tan, Hexiang and Zhai, Xuehao and Xu, Chengjin and Li, Wei and Shen, Yinghan and Ma, Shengjie and Liu, Honghao and others , journal =. A survey on llm-as-a-judge , url =

  14. [14]

    SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience , url =

    Sun, Zeyi and Liu, Ziyu and Zang, Yuhang and Cao, Yuhang and Dong, Xiaoyi and Wu, Tong and Lin, Dahua and Wang, Jiaqi , journal =. SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience , url =

  15. [15]

    SWE-Exp: Experience-Driven Software Issue Resolution , url =

    Chen, Silin and Lin, Shaoxin and Gu, Xiaodong and Shi, Yuling and Lian, Heng and Yun, Longfei and Chen, Dong and Sun, Weiguo and Cao, Lin and Wang, Qianxiang , journal =. SWE-Exp: Experience-Driven Software Issue Resolution , url =

  16. [16]

    Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust , bibsource =

    Izzeddin Gur and Hiroki Furuta and Austin V. Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust , bibsource =. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis , url =. The Twelfth International Conference on Learning Representations,

  17. [17]

    Self-Refine: Iterative Refinement with Self-Feedback , url =

    Aman Madaan and Niket Tandon and Prakhar Gupta and Skyler Hallinan and Luyu Gao and Sarah Wiegreffe and Uri Alon and Nouha Dziri and Shrimai Prabhumoye and Yiming Yang and Shashank Gupta and Bodhisattwa Prasad Majumder and Katherine Hermann and Sean Welleck and Amir Yazdanbakhsh and Peter Clark , bibsource =. Self-Refine: Iterative Refinement with Self-Fe...

  18. [18]

    Hinton , bibsource =

    Ting Chen and Simon Kornblith and Mohammad Norouzi and Geoffrey E. Hinton , bibsource =. A Simple Framework for Contrastive Learning of Visual Representations , url =. Proceedings of the 37th International Conference on Machine Learning,

  19. [19]

    Narasimhan and Yuan Cao , bibsource =

    Shunyu Yao and Jeffrey Zhao and Dian Yu and Nan Du and Izhak Shafran and Karthik R. Narasimhan and Yuan Cao , bibsource =. ReAct: Synergizing Reasoning and Acting in Language Models , url =. The Eleventh International Conference on Learning Representations,

  20. [20]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , url =

    Comanici, Gheorghe and Bieber, Eric and Schaekermann, Mike and Pasupat, Ice and Sachdeva, Noveen and Dhillon, Inderjit and Blistein, Marcel and Ram, Ori and Zhang, Dan and Rosen, Evan and others , journal =. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , url =

  21. [21]

    Thinking vs

    Shen, Junhong and Bai, Hao and Zhang, Lunjun and Zhou, Yifei and Setlur, Amrith and Tong, Shengbang and Caples, Diego and Jiang, Nan and Zhang, Tong and Talwalkar, Ameet and others , journal =. Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction , url =

  22. [22]

    Agent kb: Leveraging cross-domain experience for agentic problem solving , url =

    Tang, Xiangru and Qin, Tianrui and Peng, Tianhao and Zhou, Ziyang and Shao, Daniel and Du, Tingting and Wei, Xinming and Xia, Peng and Wu, Fang and Zhu, He and others , journal =. Agent kb: Leveraging cross-domain experience for agentic problem solving , url =

  23. [23]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments , url =

    Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu , bibsource =. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real...

  24. [24]

    Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems , url =

    Liu, Bang and Li, Xinfeng and Zhang, Jiayi and Wang, Jinlin and He, Tanjin and Hong, Sirui and Liu, Hongzhang and Zhang, Shaokun and Song, Kaitao and Zhu, Kunlun and others , journal =. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems , url =

  25. [25]

    The Thirteenth International Conference on Learning Representations , title =

    Antonis Antoniades and Albert. The Thirteenth International Conference on Learning Representations , title =

  26. [26]

    Memp: Exploring Agent Procedural Memory , url =

    Fang, Runnan and Liang, Yuan and Wang, Xiaobin and Wu, Jialong and Qiao, Shuofei and Xie, Pengjun and Huang, Fei and Chen, Huajun and Zhang, Ningyu , journal =. Memp: Exploring Agent Procedural Memory , url =

  27. [27]

    Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R

    Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R. Narasimhan , bibsource =. SWE-bench: Can Language Models Resolve Real-world Github Issues? , url =. The Twelfth International Conference on Learning Representations,

  28. [28]

    2025 , issue_date =

    Zhang, Zeyu and Dai, Quanyu and Bo, Xiaohe and Ma, Chen and Li, Rui and Chen, Xu and Zhu, Jieming and Dong, Zhenhua and Wen, Ji-Rong , title =. 2025 , issue_date =. doi:10.1145/3748302 , journal =

  29. [29]

    MemoryBank: Enhancing large language models with long-term memory

    Wanjun Zhong and Lianghong Guo and Qiqi Gao and He Ye and Yanlin Wang , bibsource =. MemoryBank: Enhancing Large Language Models with Long-Term Memory , url =. Thirty-Eighth. doi:10.1609/AAAI.V38I17.29946 , editor =

  30. [30]

    Beyond Goldfish Memory: Long-Term Open-Domain Conversation , url =

    Xu, Jing and Szlam, Arthur and Weston, Jason , booktitle =. Beyond Goldfish Memory: Long-Term Open-Domain Conversation , url =. doi:10.18653/v1/2022.acl-long.356 , editor =

  31. [31]

    Investigate-consolidate-exploit: A general strategy for inter-task agent self-evolution , url =

    Qian, Cheng and Liang, Shihao and Qin, Yujia and Ye, Yining and Cong, Xin and Lin, Yankai and Wu, Yesai and Liu, Zhiyuan and Sun, Maosong , journal =. Investigate-consolidate-exploit: A general strategy for inter-task agent self-evolution , url =

  32. [32]

    ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning , url =

    Xiangru Tang and Tianyu Hu and Muyang Ye and Yanjun Shao and Xunjian Yin and Siru Ouyang and Wangchunshu Zhou and Pan Lu and Zhuosheng Zhang and Yilun Zhao and Arman Cohan and Mark Gerstein , booktitle =. ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning , url =

  33. [33]

    MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents , url =

    Zhou, Zijian and Qu, Ao and Wu, Zhaoxuan and Kim, Sunghwan and Prakash, Alok and Rus, Daniela and Zhao, Jinhua and Low, Bryan Kian Hsiang and Liang, Paul Pu , journal =. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents , url =

  34. [34]

    MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent , url =

    Yu, Hongli and Chen, Tinghong and Feng, Jiangtao and Chen, Jiangjie and Dai, Weinan and Yu, Qiying and Zhang, Ya-Qin and Ma, Wei-Ying and Liu, Jingjing and Wang, Mingxuan and others , journal =. MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent , url =

  35. [35]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez.

  36. [36]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav.

  37. [37]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang.

  38. [38]

M+: Extending Memory

Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He.

  39. [39]

Mind2Web: Towards a Generalist Agent for the Web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samual Stevens, Boshi Wang, Huan Sun, and Yu Su. Advances in Neural Information Processing Systems 36 (NeurIPS 2023).

  40. [40]

Evaluating Memory in

Yuanzhe Hu, Yu Wang, and Julian McAuley.

  41. [41]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu.

  42. [42]

Evaluating Very Long-Term Conversational Memory of

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. ACL 2024. doi:10.18653/v1/2024.acl-long.747

  43. [43]

In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Rajan Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, and Tomas Pfister.

  44. [44]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. 2025.

  45. [45]

Reinforcement Learning: An Introduction

Richard S. Sutton, Andrew G. Barto, et al.

  46. [46]

Scaling Test-time Compute for LLM Agents

King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, et al.

  47. [47]

Two Heads Are Better Than One: Test-time Scaling of Multi-Agent Collaborative Reasoning

Can Jin, Hongwu Peng, Qixin Zhang, Yujin Tang, Dimitris N. Metaxas, and Tong Che.

  48. [48]

Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, and Zhou Yu. Ex

  49. [49]

Scaling Test-Time Compute Without Verification or

Amrith Setlur, Nived Rajaraman, Sergey Levine, and Aviral Kumar.

  50. [50]

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang.

  51. [51]

Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Aviral Kumar, Rishabh Agarwal, Sridhar Thiagarajan, Craig Boutilier, and Aleksandra Faust.

  52. [52]

s1: Simple Test-time Scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto. EMNLP 2025. doi:10.18653/v1/2025.emnlp-main.1025

  53. [53]

Z1: Efficient Test-time Scaling with Code

Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, and Xiao-Ping Zhang.

  54. [54]

S*: Test-Time Scaling for Code Generation

Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, and Ion Stoica.

  55. [55]

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling

  56. [56]

Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control

Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. The Twelfth International Conference on Learning Representations (ICLR 2024).

  57. [57]

Agent Workflow Memory

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig.

  58. [58]

WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, et al.

  59. [59]

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexandre Drouin, et al. Transactions on Machine Learning Research.

  60. [60]

WebArena:

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. The Twelfth International Conference on Learning Representations (ICLR 2024).

  61. [61]

Deep Learning

Ruslan Salakhutdinov. The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2014). doi:10.1145/2623330.2630809

  62. [62]

Inducing Programmatic Skills for Agentic Tasks

Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, and Daniel Fried.

  63. [63]

A Survey on Large Language Model Based Autonomous Agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. doi:10.1007/s11704-024-40231-1

  64. [64]

StreamBench: Towards Benchmarking Continuous Improvement of Language Agents

Cheng et al. Advances in Neural Information Processing Systems 38 (NeurIPS 2024).

  65. [65]

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Transactions on Machine Learning Research, 2026.

  66. [66]

RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

Tomoyuki Kagaya, Thong Jing Yuan, Yuxuan Lou, Jayashree Karlekar, Sugiri Pranata, Akira Kinose, Koki Oguri, Felix Wick, and Yang You.

  67. [67]

MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation

Yi Kong, Dianxi Shi, Guoli Yang, Chenlin Huang, Xiaopeng Li, Songchang Jin, et al.

  68. [68]

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, et al. ExpeL. Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2024). doi:10.1609/AAAI.V38I17.29936

  69. [69]

In-Context Principle Learning from Mistakes

Tianjun Zhang, Aman Madaan, Luyu Gao, Steven Zheng, Swaroop Mishra, Yiming Yang, Niket Tandon, and Uri Alon. Forty-first International Conference on Machine Learning (ICML 2024).

  70. [70]

MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, Jiaxin Pei, Julian McAuley, Yejin Choi, and Alex Pentland.

  71. [71]

MobileSteward: Integrating Multiple App-Oriented Agents with Self-Evolution to Automate Cross-App Instructions

Yuxuan Liu, Hongda Sun, Wei Liu, Jian Luan, Bo Du, and Rui Yan. doi:10.1145/3690624.3709171

  72. [72]

Learn-by-Interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments

Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, and Sercan O. Arik.

  73. [73]

Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou.

  74. [74]

No Need for Explanations: LLMs Can Implicitly Learn from Mistakes In-Context

Lisa Alazraki, Maximilian Mozes, Jon Ander Campos, Tan Yi-Chern, Marek Rei, and Max Bartolo. EMNLP 2025. doi:10.18653/v1/2025.emnlp-main.1686

  75. [75]

AutoGuide: Automated Generation and Selection of Context-Aware Guidelines for Large Language Model Agents

Yao Fu et al. Advances in Neural Information Processing Systems 38 (NeurIPS 2024).

  76. [76]

Self-Evolving Agents with Reflective and Memory-Augmented Abilities

Xuechen Liang, Yangfan He, Yinghui Xia, Xinyuan Song, Jianhui Wang, Meiling Tao, Li Sun, Xinhang Yuan, Jiayi Su, Keqin Li, et al.

  77. [77]

PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes

Xinliang Frederick Zhang, Nick Beauchamp, and Lu Wang.

  78. [78]

Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, and Ping Luo. ACL 2025. doi:10.18653/v1/2025.acl-long.1575

  79. [79]

MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, et al.

  80. [80]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al.

Showing first 80 references.