pith. machine review for the scientific record.

arxiv: 2605.01567 · v1 · submitted 2026-05-02 · 💻 cs.SE · cs.CL · cs.LG

Recognition: unknown

Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 13:57 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.LG
keywords RL coding agents · developer memory · Model Context Protocol · contextual bandit · off-policy evaluation · safety gates · feedback normalization · reinforcement learning

The pith

A safety-gated memory architecture for RL coding agents reaches 80 percent expected-decision accuracy and full hard-negative suppression on a 200-case benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RL Developer Memory, a local-first architecture that treats memory selection for reinforcement-learning coding agents as a logged decision process. It ranks candidate memories, maps feedback labels to bounded rewards, and links verified resolutions back to earlier retrievals, targeting a setting where small details can alter Bellman targets and validation claims and where static vector stores and generic RAG fall short. A deterministic ranker stays in control, while a contextual-bandit policy runs only in shadow mode and can affect outputs only after conservative off-policy-evaluation gates pass. On a benchmark built from RL algorithm bugs, hard negatives, review-gated cases, and low-risk failures, both the deterministic and full configurations deliver 80 percent expected accuracy and 100 percent hard-negative suppression, with the full version adding telemetry rather than accuracy gains. The authors frame the work as an auditable control system with explicit deployment boundaries rather than a universal performance claim.
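
To make the logged-decision framing concrete, here is a minimal sketch, in Python, of what a retrieval event recorded by issue_match could look like. Apart from the tool name itself, every field and function name here is a hypothetical reconstruction; the paper does not publish this schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
import time
import uuid

# Hypothetical schema for one logged retrieval decision. The event_id is what
# would let issue_feedback and issue_record_resolution refer back to this event.
@dataclass
class RetrievalEvent:
    event_id: str                  # stable handle for later feedback/resolution linking
    query_profile: Dict[str, str]  # normalized issue context
    ranked: List[str]              # candidate memory ids in rank order
    chosen: str                    # the memory actually surfaced to the agent
    propensity: float              # logged P(chosen | context), needed later for OPE
    timestamp: float

DECISION_LOG: List[RetrievalEvent] = []

def issue_match(profile: Dict[str, str],
                candidates: List[str],
                score: Callable[[Dict[str, str], str], float]) -> RetrievalEvent:
    """Rank candidate memories deterministically and log the decision."""
    ranked = sorted(candidates, key=lambda m: score(profile, m), reverse=True)
    event = RetrievalEvent(
        event_id=str(uuid.uuid4()),
        query_profile=profile,
        ranked=ranked,
        chosen=ranked[0],
        propensity=1.0,  # a deterministic ranker picks its top memory with certainty
        timestamp=time.time(),
    )
    DECISION_LOG.append(event)
    return event
```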

Core claim

The paper establishes that feedback-normalized developer memory, implemented as an MCP-native architecture with issue_match ranking, issue_feedback reward mapping, and issue_record_resolution linking, enables reliable memory use for RL coding agents. Deterministic control and the full shadow/OPE configuration both achieve 80.0 percent expected-decision accuracy and 100.0 percent hard-negative suppression on the 200-case benchmark, while static validation passes 11/11 checks and dynamic integration passes 10/10 cases. The system requires theory-to-code metadata and review-gated governance for RL memories and reports that active learned-policy deployment and full MCP interoperability remain unsupported.
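
As an illustration of the feedback-normalization step, here is a minimal sketch of a label-to-reward mapping of the kind issue_feedback performs. The label set and reward values are assumptions; the paper states only that raw labels map to bounded rewards.

```python
# Hypothetical label set and reward values; the paper does not publish either,
# only that issue_feedback maps raw labels to bounded rewards.
FEEDBACK_REWARDS = {
    "resolved": 1.0,      # memory led to a verified fix
    "helpful": 0.5,       # memory was useful but not decisive
    "irrelevant": -0.25,  # memory was off-topic noise
    "harmful": -1.0,      # memory pushed the agent toward a wrong fix
}

def issue_feedback(raw_label: str) -> float:
    """Map a raw feedback label to a reward clipped to [-1, 1]."""
    reward = FEEDBACK_REWARDS.get(raw_label, 0.0)  # unknown labels are neutral
    return max(-1.0, min(1.0, reward))
```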

What carries the argument

The safety-gated MCP architecture: a deterministic ranker in production, a contextual-bandit residual policy in shadow mode, and conservative OPE gates that permit canary influence only after evaluation passes.
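
One way such a gate could be realized is an inverse-propensity-score (IPS) estimate of the shadow policy's value with a conservative lower confidence bound; the sketch below assumes this, though the paper does not say which estimator or bound its gate uses.

```python
import math
from typing import List, Tuple

def ips_gate(logged: List[Tuple[float, float, float]],
             baseline_value: float,
             z: float = 2.58) -> bool:
    """Conservative OPE gate over logged (reward, shadow_prob, logging_prob) triples.

    Authorizes the canary only if the lower confidence bound on the shadow
    policy's estimated value still beats the deterministic baseline.
    """
    weights = [r * (p_shadow / p_log) for r, p_shadow, p_log in logged]
    n = len(weights)
    if n == 0:
        return False  # no logged evidence, no canary
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / max(n - 1, 1)
    lower_bound = mean - z * math.sqrt(var / n)  # z=2.58 ~ 99.5% one-sided normal bound
    return lower_bound > baseline_value
```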

If this is right

  • Deterministic control alone achieves 80.0 percent expected-decision accuracy and 100.0 percent hard-negative suppression.
  • The full shadow/OPE configuration matches the deterministic accuracy while adding learning telemetry.
  • Static validation passes all 11 checks and dynamic integration passes all 10 cases.
  • RL and control memories require theory-to-code metadata plus review-gated governance.
  • Active learned-policy deployment and official-client MCP interoperability remain unsupported.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gating approach could be tested on non-RL coding agents that still need auditable memory across long repository sessions.
  • A larger and more diverse benchmark drawn from actual developer workflows would clarify whether the current 40 residual non-RL failures are representative.
  • Extending the telemetry to influence future Bellman updates directly could close the loop between memory decisions and policy improvement (a minimal sketch of this loop follows below).
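
The loop-closing idea in the last bullet can be made concrete with a small, entirely hypothetical sketch: a resolution linked back to a retrieval event contributes a bounded credit term to a one-step TD target. The paper stops at telemetry and does not implement such an update.

```python
from typing import Dict

def resolution_reward(resolutions: Dict[str, bool], event_id: str) -> float:
    """+1 if a verified fix was linked to this retrieval event, else 0."""
    return 1.0 if resolutions.get(event_id, False) else 0.0

def td_target(r_env: float, r_memory: float, gamma: float,
              v_next: float, memory_weight: float = 0.1) -> float:
    """One-step Bellman target with a small, bounded memory-credit term mixed in."""
    return (r_env + memory_weight * r_memory) + gamma * v_next
```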

Load-bearing premise

The 200-case deterministic benchmark, built around RL algorithm bugs, hard negatives, and review-gated cases, sufficiently represents the distribution of real long software-engineering episodes where memory details affect learning targets or validation claims.

What would settle it

A live full-configuration deployment in which the learned policy is activated and either accuracy drops below 80 percent, hard-negative suppression fails, or latency regresses beyond the reported limits.

Figures

Figures reproduced from arXiv: 2605.01567 by Mehmet Iscan.

Figure 1. System architecture. The deterministic ranker remains the deployed baseline; the contextual bandit runs in shadow mode and only supplies logged propensities and bounded residual scores unless the OPE gate authorizes a conservative low-risk canary. The live retrieval path begins with issue_match. The query is normalized into a structured profile, candidate memories are retrieved, a deterministic ranker orde…

Figure 2. Benchmark composition by algorithm family. The 40 non-RL cases are retained because the system is a developer-memory architecture, not only an RL-bug classifier.

Figure 3. Same-commit deterministic control versus full configuration. Decision metrics are unchanged, while the full configuration writes contextual-statistics telemetry.

Figure 4. Patch-replay raw deltas. These deltas are retained for engineering transparency but are not treated as controlled ablation evidence. Suppression changed from 97.5% to 100.0%; online and live recall deltas moved from negative values to 0.0; contextual-statistics update rate and live issue_match success were unchanged. The claim gate reports no metric as fully improved and marks latency as regressed.

Figure 5. Live tool-level p95 latency. The largest full-configuration increase is in metrics/reporting and total-case time, while issue-match p95 increases modestly.
Original abstract

Large language model (LLM) coding agents increasingly operate over repositories, terminals, tests, and execution traces across long software-engineering episodes. Persistent memory is useful, but static vector stores or generic retrieval-augmented generation (RAG) are insufficient for reinforcement-learning (RL) code development, where small details can alter Bellman targets, terminal masks, gradient flow, or validation claims. This paper presents RL Developer Memory, a local-first, Model Context Protocol (MCP)-native developer-memory architecture for RL coding agents. It treats memory selection as a logged contextual decision process: issue_match ranks candidates and records telemetry, issue_feedback maps raw labels to bounded rewards, and issue_record_resolution links verified resolutions to earlier retrieval events. A deterministic ranker remains deployed, while a contextual-bandit residual policy runs in shadow mode and can affect canary behavior only through conservative off-policy-evaluation (OPE) gates. RL/control memories require theory-to-code metadata and review-gated governance. The system is evaluated on a deterministic 200-case benchmark with RL algorithm bugs, hard negatives, review-gated RL/control cases, and low-risk failures. In the same-commit comparison, deterministic control and full shadow/OPE both achieve 80.0% expected-decision accuracy and 100.0% hard-negative suppression; the full configuration adds learning telemetry rather than accuracy gain. Static validation passed 11/11 checks; dynamic integration passed 10/10 cases. The evidence reports limits: active learned-policy deployment and official-client MCP interoperability are unsupported, live full-configuration latency regresses, and 40 residual non-RL failures remain. The contribution is an auditable memory-control architecture with explicit claim boundaries, not a universal coding-agent improvement claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces RL Developer Memory, a local-first MCP-native architecture for persistent memory in RL coding agents. Memory selection is modeled as a logged contextual decision process with issue_match for ranking and telemetry, issue_feedback for mapping labels to bounded rewards, and issue_record_resolution for linking verified outcomes. A deterministic ranker operates in production while a contextual-bandit residual policy runs in shadow mode, subject to conservative OPE gates and review-gated governance for RL/control memories. The system is evaluated on a deterministic 200-case benchmark containing RL algorithm bugs, hard negatives, review-gated cases, and low-risk failures. In same-commit comparisons, both deterministic control and full shadow/OPE configurations achieve 80.0% expected-decision accuracy and 100.0% hard-negative suppression; the full configuration supplies learning telemetry without accuracy improvement. Static validation passes 11/11 checks and dynamic integration passes 10/10 cases, with explicit disclaimers that active learned-policy deployment, official-client MCP interoperability, and latency improvements are unsupported.

Significance. If the reported benchmark results hold, the work supplies a practical, auditable pattern for safely incorporating memory and residual learning into LLM coding agents. Its emphasis on explicit safety gates (shadow mode, OPE, review gating), transparent synthetic benchmark construction, and bounded claims (performance parity rather than superiority, plus listed limitations) provides a useful template for responsible engineering in the RL-for-code domain. The contribution is primarily architectural and governance-oriented rather than a universal performance advance.

major comments (1)
  1. [Evaluation] Benchmark description: While the abstract states that the 200-case benchmark includes RL algorithm bugs, hard negatives, review-gated cases, and low-risk failures, the manuscript provides no quantitative breakdown of case construction, no difficulty distribution, and no account of how hard negatives were defined and validated. This detail is load-bearing for interpreting the 80% accuracy and 100% suppression figures as representative of the targeted RL memory issues.
minor comments (2)
  1. [Evaluation] The abstract and evaluation sections would benefit from a brief table summarizing the 200 cases by category (e.g., bug type, risk level) to improve reproducibility without expanding scope.
  2. Notation for the contextual-bandit residual policy and OPE gates could be clarified with a short pseudocode listing of the decision flow, as the textual description of shadow-mode gating is dense (a minimal decision-flow sketch follows below).
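
In the spirit of the second minor comment, a minimal decision-flow sketch follows. All names are hypothetical and the paper's actual control flow may differ; this only illustrates the shadow/canary split described in the text.

```python
from typing import Callable, Dict, List

def clamp(x: float, lo: float = -0.2, hi: float = 0.2) -> float:
    """Bound the residual so the shadow policy can only nudge the ranking."""
    return max(lo, min(hi, x))

def handle_query(profile: dict,
                 candidates: List[str],
                 det_score: Callable[[dict, str], float],
                 shadow_score: Callable[[dict, str], float],
                 ope_gate_passed: bool,
                 low_risk: bool,
                 shadow_log: List[dict]) -> str:
    # 1. The deterministic ranker always produces the deployed answer.
    deployed = max(candidates, key=lambda m: det_score(profile, m))

    # 2. The shadow bandit scores the same candidates; its scores are logged
    #    as telemetry but never reach the agent while it stays in shadow mode.
    scores: Dict[str, float] = {m: shadow_score(profile, m) for m in candidates}
    shadow_log.append({"profile": profile, "scores": scores})

    # 3. Only after the conservative OPE gate has passed may a bounded
    #    residual influence a low-risk canary slice of traffic.
    if ope_gate_passed and low_risk:
        return max(candidates,
                   key=lambda m: det_score(profile, m) + clamp(scores[m]))
    return deployed
```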

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the recommendation of minor revision. The single major comment concerns the lack of quantitative detail in the benchmark description; we address it directly below.

Point-by-point responses
  1. Referee: While the abstract states that the 200-case benchmark includes RL algorithm bugs, hard negatives, review-gated cases, and low-risk failures, the manuscript provides no quantitative breakdown of case construction, difficulty distribution, or how hard negatives were defined and validated. This detail is load-bearing for interpreting the 80% accuracy and 100% suppression figures as representative of the targeted RL memory issues.

    Authors: We agree that the current manuscript does not supply a quantitative breakdown of the 200-case benchmark composition, the definition and validation criteria for hard negatives, or the difficulty distribution. This information would improve the interpretability of the reported metrics. In the revised version we will add a new subsection (or table) under Evaluation that enumerates the case counts by category, describes the deterministic construction process, specifies how hard negatives were identified and validated, and reports any available difficulty or coverage statistics. These additions will be placed before the accuracy and suppression results so readers can assess representativeness.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript describes an MCP-native memory architecture for RL coding agents, with components (issue_match ranking, issue_feedback reward mapping, issue_record_resolution linking) presented as design choices rather than derived quantities. Core performance numbers (80% expected-decision accuracy, 100% hard-negative suppression) are obtained from direct same-commit evaluation on an explicitly constructed 200-case deterministic benchmark; no equations, fitted parameters, or self-referential definitions appear in the abstract or provided text that would reduce these reported figures back to the input data by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to support the central claims. The derivation chain is therefore self-contained against the stated benchmark and explicit scope limitations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The architecture introduces several new named components whose internal parameters and reward-mapping functions are not detailed in the abstract; beyond the high-level design, no explicit free parameters or axioms are enumerated, and the one invented entity is listed below.

invented entities (1)
  • RL Developer Memory — no independent evidence
    purpose: persistent, auditable memory store for long-horizon RL coding agents
    New named system presented as the central contribution; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5615 in / 1363 out tokens · 49943 ms · 2026-05-09T13:57:03.190492+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 32 canonical work pages · 6 internal anchors

  1. [1]

    A method for evaluating hyperparameter sensitivity in reinforcement learning

    Jacob Adkins, Michael Bowling, and Adam White. A method for evaluating hyperparameter sensitivity in reinforcement learning. In Advances in Neural Information Processing Systems, volume 37, pages 124820–124842, 2024. doi: 10.48550/arXiv.2412.07165. URL https://arxiv.org/abs/2412.07165.

  2. [2]

    Traces of memorisation in large language models for code

    Ali Al-Kaswan, Maliheh Izadi, and Arie van Deursen. Traces of memorisation in large language models for code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–12, 2024. doi: 10.1145/3597503.3639133. URL https://doi.org/10.1145/3597503.3639133.

  3. [3]

    Doubly robust policy evaluation and learning

    Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011.

  4. [4]

    Doubly Robust Policy Evaluation and Learning

    doi: 10.48550/arXiv.1103.4601. URL https://arxiv.org/abs/1103.4601.

  5. [5]

    Hyperparameters in reinforcement learning and how to tune them

    Theresa Eimer, Marius Lindauer, and Roberta Raileanu. Hyperparameters in reinforcement learning and how to tune them. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 9104–9149, 2023. doi: 10.48550/arXiv.2306.01324. URL https://proceedings.mlr.press/v202/eimer23a.html.

  6. [6]

    GRACG: Graph retrieval augmented code generation

    Konstantin Fedorov, Boris Zarubin, and Vladimir Ivanov. GRACG: Graph retrieval augmented code generation. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops, pages 291–298, 2025. doi: 10.1109/ASEW67777.2025.00060. URL https://doi.org/10.1109/ASEW67777.2025.00060.

  7. [7]

    Banditrank: Learning to rank using contextual bandits

    Phanideep Gampa and Sumio Fujita. Banditrank: Learning to rank using contextual bandits. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 259–271. Springer, 2021.

  8. [8]

    URL https://doi.org/10.1007/978-3-030-75768-7_21

    doi: 10.1007/978-3-030-75768-7_21. URL https://doi.org/10.1007/978-3-030-75768-7_21.

  9. [9]

    Stepping out of the shadows: Reinforcement learning in shadow mode

    Patrick Gassert and Matthias Althoff. Stepping out of the shadows: Reinforcement learning in shadow mode. arXiv preprint arXiv:2410.23419, 2024. doi: 10.48550/arXiv.2410.23419. URL https://arxiv.org/abs/2410.23419.

  10. [10]

    On the effectiveness of large language models in domain-specific code generation

    Xiaodong Gu, Meng Chen, Yalan Lin, Yuhan Hu, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, and Juhong Wang. On the effectiveness of large language models in domain-specific code generation. ACM Transactions on Software Engineering and Methodology, 34(3):78:1–78:28, 2025. doi: 10.1145/3697012. URL https://doi.org/10.1145/3697012.

  11. [11]

    Continuous safety assessment of updated supervised learning models in shadow mode

    Hatem Guissouma, Markus Zink, and Eric Sax. Continuous safety assessment of updated supervised learning models in shadow mode. In 2023 IEEE 20th International Conference on Software Architecture Companion, pages 301–308, 2023. doi: 10.1109/ICSA-C57050.2023.00069. URL https://doi.org/10.1109/ICSA-C57050.2023.00069.

  12. [12]

    Deep Reinforcement Learning that Matters

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3207–3214, 2018. doi: 10.48550/arXiv.1709.06560. URL https://arxiv.org/abs/1709.06560.

  13. [13]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024. doi: 10.48550/arXiv.2310.06770. URL https://arxiv.org/abs/2310.06770. OpenReview: https://openreview.net/forum?id=VTF8yNQM66.

  14. [14]

    Security threats in agentic AI system

    Raihan Khan, Soumik Sarkar, Shubha Kanti Mahata, and Erwin Jose. Security threats in agentic AI system. arXiv preprint arXiv:2410.14728, 2024. doi: 10.48550/arXiv.2410.14728. URL https://arxiv.org/abs/2410.14728.

  15. [15]

    Multi-agent reinforcement learning for interactive code debugging with human feedback and memory

    Anjana Krishnamoorthy, Kartik Ivatury, and Benyamin Ahmadnia. Multi-agent reinforcement learning for interactive code debugging with human feedback and memory. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 595–603, Varna, Bulgaria, 2025....

  16. [16]

    Conservative Q-learning for offline reinforcement learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pages 1179–1191, 2020. doi: 10.48550/arXiv.2006.04779. URL https://arxiv.org/abs/2006.04779.

  17. [17]

    Training language models to self-correct via reinforcement learning

    Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust. Training language models to self-correct via reinforcement learning. arXiv preprint...

  18. [18]

    Towards ethical personal AI applications: Practical considerations for AI assistants with long-term memory

    Eunji Lee. Towards ethical personal AI applications: Practical considerations for AI assistants with long-term memory. arXiv preprint arXiv:2409.11192, 2024. doi: 10.48550/arXiv.2409.11192. URL https://arxiv.org/abs/2409.11192.

  19. [19]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020. doi: 10.48550/...

  20. [20]

    A contextual-bandit approach to personalized news article recommendation

    Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010. doi: 10.1145/1772690.1772758. URL https://doi.org/10.1145/1772690.1772758.

  21. [21]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024. doi: 10.1162/tacl_a_00638. URL https://doi.org/10.1162/tacl_a_00638.

  22. [22]

    Normalization and effective learning rates in reinforcement learning

    Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning. In Advances in Neural Information Processing Systems, volume 37, pages 106440–106473, 2024. doi: 10.48550/arXiv.2407.01800. URL https://arxiv.org/abs/2407.01800.

  23. [23]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, 2023.

  24. [24]

    Self-Refine: Iterative Refinement with Self-Feedback

    doi: 10.48550/arXiv.2303.17651. URL https://arxiv.org/abs/2303.17651.

  25. [25]

    Model context protocol specification, 2026

    Model Context Protocol. Model context protocol specification, 2026. URL https://modelcontextprotocol.io/specification/2025-11-25. Accessed May 2, 2026.

  26. [26]

    LoCoBench-Agent: An interactive benchmark for LLM agents in long-context software engineering

    Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Roshan Ram, Akshara Prabhakar, Tulika Awalgaonkar, Zixiang Chen, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, and Huan Wang. LoCoBench-Agent: An interactive benchmark for LLM agents in ...

  27. [27]

    Latent reward: LLM-empowered credit assignment in episodic reinforcement learning

    Yun Qu, Yuhang Jiang, Boyuan Wang, Yixiu Mao, Cheems Wang, Chang Liu, and Xiangyang Ji. Latent reward: LLM-empowered credit assignment in episodic reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20095–20103, 2025.

  28. [28]

    Latent reward: LLM-empowered credit assignment in episodic reinforcement learning

    doi: 10.1609/aaai.v39i19.34213. URL https://doi.org/10.1609/aaai.v39i19.34213.

  29. [29]

    Exploring security risks and mitigation strategies in ai code helpers

    Harshith Sheggam and Xiaowen Zhang. Exploring security risks and mitigation strategies in AI code helpers. In 2024 IEEE Long Island Systems, Applications and Technology Conference, pages 1–6, 2024. doi: 10.1109/LISAT63094.2024.10807934. URL https://doi.org/10.1109/LISAT63094.2024.10807934.

  30. [30]

    Interactive recommendation via deep neural memory augmented contextual bandits

    Yilin Shen, Yang Deng, Avik Ray, and Hongxia Jin. Interactive recommendation via deep neural memory augmented contextual bandits. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 122–130, 2018. doi: 10.1145/3240323.3240344. URL https://doi.org/10.1145/3240323.3240344.

  31. [31]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023. doi: 10.48550/arXiv.2303.11366. URL https://arxiv.org/abs/2303.11366. OpenReview: https://openreview.net/forum?id=vAElhFcKW6.

  32. [32]

    Investigating the bugs in reinforcement learning programs: Insights from Stack Overflow and GitHub

    Jiayin Song, Yike Li, Yunzhe Tian, Haoxuan Ma, Honglei Li, Jie Zuo, Jiqiang Liu, and Wenjia Niu. Investigating the bugs in reinforcement learning programs: Insights from Stack Overflow and GitHub. Automated Software Engineering, 33(1):9, 2026. doi: 10.1007/s10515-025-00555-z. URL https://doi.org/10.1007/s10515-025-00555-z.

  33. [33]

    Counterfactual risk minimization: Learning from logged bandit feedback

    Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 814–823, 2015. URL https://proceedings.mlr.press/v37/swaminathan15.html.

  34. [34]

    LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024. doi: 10.48550/arXiv.2410.10813. URL https://arxiv.org/abs/2410.10813.

  35. [35]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024. doi: 10.48550/arXiv.2405.15793. URL https://arxiv.org/abs/2405.15793.

  36. [36]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2210.03629. URL https://arxiv.org/abs/2210.03629. OpenReview: https://openreview.net/forum?id=WE_vluYUL-X.

  37. [37]

    COMBO: Conservative offline model-based policy optimization

    Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. COMBO: Conservative offline model-based policy optimization. In Advances in Neural Information Processing Systems, volume 34, pages 28954–28967, 2021. doi: 10.48550/arXiv.2102.08363. URL https://arxiv.org/abs/2102.08363.

  38. [38]

    MemRL: Self-evolving agents via runtime reinforcement learning on episodic memory

    Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, Weinan Zhang, and Muning Wen. MemRL: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192, 2026. doi: 10.48550/arXiv.2601.03192. URL https://arxiv.org/abs/2601.03192.

  39. [39]

    MemoryBank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024. doi: 10.1609/aaai.v38i17.29946. URL https://doi.org/10.1609/aaai.v38i17.29946.