Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture
Pith reviewed 2026-05-09 13:57 UTC · model grok-4.3
The pith
A safety-gated memory architecture for RL coding agents reaches 80 percent expected-decision accuracy and full hard-negative suppression on a 200-case benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that feedback-normalized developer memory, implemented as an MCP-native architecture with issue_match ranking, issue_feedback reward mapping, and issue_record_resolution linking, enables reliable memory use for RL coding agents. Deterministic control and the full shadow/OPE configuration both achieve 80.0 percent expected-decision accuracy and 100.0 percent hard-negative suppression on the 200-case benchmark, while static validation passes 11/11 checks and dynamic integration passes 10/10 cases. The system requires theory-to-code metadata and review-gated governance for RL memories, and reports that active learned-policy deployment and full MCP interoperability remain unsupported.
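The issue_feedback step above maps raw labels to bounded rewards. A minimal sketch of what such a mapping could look like follows; the label names, reward values, and the [-1, 1] bound are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of an issue_feedback-style label-to-reward mapping.
# Label names and reward magnitudes are assumptions for illustration only.
LABEL_REWARDS = {
    "resolved": 1.0,     # a verified fix later linked via issue_record_resolution
    "helpful": 0.5,      # memory was used but was not decisive
    "irrelevant": 0.0,   # retrieval was ignored
    "misleading": -1.0,  # memory steered the agent toward an incorrect change
}

def issue_feedback(label: str) -> float:
    """Map a raw feedback label to a bounded reward in [-1, 1]."""
    reward = LABEL_REWARDS.get(label, 0.0)  # unknown labels default to neutral
    return max(-1.0, min(1.0, reward))      # clamp so rewards stay bounded
```

Bounding the reward keeps any downstream bandit update insensitive to mislabeled or adversarial feedback, which matches the paper's conservative posture.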
What carries the argument
The safety-gated MCP architecture consisting of a deterministic ranker in production, a contextual-bandit residual policy in shadow mode, and conservative OPE gates that only permit canary influence after evaluation.
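This three-part control flow can be sketched as follows. The function name, its arguments, and the canary flag are illustrative; the logic is a reading of the architecture description, not the paper's implementation.

```python
def select_memory(candidates, det_score, bandit_score, ope_gate_passed, canary=False):
    """Hedged sketch of the gated selection flow: the deterministic ranker
    always decides in production; the bandit's preference is only logged,
    and it may affect the outcome on canary traffic solely after the
    conservative OPE gate has passed."""
    det_choice = max(candidates, key=det_score)        # production decision
    shadow_choice = max(candidates, key=bandit_score)  # logged, not acted on
    telemetry = {"deterministic": det_choice, "shadow": shadow_choice}
    if canary and ope_gate_passed:
        return shadow_choice, telemetry  # learned policy, canary traffic only
    return det_choice, telemetry         # default: deterministic control
```

The point of the structure is that removing the bandit entirely leaves behavior unchanged outside gated canary traffic, which is consistent with the reported accuracy parity between the two configurations.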
If this is right
- Deterministic control alone achieves 80.0 percent expected-decision accuracy and 100.0 percent hard-negative suppression.
- The full shadow/OPE configuration matches the deterministic accuracy while adding learning telemetry.
- Static validation passes all 11 checks and dynamic integration passes all 10 cases.
- RL and control memories require theory-to-code metadata plus review-gated governance.
- Active learned-policy deployment and official-client MCP interoperability remain unsupported.
Where Pith is reading between the lines
- The same gating approach could be tested on non-RL coding agents that still need auditable memory across long repository sessions.
- A larger and more diverse benchmark drawn from actual developer workflows would clarify whether the current 40 residual non-RL failures are representative.
- Extending the telemetry to influence future Bellman updates directly could close the loop between memory decisions and policy improvement.
Load-bearing premise
The 200-case deterministic benchmark, built around RL algorithm bugs, hard negatives, and review-gated cases, sufficiently represents the distribution of real long software-engineering episodes where memory details affect learning targets or validation claims.
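For concreteness, a deterministic benchmark case of the kind this premise relies on might be structured as below. Every field name, the category strings, and the "suppress" convention for hard negatives are assumptions for illustration; the paper does not publish its case schema.

```python
from dataclasses import dataclass

# The four case types named in the abstract; the string forms are assumed.
CATEGORIES = {"rl_algorithm_bug", "hard_negative", "review_gated", "low_risk_failure"}

@dataclass(frozen=True)
class BenchmarkCase:
    case_id: str
    category: str           # one of CATEGORIES
    query: str              # retrieval context presented to issue_match
    expected_decision: str  # memory id to surface, or "suppress" for hard negatives

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
```

A schema like this would let the category breakdown the referee requests be computed mechanically from the case files.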
What would settle it
A live full-configuration deployment in which the learned policy is activated and either accuracy drops below 80 percent, hard-negative suppression fails, or latency regresses beyond the reported limits.
Original abstract
Large language model (LLM) coding agents increasingly operate over repositories, terminals, tests, and execution traces across long software-engineering episodes. Persistent memory is useful, but static vector stores or generic retrieval-augmented generation (RAG) are insufficient for reinforcement-learning (RL) code development, where small details can alter Bellman targets, terminal masks, gradient flow, or validation claims. This paper presents RL Developer Memory, a local-first, Model Context Protocol (MCP)-native developer-memory architecture for RL coding agents. It treats memory selection as a logged contextual decision process: issue_match ranks candidates and records telemetry, issue_feedback maps raw labels to bounded rewards, and issue_record_resolution links verified resolutions to earlier retrieval events. A deterministic ranker remains deployed, while a contextual-bandit residual policy runs in shadow mode and can affect canary behavior only through conservative off-policy-evaluation (OPE) gates. RL/control memories require theory-to-code metadata and review-gated governance. The system is evaluated on a deterministic 200-case benchmark with RL algorithm bugs, hard negatives, review-gated RL/control cases, and low-risk failures. In the same-commit comparison, deterministic control and full shadow/OPE both achieve 80.0% expected-decision accuracy and 100.0% hard-negative suppression; the full configuration adds learning telemetry rather than accuracy gain. Static validation passed 11/11 checks; dynamic integration passed 10/10 cases. The evidence reports limits: active learned-policy deployment and official-client MCP interoperability are unsupported, live full-configuration latency regresses, and 40 residual non-RL failures remain. The contribution is an auditable memory-control architecture with explicit claim boundaries, not a universal coding-agent improvement claim.
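One plausible shape for the conservative OPE gate the abstract describes is an inverse-propensity-scoring estimate whose lower confidence bound must beat the deterministic baseline. The estimator choice, confidence level, and minimum sample size below are all assumptions, since the gate's actual form is not specified here.

```python
import math

def ope_gate(logged, baseline_value, z=1.96, min_n=100):
    """Conservative gate sketch: allow canary influence only if the IPS
    estimate's lower confidence bound exceeds the deterministic baseline.
    `logged` holds (reward, target_prob, logging_prob) tuples from shadow logs."""
    if len(logged) < min_n:
        return False  # too little evidence: stay with the deterministic ranker
    weights = [r * p_t / p_b for r, p_t, p_b in logged]  # IPS-weighted rewards
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / max(n - 1, 1)
    lower = mean - z * math.sqrt(var / n)  # normal-approximation lower bound
    return lower > baseline_value
```

Gating on a lower confidence bound rather than the point estimate is what makes the gate conservative: a noisy but optimistic shadow policy stays locked out until the evidence is tight.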
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RL Developer Memory, a local-first MCP-native architecture for persistent memory in RL coding agents. Memory selection is modeled as a logged contextual decision process with issue_match for ranking and telemetry, issue_feedback for mapping labels to bounded rewards, and issue_record_resolution for linking verified outcomes. A deterministic ranker operates in production while a contextual-bandit residual policy runs in shadow mode, subject to conservative OPE gates and review-gated governance for RL/control memories. The system is evaluated on a deterministic 200-case benchmark containing RL algorithm bugs, hard negatives, review-gated cases, and low-risk failures. In same-commit comparisons, both deterministic control and full shadow/OPE configurations achieve 80.0% expected-decision accuracy and 100.0% hard-negative suppression; the full configuration supplies learning telemetry without accuracy improvement. Static validation passes 11/11 checks and dynamic integration passes 10/10 cases, with explicit disclaimers that active learned-policy deployment, official-client MCP interoperability, and latency improvements are unsupported.
Significance. If the reported benchmark results hold, the work supplies a practical, auditable pattern for safely incorporating memory and residual learning into LLM coding agents. Its emphasis on explicit safety gates (shadow mode, OPE, review gating), transparent synthetic benchmark construction, and bounded claims (performance parity rather than superiority, plus listed limitations) provides a useful template for responsible engineering in the RL-for-code domain. The contribution is primarily architectural and governance-oriented rather than a universal performance advance.
major comments (1)
- [Evaluation] Evaluation (benchmark description): While the abstract states that the 200-case benchmark includes RL algorithm bugs, hard negatives, review-gated cases, and low-risk failures, the manuscript provides no quantitative breakdown of case construction, difficulty distribution, or how hard negatives were defined and validated. This detail is load-bearing for interpreting the 80% accuracy and 100% suppression figures as representative of the targeted RL memory issues.
minor comments (2)
- [Evaluation] The abstract and evaluation sections would benefit from a brief table summarizing the 200 cases by category (e.g., bug type, risk level) to improve reproducibility without expanding scope.
- Notation for the contextual-bandit residual policy and OPE gates could be clarified with a short pseudocode listing of the decision flow, as the textual description of shadow-mode gating is dense.
Simulated Author's Rebuttal
We thank the referee for the careful review and the recommendation of minor revision. The single major comment concerns the lack of quantitative detail in the benchmark description; we address it directly below.
Point-by-point responses
- Referee: While the abstract states that the 200-case benchmark includes RL algorithm bugs, hard negatives, review-gated cases, and low-risk failures, the manuscript provides no quantitative breakdown of case construction, difficulty distribution, or how hard negatives were defined and validated. This detail is load-bearing for interpreting the 80% accuracy and 100% suppression figures as representative of the targeted RL memory issues.
  Authors: We agree that the current manuscript does not supply a quantitative breakdown of the 200-case benchmark composition, the definition and validation criteria for hard negatives, or the difficulty distribution. This information would improve the interpretability of the reported metrics. In the revised version we will add a new subsection (or table) under Evaluation that enumerates the case counts by category, describes the deterministic construction process, specifies how hard negatives were identified and validated, and reports any available difficulty or coverage statistics. These additions will be placed before the accuracy and suppression results so readers can assess representativeness.
  Revision: yes
Circularity Check
No significant circularity identified
full rationale
The manuscript describes an MCP-native memory architecture for RL coding agents, with components (issue_match ranking, issue_feedback reward mapping, issue_record_resolution linking) presented as design choices rather than derived quantities. Core performance numbers (80% expected-decision accuracy, 100% hard-negative suppression) are obtained from direct same-commit evaluation on an explicitly constructed 200-case deterministic benchmark; no equations, fitted parameters, or self-referential definitions appear in the abstract or provided text that would reduce these reported figures back to the input data by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to support the central claims. The derivation chain is therefore self-contained against the stated benchmark and explicit scope limitations.
Axiom & Free-Parameter Ledger
invented entities (1)
- RL Developer Memory: no independent evidence
Reference graph
Works this paper leans on
- [1] Jacob Adkins, Michael Bowling, and Adam White. A method for evaluating hyperparameter sensitivity in reinforcement learning. In Advances in Neural Information Processing Systems, volume 37, pages 124820–124842, 2024. doi: 10.48550/arXiv.2412.07165. URL: https://arxiv.org/abs/2412.07165
- [2] Ali Al-Kaswan, Maliheh Izadi, and Arie van Deursen. Traces of memorisation in large language models for code. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–12, 2024. doi: 10.1145/3597503.3639133. URL: https://doi.org/10.1145/3597503.3639133
- [3, 4] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011. doi: 10.48550/arXiv.1103.4601. URL: https://arxiv.org/abs/1103.4601
- [5] Theresa Eimer, Marius Lindauer, and Roberta Raileanu. Hyperparameters in reinforcement learning and how to tune them. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 9104–9149, 2023. doi: 10.48550/arXiv.2306.01324. URL: https://proceedings.mlr.press/v202/eimer23a.html
- [6] Konstantin Fedorov, Boris Zarubin, and Vladimir Ivanov. GRACG: Graph retrieval augmented code generation. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering Workshops, pages 291–298, 2025. doi: 10.1109/ASEW67777.2025.00060. URL: https://doi.org/10.1109/ASEW67777.2025.00060
- [7, 8] Phanideep Gampa and Sumio Fujita. BanditRank: Learning to rank using contextual bandits. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 259–271. Springer. doi: 10.1007/978-3-030-75768-7_21. URL: https://doi.org/10.1007/978-3-030-75768-7_21
- [9] Patrick Gassert and Matthias Althoff. Stepping out of the shadows: Reinforcement learning in shadow mode. arXiv preprint arXiv:2410.23419, 2024. doi: 10.48550/arXiv.2410.23419. URL: https://arxiv.org/abs/2410.23419
- [10] Xiaodong Gu, Meng Chen, Yalan Lin, Yuhan Hu, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, and Juhong Wang. On the effectiveness of large language models in domain-specific code generation. ACM Transactions on Software Engineering and Methodology, 34(3):78:1–78:28, 2025. doi: 10.1145/3697012. URL: https://doi.org/10.1145/3697012
- [11] Hatem Guissouma, Markus Zink, and Eric Sax. Continuous safety assessment of updated supervised learning models in shadow mode. In 2023 IEEE 20th International Conference on Software Architecture Companion, pages 301–308, 2023. doi: 10.1109/ICSA-C57050.2023.00069. URL: https://doi.org/10.1109/ICSA-C57050.2023.00069
- [12] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3207–3214, 2018. doi: 10.48550/arXiv.1709.06560. URL: https://arxiv.org/abs/1709.06560
- [13] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024. doi: 10.48550/arXiv.2310.06770. URL: https://arxiv.org/abs/2310.06770. OpenReview: https://openreview.net/forum?id=VTF8yNQM66
- [14] Raihan Khan, Soumik Sarkar, Shubha Kanti Mahata, and Erwin Jose. Security threats in agentic AI system. arXiv preprint arXiv:2410.14728, 2024. doi: 10.48550/arXiv.2410.14728. URL: https://arxiv.org/abs/2410.14728
- [15] Anjana Krishnamoorthy, Kartik Ivatury, and Benyamin Ahmadnia. Multi-agent reinforcement learning for interactive code debugging with human feedback and memory. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 595–603, Varna, Bulgaria, 2025.
- [16] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pages 1179–1191, 2020. doi: 10.48550/arXiv.2006.04779. URL: https://arxiv.org/abs/2006.04779
- [17] Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust. Training language models to self-correct via reinforcement learning. arXiv preprint, 2024.
- [18] Eunji Lee. Towards ethical personal AI applications: Practical considerations for AI assistants with long-term memory. arXiv preprint arXiv:2409.11192, 2024. doi: 10.48550/arXiv.2409.11192. URL: https://arxiv.org/abs/2409.11192
- [19] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020. doi: 10.48550/arXiv.2005.11401.
- [20] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010. doi: 10.1145/1772690.1772758. URL: https://doi.org/10.1145/1772690.1772758
- [21] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024. doi: 10.1162/tacl_a_00638. URL: https://doi.org/10.1162/tacl_a_00638
- [22] Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning. In Advances in Neural Information Processing Systems, volume 37, pages 106440–106473, 2024. doi: 10.48550/arXiv.2407.01800. URL: https://arxiv.org/abs/2407.01800
- [23, 24] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, volume 36, 2023. doi: 10.48550/arXiv.2303.17651. URL: https://arxiv.org/abs/2303.17651
- [25] Model Context Protocol. Model Context Protocol specification, 2026. URL: https://modelcontextprotocol.io/specification/2025-11-25. Accessed May 2, 2026.
- [26] Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Roshan Ram, Akshara Prabhakar, Tulika Awalgaonkar, Zixiang Chen, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, and Huan Wang. LoCoBench-Agent: An interactive benchmark for LLM agents in long-context software engineering.
- [27, 28] Yun Qu, Yuhang Jiang, Boyuan Wang, Yixiu Mao, Cheems Wang, Chang Liu, and Xiangyang Ji. Latent reward: LLM-empowered credit assignment in episodic reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20095–20103, 2025. doi: 10.1609/aaai.v39i19.34213. URL: https://doi.org/10.1609/aaai.v39i19.34213
- [29] Harshith Sheggam and Xiaowen Zhang. Exploring security risks and mitigation strategies in AI code helpers. In 2024 IEEE Long Island Systems, Applications and Technology Conference, pages 1–6, 2024. doi: 10.1109/LISAT63094.2024.10807934. URL: https://doi.org/10.1109/LISAT63094.2024.10807934
- [30] Yilin Shen, Yang Deng, Avik Ray, and Hongxia Jin. Interactive recommendation via deep neural memory augmented contextual bandits. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 122–130, 2018. doi: 10.1145/3240323.3240344. URL: https://doi.org/10.1145/3240323.3240344
- [31] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023. doi: 10.48550/arXiv.2303.11366. URL: https://arxiv.org/abs/2303.11366. OpenReview: https://openreview.net/forum?id=vAElhFcKW6
- [32] Jiayin Song, Yike Li, Yunzhe Tian, Haoxuan Ma, Honglei Li, Jie Zuo, Jiqiang Liu, and Wenjia Niu. Investigating the bugs in reinforcement learning programs: Insights from Stack Overflow and GitHub. Automated Software Engineering, 33(1):9, 2026. doi: 10.1007/s10515-025-00555-z. URL: https://doi.org/10.1007/s10515-025-00555-z
- [33] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 814–823, 2015. URL: https://proceedings.mlr.press/v37/swaminathan15.html
- [34] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024. doi: 10.48550/arXiv.2410.10813. URL: https://arxiv.org/abs/2410.10813
- [35] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024. doi: 10.48550/arXiv.2405.15793. URL: https://arxiv.org/abs/2405.15793
- [36] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023. doi: 10.48550/arXiv.2210.03629. URL: https://arxiv.org/abs/2210.03629. OpenReview: https://openreview.net/forum?id=WE_vluYUL-X
- [37] Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. COMBO: Conservative offline model-based policy optimization. In Advances in Neural Information Processing Systems, volume 34, pages 28954–28967, 2021. doi: 10.48550/arXiv.2102.08363. URL: https://arxiv.org/abs/2102.08363
- [38] Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, Weinan Zhang, and Muning Wen. MemRL: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192, 2026. doi: 10.48550/arXiv.2601.03192. URL: https://arxiv.org/abs/2601.03192
- [39] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024. doi: 10.1609/aaai.v38i17.29946. URL: https://doi.org/10.1609/aaai.v38i17.29946
discussion (0)