
arxiv: 2606.05646 · v1 · pith:7HAMS7ARnew · submitted 2026-06-04 · 💻 cs.SE · cs.AI

Enhancing Software Engineering Through Closed-Loop Memory Optimization

Pith reviewed 2026-06-28 00:40 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords memory augmentation · software engineering agents · LLM agents · closed-loop optimization · downstream impact · task-agnostic evaluation

The pith

Closed-loop memory framework defines utility from downstream task impact to improve SE agents without labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that software engineering agents, which are currently episodic and reconstruct context from scratch on every task, can be made to retain and refine experiences through a closed-loop memory system. Memory utility is grounded directly in whether past experiences improve success on new tasks, turning that impact into both a benchmark for evaluation and a signal for optimization. This approach requires no task-specific knowledge or manual annotations. Sympathetic readers would care because it offers a general way to reduce repeated mistakes and computational waste in agents that navigate codebases. Experiments across single-episode and cross-episode settings show consistent gains.

Core claim

MemOp is a closed-loop framework that grounds memory utility in validated downstream impact, establishing it as a task-agnostic evaluation benchmark and annotation-free optimization signal; complementary evaluation on single-episode and cross-episode memory augmentation shows it improves SE agents with absolute gains of up to 5.25% success rate and 4.63% resolve efficiency while reducing computational cost by at least 9.79%.

What carries the argument

MemOp, the closed-loop framework that treats validated downstream task impact as the sole definition of memory utility for both evaluation and optimization.
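
To make the idea of utility as downstream impact concrete, here is a minimal sketch, not the paper's implementation. It assumes a hypothetical agent interface `run_agent(task, memory)` that returns whether the task was resolved, and it scores a memory by the change in resolve rate it produces on a set of tasks. The `Outcome` type, the toy agent, and the task names are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Outcome:
    resolved: bool  # whether the agent resolved the task
    cost: float     # hypothetical cost unit (e.g. steps or tokens)


# Hypothetical agent interface: (task, memory or None) -> Outcome
RunAgent = Callable[[str, Optional[str]], Outcome]


def memory_utility(run_agent: RunAgent, tasks: Sequence[str], memory: str) -> float:
    """Change in resolve rate when `memory` is supplied versus withheld."""
    if not tasks:
        raise ValueError("need at least one task to measure utility")
    with_memory = sum(run_agent(t, memory).resolved for t in tasks)
    without_memory = sum(run_agent(t, None).resolved for t in tasks)
    return (with_memory - without_memory) / len(tasks)


def toy_agent(task: str, memory: Optional[str]) -> Outcome:
    """Toy stand-in: 'easy' tasks always resolve; 'pytest' tasks need a memory."""
    resolved = task.startswith("easy") or (memory is not None and "pytest" in task)
    return Outcome(resolved=resolved, cost=1.0)


tasks = ["easy-1", "pytest-2", "hard-3", "pytest-4"]
utility = memory_utility(toy_agent, tasks, "install pytest before running tests")
print(utility)  # 0.5: resolve rate rises from 1/4 to 3/4
```

No label or task-specific rule enters the score; only the agent's outcomes with and without the memory do, which is the property the summary attributes to the framework.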

If this is right

  • SE agents can retain and reuse experiences across tasks instead of reconstructing context each time.
  • Memory selection becomes possible without task-specific annotations or human labels.
  • Performance gains appear in both single-task and multi-task memory augmentation settings.
  • Computational cost drops while success rate and resolve efficiency rise.
  • The same utility signal serves simultaneously as an evaluation metric and an optimization objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same downstream-impact definition could be applied to memory management in LLM agents outside software engineering.
  • If the utility signal proves stable, it could support iterative refinement of memory stores over many episodes without external supervision.
  • Cross-episode augmentation may compound gains over time if early improvements feed into later memory selections.

Load-bearing premise

Memory utility can be defined in a task-agnostic way solely from validated downstream task impact without needing task-specific knowledge or manual labels.

What would settle it

A controlled experiment in which memory selected purely by downstream impact either fails to raise success rate or raises computational cost when applied to a new set of SE tasks or a different base agent.
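
The settling experiment can be sketched as a paired comparison. This is a minimal illustration under stated assumptions: each run is a hypothetical `(resolved, cost)` pair for one task, and the claim is counted as failing if memory does not raise the success rate or if it raises mean cost. The numbers are invented for illustration and are not results from the paper.

```python
from statistics import mean
from typing import List, Tuple

Run = Tuple[bool, float]  # (resolved, cost) for one task; hypothetical format


def summarize(runs: List[Run]) -> Tuple[float, float]:
    """Return (success rate, mean cost) over a list of runs."""
    success_rate = mean(1.0 if resolved else 0.0 for resolved, _ in runs)
    mean_cost = mean(cost for _, cost in runs)
    return success_rate, mean_cost


def claim_fails(baseline: List[Run], with_memory: List[Run]) -> bool:
    """True if memory fails to raise success rate or raises mean cost."""
    base_sr, base_cost = summarize(baseline)
    mem_sr, mem_cost = summarize(with_memory)
    return mem_sr <= base_sr or mem_cost > base_cost


# Illustrative numbers only.
baseline = [(True, 10.0), (False, 12.0), (False, 14.0), (True, 8.0)]
helped = [(True, 9.0), (True, 10.0), (False, 13.0), (True, 8.0)]
no_gain = [(True, 12.0), (False, 12.0), (False, 14.0), (True, 10.0)]

print(claim_fails(baseline, helped))   # False
print(claim_fails(baseline, no_gain))  # True
```

A real study would also need a significance test and repeated seeds; this sketch only encodes the pass/fail condition stated above.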

Figures

Figures reproduced from arXiv: 2606.05646 by Graham Neubig, Qingyun Wang, Xingyao Wang, Xuehang Guo, Zora Zhiruo Wang.

Figure 1: Memory-Augmented Software Engineering. Compared with no-Mθ SE agent (left), MemOp (right) equips SE agents with adaptively distilled memories to better tackle dynamic real-world SE challenges. The emergence of large language models has catalyzed a paradigm shift in software engineering, enabling LLM agents capable of addressing complex real-world SE tasks (Jin et al., 2025; Guo et al., 2025b). Through t…
Figure 2: Memory Utility. We tackle the fundamental challenge (§2.2) by proposing memory utility with performance-grounded memory evaluation and trajectory-level rejection sampling. Without such measures, it is impossible to distinguish good memory from noise, or even to leverage quality signals to drive learning (§5&B). We tackle this with a concrete, outcome-grounded definition: a memory is useful if and only…
Figure 3: Memory Model Finetuning. By preparing training datasets (Tab. 4) through trajectory-based rejection sampling (§3.2), Mθ is finetuned through two-stage training via SFT and RL (§3.3). where Q(·) is a composite memory utility function combining all N_Q multi-dimensional metrics (§3.1) across task performance and problem-solving efficiency, and M*_θ is the optimal memory evolution function MemOp aims to learn.…
Figure 4: Adaptability to Different FT Stages. All training stages contribute to performance improvements.
Figure 5: MemOp Algorithm Generalizability. MemOp is also useful when applying different RL algorithms.
Figure 6: Preliminary Analysis on SE Agent Failures. Through manual analysis, we identify seven failure patterns in SE agent problem-solving. How do SE agents fail in software engineering tasks? Understanding the nature of SE agent failures is critical for improving their problem-solving. Motivated by this, we perform a preliminary analysis on key failure patterns in SE agents, categorizing common causes to inform …
Figure 7: Failure Case of Repository Structure. The SE agent fails due to its incorrect understanding of the repository structure. Repetition EXECUTION RESULT of [execute_bash]: /opt/miniconda3/envs/testbed/bin/python: No module named pytest [The command completed with exit code 1.] [Current working directory: /workspace/django__django__3.0] [Python interpreter: /opt/miniconda3/envs/testbed/bin/python] [Command fini…
Figure 8: Failure Case of Repetition. The SE agent repeats the same error as in earlier attempts.
Figure 9: Failure Case of Reasoning Error. The SE agent fails due to its incorrect reasoning across multiple attempts.
Figure 10: Failure Case of Coding Error. The SE agent fails due to coding errors, such as SyntaxError, NameError, AttributeError, etc. Execution Error Let me try a completely different angle - what if `list(cv)` is still called somewhere else even in this fixed approach? But I think actually the fundamental understanding must be incorrect - I just realized maybe the problem isn't what I suspected. Looking carefully …
Figure 11: Failure Case of Execution Error. The SE agent fails due to execution errors.
Figure 12: Failure Case of Inconsistency. The SE agent fails due to the inconsistency between its reasoning and action.
Figure 13: Failure Case of Hallucination. The SE agent fails due to hallucinating actions or experiences.
Figure 14: Preliminary Study on Memory Instruction. In addition to qualitative analysis (…
Figure 15: Memory Instruction. We conduct the preliminary study on memory reflection through three versions of instructions: (1) general and concise instruction, (2) high-level instruction, and (3) fine-grained instruction.
Figure 16: Repo-Wise Comparison between Baseline and MemOp. Compared to no-Mθ baseline SE agent (base LLM: Qwen3-Coder-30B-A3B), SE agent (base LLM: Qwen3-Coder-30B-A3B) with MemOp (backbone LLM: Qwen3-4B-T) consistently outperforms no-Mθ baseline across nine disparate repositories.
Figure 17: Instructions for Single-Episode and Cross-Episode Memory Generation. Our memory generation instructions for single-episode and cross-episode memory-augmented software engineering settings.
Figure 18: Evaluation Set Distribution. During experiments, to avoid evaluation circularity, we randomly sample 100 evaluation instances that have no overlap with the 100 instances used to construct our training dataset. In cross-episode evaluation, all instances of each repository are evaluated according to their temporal order to simulate real-world codebase evolution.
Figure 19: MemOp for Memory-Augmented Software Engineering. MemOp finetunes Mθ to augment SE agents through adaptive memory generation (§2.3 & §G.2). Our ablation studies include single-episode memory generation, episode-level memory evolution (§4), and action-level memory evolution (§G.2). G.1 Repository-Wise Generalizability of MemOp Extending our discussion in §4.3, …
Figure 20: Memory Evolution Granularity. We compare SE agent performance among no-Mθ, MemOp in cross-action, and MemOp in cross-episode memory evolution settings. To investigate this fundamental question, we employ Qwen3-Coder-30B-A3B to power SE agent, with MemOp using Qwen3-4B-T as Mθ backbone. We evaluate memory-augmented software engineering on the same test set under both action-level and episode-level memory…
Figure 21: Effects of Preference Rollout Batch Size Configuration on Mθ Optimization. To investigate the effect of rollout batch size c in DRL (§3.2), we compare SE agent performance with Mθ using the same backbone LLM (Qwen3-4B-Thinking), finetuned on DRL with c = 2 and c = 4, respectively.
Figure 22: MemOp Improves SE Performance with Reduced Computational Cost. MemOp enhances SE agent across single-episode and cross-episode settings with reduced computational cost. G.5 MemOp for More Robust Software Engineering To systematically compare the augmentation effects of MemOp, we use Qwen3-Coder-30B to power the SE agent with Qwen3-4B-T as the memory backbone of finetuned Mθ, and compare their mean perform…
Figure 23: MemOp Enhances SE Agent Performance Robustness. Error bars across all evaluation metrics demonstrate that MemOp consistently improves SE agent performance with reduced variance, reflecting greater robustness over the no-Mθ baseline. G.6 Qualitative Analysis on Memory Augmentation Success & Failure To better understand the effectiveness of Mθ in memory generation and evolution, we conduct case studies to q…
Figure 24: Examples of Effective Memory Reflection. The generated memories effectively support SE agent to successfully resolve the task with enhanced problem-solving efficiency. Examples extracted from SE agent powered by Qwen3-Coder-30B-A3B and MemOp with Mθ powered by Qwen3-4B-T (FT). Trajectory history details are omitted as [...] for clarity.
Figure 25: Examples of Ineffective Memory Reflection. The generated memories fail to effectively support SE agent. Examples extracted from SE agent powered by Qwen3-Coder-30B-A3B and MemOp with Mθ powered by Claude-4-Sonnet (NFT). Trajectory history details are omitted as [...] for clarity.
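
The captions of Figures 2 and 3 mention trajectory-level rejection sampling as the way training data is prepared. The following is a minimal sketch of that general technique, not the paper's procedure: candidate memories are scored by a hypothetical `utility` callable, and only those whose measured utility exceeds a threshold are kept. The candidate strings, scores, and threshold are illustrative assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def rejection_sample(
    candidates: Iterable[str],
    utility: Callable[[str], float],
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Keep (memory, utility) pairs with utility strictly above `threshold`,
    sorted from most to least useful."""
    kept = []
    for memory in candidates:
        score = utility(memory)
        if score > threshold:
            kept.append((memory, score))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical measured utilities (e.g. change in resolve rate per memory).
measured = {
    "install pytest before running tests": 0.25,
    "retry the same command": -0.10,
    "check the repository layout first": 0.0,
}

accepted = rejection_sample(measured, measured.__getitem__)
print(accepted)  # [('install pytest before running tests', 0.25)]
```

Memories that leave outcomes unchanged or make them worse are rejected, so only those with a positive measured effect would reach a training set in this sketch.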
read the original abstract

Large language models (LLMs) have enabled powerful software engineering (SE) agents capable of navigating complex codebases and resolving real-world issues. However, these agents remain fundamentally episodic: they fail to retain, refine, and reuse experiences across tasks, repeatedly reconstructing context from scratch and reproducing similar mistakes. Even with memory support, they offer no remedy for the absence of a principled, task-agnostic memory utility, making them difficult to evaluate rigorously or generalize across agents and settings. To tackle these limitations, we introduce MemOp, a closed-loop framework for memory augmentation in SE agents. MemOp grounds memory utility in validated downstream impact, establishing utility as both a task-agnostic evaluation benchmark and an annotation-free optimization signal. Through complementary evaluation on single-episode and cross-episode memory augmentation, results demonstrate that MemOp consistently improves SE agents across settings, achieving absolute gains of up to ↑5.25% in success rate and ↑4.63% in resolve efficiency, while substantially reducing computational cost by ≥9.79%. Our project page: https://xhguo7.github.io/MemOp/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MemOp, a closed-loop framework for memory augmentation in SE agents. It grounds memory utility in validated downstream task impact, establishing it as both a task-agnostic evaluation benchmark and an annotation-free optimization signal. Complementary evaluations on single-episode and cross-episode memory augmentation are claimed to show consistent improvements, with absolute gains of up to ↑5.25% in success rate, ↑4.63% in resolve efficiency, and ≥9.79% reduction in computational cost.

Significance. If the results hold under proper verification, the closed-loop construction offers a task-agnostic way to optimize and evaluate memory in LLM-based SE agents, addressing their episodic limitations without requiring manual labels or task-specific knowledge. This could improve generalizability across agents and settings.

major comments (2)
  1. [Abstract] The abstract states quantitative improvements (↑5.25% success rate, ↑4.63% resolve efficiency, ≥9.79% cost reduction) but supplies no experimental protocol, baselines, statistical tests, dataset details, or error analysis; the central performance claims cannot be evaluated from the given text.
  2. [Abstract] The closed-loop design risks circularity if task success is used both to define and to optimize the same memory utility signal; the manuscript must demonstrate that the downstream-impact signal is independent of the optimization loop (see weakest assumption in stress-test note).
minor comments (1)
  1. The project page link is provided but the manuscript should include a brief summary of what additional materials (code, data) are available there.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract] The abstract states quantitative improvements (↑5.25% success rate, ↑4.63% resolve efficiency, ≥9.79% cost reduction) but supplies no experimental protocol, baselines, statistical tests, dataset details, or error analysis; the central performance claims cannot be evaluated from the given text.

    Authors: The abstract is intentionally concise to summarize contributions and results. Full details on the experimental protocol, baselines (standard SE agents and memory-augmented variants), statistical tests, datasets (SE benchmarks used), and error analysis appear in Sections 4 and 5. We will revise the abstract to briefly reference the evaluation settings and primary baselines for improved clarity. revision: yes

  2. Referee: [Abstract] The closed-loop design risks circularity if task success is used both to define and to optimize the same memory utility signal; the manuscript must demonstrate that the downstream-impact signal is independent of the optimization loop (see weakest assumption in stress-test note).

    Authors: We appreciate the concern about potential circularity. The memory utility is computed from downstream impact on a held-out validation task set that is excluded from the optimization loop; the loop then uses this precomputed signal for memory selection, while final gains are measured on disjoint test tasks. This separation maintains independence. We will add explicit discussion of the separation and address the noted stress-test assumptions in the revised manuscript. revision: yes
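
The separation described in this simulated response can be sketched as a three-way split with an explicit disjointness check. This is an illustrative assumption about how such a split might be enforced, not the authors' code; the split names, sizes, and task identifiers are hypothetical.

```python
import random
from typing import Dict, Iterable, List


def split_tasks(task_ids: Iterable[str], n_validate: int, n_test: int,
                seed: int = 0) -> Dict[str, List[str]]:
    """Deterministically split tasks into optimize / validate / test sets."""
    ids = sorted(set(task_ids))
    if n_validate + n_test >= len(ids):
        raise ValueError("not enough tasks left for the optimization split")
    random.Random(seed).shuffle(ids)
    return {
        "validate": ids[:n_validate],                   # utility is measured here
        "test": ids[n_validate:n_validate + n_test],    # final gains are reported here
        "optimize": ids[n_validate + n_test:],          # the loop only sees these
    }


def assert_disjoint(splits: Dict[str, List[str]]) -> None:
    """Raise if any task appears in more than one split."""
    seen = set()
    for name, ids in splits.items():
        overlap = seen & set(ids)
        if overlap:
            raise ValueError(f"split {name!r} overlaps earlier splits: {sorted(overlap)}")
        seen |= set(ids)


splits = split_tasks([f"task-{i}" for i in range(10)], n_validate=3, n_test=3)
assert_disjoint(splits)  # passes silently when the splits share no task
print({name: len(ids) for name, ids in splits.items()})
```

Disjoint task sets address leakage between optimization and evaluation, though they do not by themselves show that the utility signal is informative.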

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper explicitly frames its core contribution as a closed-loop construction that defines memory utility directly from validated downstream task impact and then uses that same impact as both benchmark and optimization signal. This is presented as an intentional design choice rather than a hidden reduction. No equations, fitted parameters, or self-citations are exhibited in the provided text that would make the reported gains (success rate, resolve efficiency, cost reduction) equivalent to the inputs by construction. The empirical outcomes are described as results of applying the framework, and the argument remains self-contained once the closed-loop premise is granted. No load-bearing step reduces to a self-definition or fitted input in the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

49 extracted references · 12 canonical work pages

  1. From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future. 2025.
  2. A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System. 2025.
  3. Position: Humans are Missing from AI Coding Agent Research. 2026.
  4. OpenHands: An Open Platform for AI Software Developers as Generalist Agents. 2025.
  5. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. ArXiv.
  6. Claude Code. 2025.
  7. CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges. Annual Meeting of the Association for Computational Linguistics.
  8. MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution. ArXiv.
  9. SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering. arXiv preprint arXiv:2502.06994.
  10. PatchPilot: A Cost-Efficient Software Engineering Agent with Early Attempts on Formal Verification. International Conference on Machine Learning.
  11. GPT Docs. 2025.
  12. Claude Docs. 2025.
  13. GPT API Pricing. 2025.
  14. Claude API Pricing. 2025.
  15. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics.
  16. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. ArXiv.
  17. Li, Jiaqi and Wang, Mengmeng and Zheng, Zilong and Zhang, Muhan. LooGLE: Can Long-Context Language Models Understand Long Contexts? Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.859
  18. Hosseini, Peyman and Castro, Ignacio and Ghinassi, Iacopo and Purver, Matthew. Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly. Proceedings of the 31st International Conference on Computational Linguistics. 2025.
  19. MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation. 2025.
  20. Atri, Yash Kumar and Alaa, Ahmed and Hartvigsen, Thomas. Lifelong Model Editing with Graph-Based External Memory. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.690
  21. Kashmira, Savini and Dantanarayana, Jayanaka L. and Brodsky, Joshua and Mahendra, Ashish and Kang, Yiping and Flautner, Krisztian and Tang, Lingjia and Mars, Jason. TOBUGraph: Knowledge Graph-Based Retrieval for Enhanced LLM Performance Beyond RAG. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track. 202...
  22. M-RAG: Reinforcing Large Language Model Performance through Retrieval-Augmented Generation with Multiple Partitions. 2024.
  23. Chen, Qinwen and Tao, Wenbiao and Zhu, Zhiwei and Xi, Mingfan and Guo, Liangzhong and Wang, Yuan and Wang, Wei and Lan, Yunshi. ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Indus...
  24. Sun, Haoran and Zeng, Shaoning and Zhang, Bob. H-MEM: Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.15
  25. Lindenbauer, Tobias and Groh, Georg and Schuetze, Hinrich. From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents. Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025). 2025. doi:10.18653/v1/2025.realm-1.30
  26. CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory. 2024.
  27. MemoryBank: Enhancing Large Language Models with Long-Term Memory. 2023.
  28. MemInsight: Autonomous Memory Augmentation for LLM Agents. 2025.
  29. Structurally Aligned Subtask-Level Memory for Software Engineering Agents. 2026.
  30. SWE-Bench-CL: Continual Learning for Coding Agents. 2025.
  31. Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations. ArXiv.
  32. Improving Code Localization with Repository Memory. ArXiv.
  33. MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences. 2026.
  34. Wang, Ye and Xu, Xinrun and Ding, Zhiming. MindRef: Mimicking Human Memory for Hierarchical Reference Retrieval with Fine-Grained Location Awareness. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025. doi:10.18653/v1/2025.acl-short.67
  35. Yoshida, Ryo and Isono, Shinnosuke and Kajikawa, Kohei and Someya, Taiga and Sugimoto, Yushi and Oseki, Yohei. If Attention Serves as a Cognitive Model of Human Memory Retrieval, What is the Plausible Memory Representation? Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/...
  36. Fu, Jinhu and Wang, Kun and Guo, Chongye and Fang, Junfeng and Zhang, Wentao and Su, Sen. Knowledge Graph-Driven Memory Editing with Directional Interventions. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.261
  37. Hu, Mengkang and Chen, Tianxing and Chen, Qiguang and Mu, Yao and Shao, Wenqi and Luo, Ping. HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1575
  38. General Agentic Memory Via Deep Research. 2025.
  39. LightMem: Lightweight and Efficient Memory-Augmented Generation. 2025.
  40. Ong, Kai Tzu-iunn and Kim, Namyoung and Gwak, Minju and Chae, Hyungjoo and Kwon, Taeyoon and Jo, Yohan and Hwang, Seung-won and Lee, Dongha and Yeo, Jinyoung. Towards Lifelong Dialogue Agents via Timeline-based Memory Management. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Hum...
  41. Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan. 2024.
  42. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.
  43. Group Sequence Policy Optimization. 2025.
  44. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. 2025.
  45. Devstral Model. 2025.
  46. Qwen3 Technical Report. 2025.
  47. Qwen2.5 Technical Report. 2025.
  48. Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...
  49. Claude 4 Sonnet. 2025.