pith. machine review for the scientific record.

arxiv: 2605.12493 · v1 · submitted 2026-05-12 · 💻 cs.CL

Recognition: no theorem link

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Asmi Kawatkar, Bryan Kwan, Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng, Zixiang Ji

Pith reviewed 2026-05-13 03:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords long-term memory · AI agents · web environments · memory benchmark · AgentRunbook · context gathering · environment experience · coding agent

The pith

A coding-agent memory method called AgentRunbook-C reaches 72.5 percent accuracy on a benchmark of 451 questions that test whether agents can internalize environment-specific experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LongMemEval-V2, a benchmark of 451 questions that directly measures how well memory systems help agents acquire the knowledge needed to function as experienced colleagues in customized web environments. Questions target five abilities: recalling static interface states, tracking dynamic changes, knowing standard workflows, recognizing environment-specific pitfalls, and maintaining premise awareness. Histories can reach 500 trajectories and 115 million tokens. Two methods are proposed: AgentRunbook-R, an efficient RAG system that maintains separate knowledge pools for raw state observations, events, and strategy notes; and AgentRunbook-C, which stores full trajectories as files and lets a coding agent retrieve evidence inside an augmented sandbox. AgentRunbook-C delivers the highest accuracy while still leaving clear headroom and incurring high latency.
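As a concrete illustration of the knowledge-pool design, here is a minimal sketch of a multi-pool memory in the spirit of AgentRunbook-R. The routing rules, step field names, and toy lexical scorer are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

def overlap(q: str, c: str) -> float:
    """Toy lexical overlap score; a real system would use a dense retriever."""
    qs, cs = set(q.lower().split()), set(c.lower().split())
    return len(qs & cs) / (len(qs) or 1)

@dataclass
class KnowledgePools:
    states: list = field(default_factory=list)      # raw interface observations
    events: list = field(default_factory=list)      # dynamic state changes
    strategies: list = field(default_factory=list)  # distilled workflow notes

    def ingest(self, trajectory):
        """Route each trajectory step into the appropriate pool (hypothetical keys)."""
        for step in trajectory:
            self.states.append(step["observation"])
            if step.get("change_summary"):
                self.events.append(step["change_summary"])
            if step.get("lesson"):
                self.strategies.append(step["lesson"])

    def gather(self, question, top_k=5):
        """Return compact evidence: the top-k entries across all pools."""
        candidates = self.states + self.events + self.strategies
        return sorted(candidates, key=lambda c: overlap(question, c), reverse=True)[:top_k]
```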

Core claim

LongMemEval-V2 uses a context-gathering formulation in which memory systems ingest long history trajectories and return compact evidence for question answering. AgentRunbook-C stores trajectories as files and invokes a coding agent to gather evidence, achieving 72.5 percent average accuracy across the five memory abilities. This outperforms the strongest RAG baseline at 48.5 percent and an off-the-shelf coding-agent baseline at 69.3 percent, advancing the accuracy-latency frontier even though latency remains high and further gains are needed.
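The context-gathering formulation can be pictured as a two-phase loop: offline ingestion of the full history, then per-question evidence gathering for a downstream QA step. The sketch below is a hedged reconstruction; the memory interface, the QA stub, and the grader are hypothetical, and LME-V2's actual scoring protocol is defined in the paper.

```python
def answer_with_evidence(question, evidence):
    """Stub for the downstream QA model (e.g. an LLM call over the evidence)."""
    return ""

def grade(prediction, gold):
    """Stub grader; LME-V2's actual per-question scoring is defined in the paper."""
    return prediction.strip().lower() == gold.strip().lower()

def evaluate(memory_system, history_trajectories, questions):
    # Offline phase: ingest the history (up to ~500 trajectories, ~115M tokens).
    for trajectory in history_trajectories:
        memory_system.ingest(trajectory)
    # Query phase: gather compact evidence per question, then answer and grade.
    correct = 0
    for q in questions:  # 451 manually curated questions in LME-V2
        evidence = memory_system.gather(q["text"])
        prediction = answer_with_evidence(q["text"], evidence)
        correct += grade(prediction, q["answer"])
    return correct / len(questions)  # average accuracy, the headline metric
```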

What carries the argument

AgentRunbook-C stores trajectories as files and invokes a coding agent to gather evidence inside an augmented sandbox; it replaces pure retrieval with executable code that can inspect and manipulate the stored history.
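A hedged sketch of that file-plus-coding-agent loop follows, assuming trajectories are serialized as JSON files and a command-proposal policy is supplied by the caller. The command loop, output truncation, and round limit are illustrative assumptions, not the paper's exact protocol.

```python
import json
import pathlib
import subprocess

def store_trajectories(trajectories, root="runbook_memory"):
    """Persist each trajectory as a JSON file the coding agent can inspect."""
    root_dir = pathlib.Path(root)
    root_dir.mkdir(exist_ok=True)
    for i, traj in enumerate(trajectories):
        (root_dir / f"trajectory_{i:04d}.json").write_text(json.dumps(traj))
    return root_dir

def gather_evidence(question, root_dir, propose_command, max_rounds=8):
    """Let a coding agent iteratively run read-only inspection commands
    (e.g. grep, head, jq) over the stored files until it decides it has
    enough evidence; `propose_command` stands in for the agent's policy."""
    transcript = []
    for _ in range(max_rounds):
        cmd = propose_command(question, transcript)
        if cmd is None:  # agent signals it has gathered enough evidence
            break
        out = subprocess.run(cmd, cwd=root_dir, shell=True,
                             capture_output=True, text=True, timeout=60)
        transcript.append((cmd, out.stdout[:4000]))  # truncate long outputs
    return transcript  # compact evidence for the downstream QA step
```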

If this is right

  • Direct measurement of internalized experience becomes possible without relying solely on downstream task success.
  • Coding-based retrieval can scale to histories of hundreds of trajectories and over 100 million tokens more effectively than standard RAG.
  • Accuracy gains from AgentRunbook-C come with substantially higher latency, establishing a new point on the accuracy-latency trade-off.
  • Current memory methods still leave substantial room for improvement before agents match experienced human colleagues on environment-specific knowledge.
  • The five abilities provide a structured target for developing future long-term memory systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the benchmark proves predictive, it could guide iterative improvement of memory systems before they are deployed in production environments.
  • The file-plus-coding approach may transfer to other long-horizon agent settings where histories contain structured but noisy interaction traces.
  • Hybrid systems that combine lightweight retrieval with selective coding queries could reduce latency while preserving most of the accuracy gain; a minimal routing sketch follows this list.
  • Wider adoption of such benchmarks would shift evaluation focus from short traces to cumulative experience over hundreds of interactions.
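One way to realize that hybrid is a confidence-gated router: serve most questions from the fast RAG path and escalate to the slower coding agent only when retrieval confidence is low. This is a speculative sketch under assumed interfaces (`gather_with_confidence`, `coding_agent.gather`), not anything the paper evaluates.

```python
def hybrid_gather(question, rag_memory, coding_agent, confidence_threshold=0.6):
    """Fast path: RAG evidence when retrieval is confident.
    Slow path: escalate to the coding agent otherwise.
    The threshold and confidence heuristic are illustrative assumptions."""
    evidence, confidence = rag_memory.gather_with_confidence(question)
    if confidence >= confidence_threshold:
        return evidence
    return coding_agent.gather(question)
```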

Load-bearing premise

The 451 manually curated questions accurately and comprehensively measure the five core memory abilities that experienced colleagues need in customized web environments.

What would settle it

An experiment in which agents that score high on LongMemEval-V2 still fail to apply the same recalled knowledge during extended live interactions with the actual web environment.

Figures

Figures reproduced from arXiv: 2605.12493 by Asmi Kawatkar, Bryan Kwan, Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng, Zixiang Ji.

Figure 1: Examples of LongMemEval-V2 questions. We display examples from the WorkArena …
Figure 2: LME-V2 questions cover diverse source domains, question types, and formats.
Figure 3: Basic statistics of history haystacks in LME-V2. The small and medium tiers cover a large …
Figure 4: Pilot studies on LME-V2 non-abstention questions. (a) Frontier LLMs perform poorly …
Figure 5: An illustration of the AgentRunbook memory modules. (a) AgentRunbook-R digests …
Figure 6: AgentRunbook improves the query accuracy-latency frontier of memory modules.
Figure 7: Error composition for the main methods on LME-V2-Small and LME-V2-Medium. Each …
Figure 8: Mean number of command executions per query, divided by command type.
Figure 9: Distribution of grouped command classes during the first eight command rounds.
read the original abstract

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories containing up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding agent based methods have high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LongMemEval-V2 (LME-V2), a benchmark of 451 manually curated questions paired with long agent history trajectories (up to 500 trajectories and 115M tokens) to directly evaluate whether memory systems enable agents to internalize experience as knowledgeable colleagues in customized web environments. It defines five core abilities (static state recall, dynamic state tracking, workflow knowledge, environment gotchas, premise awareness) and proposes two methods: AgentRunbook-R (RAG-based with knowledge pools for states, events, and strategies) and AgentRunbook-C (trajectory storage as files with coding-agent evidence gathering in an augmented sandbox). Experiments report AgentRunbook-C at 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and off-the-shelf coding agent (69.3%), while noting latency costs and room for improvement.

Significance. If the benchmark questions are shown to probe the targeted abilities validly and without bias, LME-V2 supplies a needed direct testbed for long-term memory internalization in specialized environments, moving beyond user-history or task-success proxies. The empirical comparison credibly positions coding-augmented retrieval as advancing the accuracy-latency frontier relative to pure RAG, providing concrete baselines and a challenging scale (115M tokens) for future work on agent memory.

major comments (2)
  1. [Abstract and benchmark construction] The central performance claims (AgentRunbook-C at 72.5% vs. 48.5% RAG and 69.3% coding baseline) rest on the premise that the 451 manually curated questions comprehensively and without bias measure the five stated memory abilities. No inter-annotator agreement, coverage audit against the five abilities, adversarial question filtering, or external validation is reported, leaving open the possibility that the deltas reflect curation artifacts rather than genuine memory internalization.
  2. [Experiments] The headline accuracy figures are given as point estimates without error bars, statistical significance tests, per-ability breakdowns, or controls for question difficulty variance across environments; these are required to substantiate that AgentRunbook-C meaningfully advances the Pareto frontier.
minor comments (1)
  1. [Abstract] The abstract notes high latency for coding-agent methods but provides no quantitative latency numbers or trade-off curves, which would strengthen the Pareto-frontier claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of benchmark validity and experimental reporting that we address point-by-point below. We have revised the manuscript accordingly where feasible.

read point-by-point responses
  1. Referee: [Abstract and benchmark construction] The central performance claims (AgentRunbook-C at 72.5% vs. 48.5% RAG and 69.3% coding baseline) rest on the premise that the 451 manually curated questions comprehensively and without bias measure the five stated memory abilities. No inter-annotator agreement, coverage audit against the five abilities, adversarial question filtering, or external validation is reported, leaving open the possibility that the deltas reflect curation artifacts rather than genuine memory internalization.

    Authors: We appreciate the referee's emphasis on rigorous benchmark validation. The 451 questions were constructed by the authors according to explicit criteria for each of the five abilities (detailed in Section 3.2 of the manuscript), with each question designed to require internalization of specific history elements that general knowledge or surface-level retrieval cannot provide. In the revised version, we will expand the benchmark construction section to include: (1) the full curation guidelines used by the authors, (2) a coverage table showing the number of questions per ability, and (3) representative examples for each ability. While formal inter-annotator agreement statistics were not computed during initial curation, all questions underwent cross-review by multiple authors to ensure alignment with the target abilities; we will report this process and acknowledge the absence of IAA as a limitation. We will also add a discussion of potential curation artifacts and note that the distinct performance patterns across methods (e.g., AgentRunbook-C excelling on dynamic tracking while RAG lags on workflow knowledge) provide evidence that the questions discriminate based on memory mechanisms rather than artifacts alone. External validation against other benchmarks is planned for future work but is outside the scope of the current submission. revision: partial

  2. Referee: [Experiments] The headline accuracy figures are given as point estimates without error bars, statistical significance tests, per-ability breakdowns, or controls for question difficulty variance across environments; these are required to substantiate that AgentRunbook-C meaningfully advances the Pareto frontier.

    Authors: We agree that the experimental section would benefit from greater statistical detail. The reported figures (72.5%, 48.5%, 69.3%) represent mean accuracy across all 451 questions. In the revised manuscript, we will: (1) add bootstrap-derived 95% confidence intervals as error bars for the primary results, (2) include a per-ability performance breakdown (new Table or Figure) showing accuracy for static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness separately, and (3) provide an analysis of performance stratified by trajectory length and environment to control for difficulty variance. We will also report statistical significance of the differences using appropriate tests (e.g., McNemar's test for paired accuracy comparisons). These additions will more clearly demonstrate that AgentRunbook-C advances the accuracy-latency frontier beyond the baselines. revision: yes
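For concreteness, the promised analyses can be sketched as follows, assuming `a` and `b` are per-question 0/1 correctness arrays over the 451 questions for two methods. The bootstrap confidence interval and exact McNemar test are standard constructions, not code from the paper.

```python
import numpy as np
from scipy.stats import binomtest

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap 95% CI for mean accuracy over a 0/1 numpy array of questions."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    means = correct[idx].mean(axis=1)  # resampled accuracies
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def mcnemar_exact(a, b):
    """Exact McNemar test for paired accuracy: two-sided binomial test on
    the discordant pairs (questions where exactly one method is correct)."""
    only_a = int(np.sum((a == 1) & (b == 0)))  # A right, B wrong
    only_b = int(np.sum((a == 0) & (b == 1)))  # B right, A wrong
    n = only_a + only_b
    return binomtest(min(only_a, only_b), n, 0.5).pvalue if n else 1.0
```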

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation with no derivations or self-referential reductions.

full rationale

The paper introduces LME-V2 as a new manually-curated benchmark of 451 questions across five memory abilities, pairs them with history trajectories, and reports empirical accuracy numbers for proposed memory methods (AgentRunbook-C at 72.5%) versus baselines. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the derivation chain; the central claims are direct experimental comparisons on the introduced testbed. The validity of the question set is an external concern but does not create circularity within the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that human-curated questions faithfully capture the targeted memory abilities and that the reported accuracy differences reflect genuine method improvements rather than artifacts of question selection.

axioms (1)
  • domain assumption: Manually curated questions can accurately and comprehensively assess the five specified memory abilities for web agents.
    The benchmark construction and performance claims depend on the quality and coverage of the 451 questions created by the authors.

pith-pipeline@v0.9.0 · 5640 in / 1422 out tokens · 141687 ms · 2026-05-13T03:43:34.481035+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 9 internal anchors
