MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs
Pith reviewed 2026-05-13 06:58 UTC · model grok-4.3
The pith
By propagating Q-learning credit along provenance DAGs, MemQ enables LLM agents to learn from memory dependency chains rather than isolated experiences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that applying TD(λ) eligibility traces to memory Q-values over a provenance DAG improves agent success rates. The DAG records which memories were used to create new ones, letting structural proximity guide credit assignment with decay (γλ)^d. This replaces independent per-memory updates and yields the strongest results in both generalization and online-learning evaluations, on tasks ranging from OS interaction to expert-level QA.
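To make the mechanism concrete, here is a minimal sketch of depth-decayed credit propagation over a provenance DAG. The names (propagate_credit, parents, retrieved) and the exact update rule are assumptions for illustration, a plausible reading of the abstract rather than MemQ's actual implementation.

from collections import deque

def propagate_credit(q, parents, retrieved, td_error,
                     gamma=0.95, lam=0.8, alpha=0.1):
    # q:         dict memory_id -> Q-value
    # parents:   dict memory_id -> ids retrieved when that memory was created
    # retrieved: memory ids retrieved in the current step (depth 0)
    best_depth = {}                        # shallowest depth at which each memory is reached
    queue = deque((m, 0) for m in retrieved)
    while queue:
        mem, d = queue.popleft()
        if mem in best_depth and best_depth[mem] <= d:
            continue                       # already credited at a shallower depth
        best_depth[mem] = d
        for p in parents.get(mem, ()):
            queue.append((p, d + 1))
    for mem, d in best_depth.items():      # credit decays as (gamma * lam) ** d
        q[mem] = q.get(mem, 0.0) + alpha * (gamma * lam) ** d * td_error
    return q

With γ = 0.95 and λ = 0.8, a depth-1 ancestor receives a factor 0.76 of the credit given to a directly retrieved memory, and a depth-3 ancestor about 0.44, which is consistent with gains concentrating on tasks that produce deep, relevant provenance chains.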
What carries the argument
The provenance DAG recording dependency chains between memories, combined with TD(λ) eligibility traces applied to memory Q-values for credit propagation based on structural depth.
Load-bearing premise
The provenance DAG accurately captures the dependency chains through which memories enable the creation of future memories, making structural proximity a valid substitute for temporal credit assignment.
What would settle it
An ablation that randomizes the edges of the provenance DAG while keeping the same memories and retrievals: if the DAG structure is what drives the improvement, the performance gains should vanish under randomization.
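A minimal sketch of that control, assuming provenance is stored as a parent map (the function name randomize_provenance is hypothetical): shuffle the edge targets while preserving each memory's edge count, so only the structure changes.

import random

def randomize_provenance(parents, seed=0):
    # Replace each memory's true parents with the same number of
    # randomly chosen memories: the memories, retrievals, and edge
    # counts are preserved, but the dependency structure is destroyed.
    # (Randomization can introduce cycles; a traversal that tracks
    # visited depth, like the BFS sketched above, remains safe.)
    rng = random.Random(seed)
    nodes = list(parents)
    return {m: rng.sample(nodes, len(ps)) for m, ps in parents.items()}

If the DAG structure is load-bearing, MemQ's advantage over independent updates should collapse under this randomization while retrieval quality stays fixed.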
Original abstract
Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($\lambda$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(\gamma\lambda)^d$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7 pp) and smallest on single-step classification (+0.77 pp) where single-step updates already suffice. We further study how $\gamma$ and $\lambda$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao-ai/MemQ.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MemQ, a method that augments episodic memory in LLM agents by applying TD(λ) eligibility traces to memory Q-values, with credit propagated backward along a provenance DAG that records retrieval dependencies at memory creation time. Credit decays as (γλ)^d where d is DAG depth, replacing temporal distance. The setting is formalized as an Exogenous-Context MDP (EC-MDP) whose factored transitions separate the exogenous task stream from the endogenous memory store. On six benchmarks (OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, expert QA), MemQ reports the highest success rates in both generalization and runtime learning evaluations, with the largest gains (+5.7 pp) on multi-step tasks producing deep relevant chains and the smallest (+0.77 pp) on single-step tasks.
Significance. If the reported ordering and differential gains hold under controlled conditions, the work supplies a concrete mechanism for credit assignment across memory dependency chains rather than treating memories independently. The alignment between gain magnitude and provenance depth provides direct empirical support for the structural-propagation hypothesis. Public code release is a clear strength that enables verification and extension.
major comments (1)
- [§4, Experiments] The central claim attributes performance gains to TD(λ) over the provenance DAG, yet the manuscript does not report whether the DAG construction procedure (including retrieval logging) is applied identically to all baselines or only to MemQ; if the latter, the comparison confounds the credit-propagation mechanism with differences in memory-graph construction.
minor comments (3)
- [Abstract, §3.1] The EC-MDP factorization is presented as decoupling exogenous and endogenous components, but the text does not explicitly state whether memory retrieval can alter the exogenous task stream within a single step; a one-sentence clarification would remove the ambiguity.
- [§5, Parameter study] The interaction plots for γ and λ are useful, but the manuscript should add a short table reporting the exact (γ, λ) pairs used for the main results on each benchmark to aid reproducibility.
- [§4.2, figures] Several success-rate tables and figure captions list absolute percentages without standard deviations or the number of runs; adding both would strengthen the reported ordering.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, recognition of the credit-assignment contribution, and recommendation for minor revision. We address the single major comment below.
Point-by-point responses
- Referee: [§4, Experiments] The central claim attributes performance gains to TD(λ) over the provenance DAG, yet the manuscript does not report whether the DAG construction procedure (including retrieval logging) is applied identically to all baselines or only to MemQ; if the latter, the comparison confounds the credit-propagation mechanism with differences in memory-graph construction.
Authors: The provenance DAG construction (including retrieval logging at memory creation) is an integral and MemQ-specific component; it is not applied to any baseline. All methods share an identical episodic memory buffer, embedding-based retrieval interface, and memory-creation pipeline. The only difference is that MemQ additionally records provenance edges and applies TD(λ) updates along them, while baselines follow their original independent-memory update rules (standard TD(0) or no eligibility traces). This isolates the structural credit-propagation mechanism. We will revise §4 and the experimental appendix to explicitly document the shared memory interface, confirm that retrieval logging occurs uniformly, and state that the DAG is MemQ-only. If desired, we can also add a controlled ablation in which baselines receive a dummy DAG without credit propagation. revision: yes
Circularity Check
No significant circularity
full rationale
The paper introduces MemQ by formalizing an Exogenous-Context MDP and applying standard TD(λ) eligibility traces over a provenance DAG as modeling choices, then reports empirical success rates on six benchmarks. No equation or claim reduces by construction to a fitted parameter renamed as a prediction, no self-citation chain is invoked to justify uniqueness or load-bearing assumptions, and the differential gains are presented as direct experimental outcomes rather than tautological redefinitions. The derivation chain is self-contained, and the empirical claims are tested against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- γ (discount factor)
- λ (eligibility-trace decay)
axioms (1)
- Domain assumption: the setting can be formalized as an Exogenous-Context MDP whose factored transition decouples the exogenous task stream from the endogenous memory store (one plausible factorization is sketched below).
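As a reading aid, one factorization consistent with this assumption; the notation ($P_{\mathrm{exo}}$, $P_{\mathrm{mem}}$, exogenous task context $x$, memory store $M$, action $a$) is ours, and the paper's exact formalization may differ:

$P\bigl((x', M') \mid (x, M), a\bigr) = P_{\mathrm{exo}}(x' \mid x) \cdot P_{\mathrm{mem}}(M' \mid x, M, a)$

Here the first factor is the exogenous task stream, which the agent cannot influence, and the second is the endogenous memory-store update.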
invented entities (2)
- Provenance DAG: no independent evidence
- Exogenous-Context MDP (EC-MDP): no independent evidence