{"paper":{"title":"Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A new benchmark shows current LLM memory agents fall short on four core competencies from cognitive science.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Julian McAuley, Yuanzhe Hu, Yu Wang","submitted_at":"2025-07-07T17:59:54Z","abstract_excerpt":"Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component-memory, encompassing how agents memorize, update, and retrieve long-term information-is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. E"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the four competencies drawn from memory science are the complete and essential set for memory agents, and that transforming static long-context datasets into incremental multi-turn interactions preserves the original properties needed to measure those competencies.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods fall short.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A new benchmark shows current LLM memory agents fall short on four core competencies from cognitive science.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7c860269ede4ba0ecc9ecdc67ba5b6d03c98d14e1599fd412abd0a680bae8a4c"},"source":{"id":"2507.05257","kind":"arxiv","version":3},"verdict":{"id":"a84f121f-9135-4fcb-995a-aab96810c675","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T21:17:36.128863Z","strongest_claim":"Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.","one_line_summary":"MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods fall short.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the four competencies drawn from memory science are the complete and essential set for memory agents, and that transforming static long-context datasets into incremental multi-turn interactions preserves the original properties needed to measure those competencies.","pith_extraction_headline":"A new benchmark shows current LLM memory agents fall short on four core competencies from cognitive science."},"references":{"count":63,"sample":[{"doi":"","year":null,"title":"LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding","work_id":"ba7831c4-9427-4e0e-a5c1-4e98511f4b53","ref_index":1,"cited_arxiv_id":"2308.14508","is_internal_anchor":true},{"doi":"","year":null,"title":"LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks","work_id":"9fac250b-241e-41ce-9177-469deaf03040","ref_index":2,"cited_arxiv_id":"2412.15204","is_internal_anchor":true},{"doi":"","year":null,"title":"arXiv preprint arXiv:2405.00200 , year=","work_id":"16af75a7-d946-4ab5-a85c-1a423767113b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.18653/v1/2020.nlp4convai-1.5","year":2020,"title":"Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory","work_id":"a5aed26c-a248-48b6-a59e-f7693fcb180a","ref_index":4,"cited_arxiv_id":"2504.19413","is_internal_anchor":true},{"doi":"","year":2026,"title":"11 Published as a conference paper at ICLR 2026 DeepMind. Gemini pro,","work_id":"80ebdb1d-bb0a-4129-b956-a46c2f2142ee","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":63,"snapshot_sha256":"ac5979b07b927edb71b16f08b83e41bbc5f15aec38695c164a4aad9230367b3a","internal_anchors":15},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}