{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:IPIU45KI5Y5THLIDAAODW5VMD2","short_pith_number":"pith:IPIU45KI","schema_version":"1.0","canonical_sha256":"43d14e7548ee3b33ad03001c3b76ac1e8913fe2aa03e3d2c7b29b25761351ca7","source":{"kind":"arxiv","id":"2507.05257","version":3},"attestation_state":"computed","paper":{"title":"Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A new benchmark shows current LLM memory agents fall short on four core competencies from cognitive science.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Julian McAuley, Yuanzhe Hu, Yu Wang","submitted_at":"2025-07-07T17:59:54Z","abstract_excerpt":"Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component-memory, encompassing how agents memorize, update, and retrieve long-term information-is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. E"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2507.05257","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2025-07-07T17:59:54Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"a140b706cb55ff33ce6a93ec468408a531bbaab950f09d3b67bb9b418811dac5","abstract_canon_sha256":"d5575774f38f003f816bf127894567467356de2653671037fbbfcbeba78a730e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.540022Z","signature_b64":"YfjwPrwB/SwqxHDX3ir2LwvTl4pHUHb/Xe8bMovIiWUcHZbD8lYJhQtMkW/sm0pulsrmpCl+jpO98DDUfMxhBA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"43d14e7548ee3b33ad03001c3b76ac1e8913fe2aa03e3d2c7b29b25761351ca7","last_reissued_at":"2026-05-17T23:38:46.539410Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.539410Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A new benchmark shows current LLM memory agents fall short on four core competencies from cognitive science.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Julian McAuley, Yuanzhe Hu, Yu Wang","submitted_at":"2025-07-07T17:59:54Z","abstract_excerpt":"Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component-memory, encompassing how agents memorize, update, and retrieve long-term information-is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. E"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the four competencies drawn from memory science are the complete and essential set for memory agents, and that transforming static long-context datasets into incremental multi-turn interactions preserves the original properties needed to measure those competencies.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods fall short.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A new benchmark shows current LLM memory agents fall short on four core competencies from cognitive science.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7c860269ede4ba0ecc9ecdc67ba5b6d03c98d14e1599fd412abd0a680bae8a4c"},"source":{"id":"2507.05257","kind":"arxiv","version":3},"verdict":{"id":"a84f121f-9135-4fcb-995a-aab96810c675","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T21:17:36.128863Z","strongest_claim":"Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.","one_line_summary":"MemoryAgentBench is a new multi-turn benchmark assessing four memory competencies in LLM agents—accurate retrieval, test-time learning, long-range understanding, and selective forgetting—showing that existing methods fall short.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the four competencies drawn from memory science are the complete and essential set for memory agents, and that transforming static long-context datasets into incremental multi-turn interactions preserves the original properties needed to measure those competencies.","pith_extraction_headline":"A new benchmark shows current LLM memory agents fall short on four core competencies from cognitive science."},"references":{"count":63,"sample":[{"doi":"","year":null,"title":"LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding","work_id":"ba7831c4-9427-4e0e-a5c1-4e98511f4b53","ref_index":1,"cited_arxiv_id":"2308.14508","is_internal_anchor":true},{"doi":"","year":null,"title":"LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks","work_id":"9fac250b-241e-41ce-9177-469deaf03040","ref_index":2,"cited_arxiv_id":"2412.15204","is_internal_anchor":true},{"doi":"","year":null,"title":"arXiv preprint arXiv:2405.00200 , year=","work_id":"16af75a7-d946-4ab5-a85c-1a423767113b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.18653/v1/2020.nlp4convai-1.5","year":2020,"title":"Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory","work_id":"a5aed26c-a248-48b6-a59e-f7693fcb180a","ref_index":4,"cited_arxiv_id":"2504.19413","is_internal_anchor":true},{"doi":"","year":2026,"title":"11 Published as a conference paper at ICLR 2026 DeepMind. Gemini pro,","work_id":"80ebdb1d-bb0a-4129-b956-a46c2f2142ee","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":63,"snapshot_sha256":"ac5979b07b927edb71b16f08b83e41bbc5f15aec38695c164a4aad9230367b3a","internal_anchors":15},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2507.05257","created_at":"2026-05-17T23:38:46.539508+00:00"},{"alias_kind":"arxiv_version","alias_value":"2507.05257v3","created_at":"2026-05-17T23:38:46.539508+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2507.05257","created_at":"2026-05-17T23:38:46.539508+00:00"},{"alias_kind":"pith_short_12","alias_value":"IPIU45KI5Y5T","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"IPIU45KI5Y5THLID","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"IPIU45KI","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":28,"internal_anchor_count":28,"sample":[{"citing_arxiv_id":"2605.20833","citing_title":"MemGym: a Long-Horizon Memory Environment for LLM Agents","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20926","citing_title":"MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14498","citing_title":"GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15710","citing_title":"SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18421","citing_title":"EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17830","citing_title":"Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17894","citing_title":"Evaluating Cognitive Age Alignment in Interactive AI Agents","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17894","citing_title":"Evaluating Cognitive Age Alignment in Interactive AI Agents","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01970","citing_title":"Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2508.03341","citing_title":"What Deserves Memory: Adaptive Memory Distillation for LLM Agents","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14498","citing_title":"GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2511.20857","citing_title":"Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory","ref_index":138,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02522","citing_title":"Opal: Private Memory for Personal AI","ref_index":96,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11814","citing_title":"MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12061","citing_title":"SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory","ref_index":252,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12357","citing_title":"$\\delta$-mem: Efficient Online Memory for Large Language Models","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12493","citing_title":"LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09874","citing_title":"EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06365","citing_title":"From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05583","citing_title":"Belief Memory: Agent Memory Under Partial Observability","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01970","citing_title":"Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22085","citing_title":"Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19457","citing_title":"Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12007","citing_title":"When to Forget: A Memory Governance Primitive","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07313","citing_title":"When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory","ref_index":54,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/IPIU45KI5Y5THLIDAAODW5VMD2","json":"https://pith.science/pith/IPIU45KI5Y5THLIDAAODW5VMD2.json","graph_json":"https://pith.science/api/pith-number/IPIU45KI5Y5THLIDAAODW5VMD2/graph.json","events_json":"https://pith.science/api/pith-number/IPIU45KI5Y5THLIDAAODW5VMD2/events.json","paper":"https://pith.science/paper/IPIU45KI"},"agent_actions":{"view_html":"https://pith.science/pith/IPIU45KI5Y5THLIDAAODW5VMD2","download_json":"https://pith.science/pith/IPIU45KI5Y5THLIDAAODW5VMD2.json","view_paper":"https://pith.science/paper/IPIU45KI","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2507.05257&json=true","fetch_graph":"https://pith.science/api/pith-number/IPIU45KI5Y5THLIDAAODW5VMD2/graph.json","fetch_events":"https://pith.science/api/pith-number/IPIU45KI5Y5THLIDAAODW5VMD2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/IPIU45KI5Y5THLIDAAODW5VMD2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/IPIU45KI5Y5THLIDAAODW5VMD2/action/storage_attestation","attest_author":"https://pith.science/pith/IPIU45KI5Y5THLIDAAODW5VMD2/action/author_attestation","sign_citation":"https://pith.science/pith/IPIU45KI5Y5THLIDAAODW5VMD2/action/citation_signature","submit_replication":"https://pith.science/pith/IPIU45KI5Y5THLIDAAODW5VMD2/action/replication_record"}},"created_at":"2026-05-17T23:38:46.539508+00:00","updated_at":"2026-05-17T23:38:46.539508+00:00"}