{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:RZ7367QXH25YUOKYP43PJPHK2O","short_pith_number":"pith:RZ7367QX","schema_version":"1.0","canonical_sha256":"8e7fbf7e173ebb8a39587f36f4bcead394069ddcc3d7107926de73ed48948a0c","source":{"kind":"arxiv","id":"2511.20857","version":1},"attestation_state":"computed","paper":{"title":"Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory","license":"http://creativecommons.org/licenses/by/4.0/","headline":"LLM agents achieve continual improvement on streaming tasks by using the ReMem pipeline to integrate reasoning, actions, and memory updates.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Benjamin Coleman, Chi Wang, Derek Zhiyuan Cheng, Ed H. Chi, Fernando Pereira, Jingrui He, Mengting Ai, Noveen Sachdeva, Shuo Chen, Tianxin Wei, Wang-Cheng Kang, Xuying Ning, Yuanchen Bei, Yunzhe Li, Zhankui He","submitted_at":"2025-11-25T21:08:07Z","abstract_excerpt":"Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet ofte"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2511.20857","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2025-11-25T21:08:07Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"0d818bd916402e6653c573575779d522ab3b6ccd03e7edcf7f20c510c04d1e7e","abstract_canon_sha256":"91510619ee77a1e8bb1dbe58ace5f7ed30ee2190b6f2fa018506d5ca6f3c0544"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:19.897856Z","signature_b64":"N2Dh92hhcQCcpTGW7kj9eE9Vs7ANziGrEmsbXs95gEmQ/Jye9lIwXnMY7KHA4bzOMSUcZdRzSwWwjGBX/R4sDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"8e7fbf7e173ebb8a39587f36f4bcead394069ddcc3d7107926de73ed48948a0c","last_reissued_at":"2026-05-17T23:39:19.896925Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:19.896925Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory","license":"http://creativecommons.org/licenses/by/4.0/","headline":"LLM agents achieve continual improvement on streaming tasks by using the ReMem pipeline to integrate reasoning, actions, and memory updates.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Benjamin Coleman, Chi Wang, Derek Zhiyuan Cheng, Ed H. Chi, Fernando Pereira, Jingrui He, Mengting Ai, Noveen Sachdeva, Shuo Chen, Tianxin Wei, Wang-Cheng Kang, Xuying Ning, Yuanchen Bei, Yunzhe Li, Zhankui He","submitted_at":"2025-11-25T21:08:07Z","abstract_excerpt":"Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet ofte"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"ReMem, an action-think-memory refine pipeline, tightly integrates reasoning, task actions, and memory updates to achieve continual improvement in LLM agents on streaming tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the chosen sequential task streams and the implemented memory modules faithfully capture the dynamics of real-world continuous interactions where memory evolution is required, without hidden implementation biases affecting the comparisons.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LLM agents achieve continual improvement on streaming tasks by using the ReMem pipeline to integrate reasoning, actions, and memory updates.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0ea7554e0285eee97172779bf5efa5883de6325821b2fa6e9aa4987c291f81aa"},"source":{"id":"2511.20857","kind":"arxiv","version":1},"verdict":{"id":"af7e0477-7a68-4bdb-a637-7436043acc6f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T23:08:31.892284Z","strongest_claim":"ReMem, an action-think-memory refine pipeline, tightly integrates reasoning, task actions, and memory updates to achieve continual improvement in LLM agents on streaming tasks.","one_line_summary":"Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the chosen sequential task streams and the implemented memory modules faithfully capture the dynamics of real-world continuous interactions where memory evolution is required, without hidden implementation biases affecting the comparisons.","pith_extraction_headline":"LLM agents achieve continual improvement on streaming tasks by using the ReMem pipeline to integrate reasoning, actions, and memory updates."},"references":{"count":299,"sample":[{"doi":"","year":2009,"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","ref_index":1,"cited_arxiv_id":"2009.03300","is_internal_anchor":true},{"doi":"","year":null,"title":"International Conference on Learning Representations (ICLR) , year=","work_id":"1852f1a8-2303-4108-a8a5-0562f7716a9f","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Advances in Neural Information Processing Systems (NeurIPS) , year=","work_id":"0cb97455-c4bf-4962-a363-31b7fd9dc41b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Advances in Neural Information Processing Systems (NeurIPS) , year=","work_id":"fda20f90-227f-46ae-9d68-9c841c704211","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"International Conference on Machine Learning (ICML) , year=","work_id":"98f812e7-24ab-4f7b-a3df-b17d84a7b2e4","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":299,"snapshot_sha256":"053a4e2e41893da11f3db45f136b055dc708cc66c3dd725bf6e47a3ff4a38303","internal_anchors":36},"formal_canon":{"evidence_count":1,"snapshot_sha256":"ab70ab64680b2b0ec733a08592583ffb6ace64537130a4bf27dc69f776abcc09"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2511.20857","created_at":"2026-05-17T23:39:19.897095+00:00"},{"alias_kind":"arxiv_version","alias_value":"2511.20857v1","created_at":"2026-05-17T23:39:19.897095+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2511.20857","created_at":"2026-05-17T23:39:19.897095+00:00"},{"alias_kind":"pith_short_12","alias_value":"RZ7367QXH25Y","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"RZ7367QXH25YUOKY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"RZ7367QX","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":54,"internal_anchor_count":54,"sample":[{"citing_arxiv_id":"2606.22385","citing_title":"MetaPS: Adaptive Programmatic Strategy Selection for Market Agents","ref_index":132,"is_internal_anchor":true},{"citing_arxiv_id":"2606.20475","citing_title":"Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2606.31612","citing_title":"What Memory Do GUI Agents Really Need? From Passive Records to Active Task-Driving States","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2606.09365","citing_title":"Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2606.06960","citing_title":"Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents under Low-Repetition and Implicit-Reward Environments","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2606.06448","citing_title":"Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2606.05513","citing_title":"EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2606.05008","citing_title":"M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2606.02461","citing_title":"AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2606.31612","citing_title":"What Memory Do GUI Agents Really Need? From Passive Records to Active Task-Driving States","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2606.31121","citing_title":"The Past Is Prologue: A Plug-in Controller for Selective Updates in Sequentially Evolving LLM Memory","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08704","citing_title":"AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07358","citing_title":"A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications","ref_index":140,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10913","citing_title":"Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2606.20625","citing_title":"AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.28773","citing_title":"Rethinking Memory as Continuously Evolving Connectivity","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2606.01223","citing_title":"Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2602.06470","citing_title":"Improve Large Language Model Systems with User Logs","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08216","citing_title":"MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20616","citing_title":"Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07358","citing_title":"A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications","ref_index":148,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18421","citing_title":"EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18747","citing_title":"Code as Agent Harness","ref_index":202,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17721","citing_title":"EXG: Self-Evolving Agents with Experience Graphs","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15384","citing_title":"Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory","ref_index":35,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/RZ7367QXH25YUOKYP43PJPHK2O","json":"https://pith.science/pith/RZ7367QXH25YUOKYP43PJPHK2O.json","graph_json":"https://pith.science/api/pith-number/RZ7367QXH25YUOKYP43PJPHK2O/graph.json","events_json":"https://pith.science/api/pith-number/RZ7367QXH25YUOKYP43PJPHK2O/events.json","paper":"https://pith.science/paper/RZ7367QX"},"agent_actions":{"view_html":"https://pith.science/pith/RZ7367QXH25YUOKYP43PJPHK2O","download_json":"https://pith.science/pith/RZ7367QXH25YUOKYP43PJPHK2O.json","view_paper":"https://pith.science/paper/RZ7367QX","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2511.20857&json=true","fetch_graph":"https://pith.science/api/pith-number/RZ7367QXH25YUOKYP43PJPHK2O/graph.json","fetch_events":"https://pith.science/api/pith-number/RZ7367QXH25YUOKYP43PJPHK2O/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/RZ7367QXH25YUOKYP43PJPHK2O/action/timestamp_anchor","attest_storage":"https://pith.science/pith/RZ7367QXH25YUOKYP43PJPHK2O/action/storage_attestation","attest_author":"https://pith.science/pith/RZ7367QXH25YUOKYP43PJPHK2O/action/author_attestation","sign_citation":"https://pith.science/pith/RZ7367QXH25YUOKYP43PJPHK2O/action/citation_signature","submit_replication":"https://pith.science/pith/RZ7367QXH25YUOKYP43PJPHK2O/action/replication_record"}},"created_at":"2026-05-17T23:39:19.897095+00:00","updated_at":"2026-05-17T23:39:19.897095+00:00"}