{"total":12,"items":[{"citing_arxiv_id":"2605.20833","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MemGym: a Long-Horizon Memory Environment for LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-20T07:25:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17829","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Interactive Evaluation Requires a Design Science","primary_cat":"cs.AI","submitted_at":"2026-05-18T04:03:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14678","ref_index":11,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$\\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows","primary_cat":"cs.AI","submitted_at":"2026-05-14T10:47:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session 
continuity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14498","ref_index":10,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations","primary_cat":"cs.CL","submitted_at":"2026-05-14T07:38:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12493","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues","primary_cat":"cs.CL","submitted_at":"2026-05-12T17:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11814","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:06:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures 
in current agent architectures for personalized healthcare.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Existing benchmarks are largely either dialogue-centric or long-context. Dialogue-centric benchmarks, such as LoCoMo [21], LongMemEval [34], MemoryAgentBench [14], HaluMem [4], and AMA-Bench [42], evaluate conversational memory, memory operations, or agent trajectories under general-domain settings. Related benchmarks including PersonaMem [15], MemBench [30], MemoryArena [12], Memora [31], and RealTalk [16] further study dynamic profiling, continual memory, and personalized agents. By contrast, long-context or interactive environments such as RULER [13], LongBench [2], WebArena [44], and ALFWorld [28] focus on static context processing or non-medical task environments. Overall, existing benchmarks do not target personalized medical"},{"citing_arxiv_id":"2605.10870","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory","primary_cat":"cs.AI","submitted_at":"2026-05-11T17:20:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"URL https://arxiv.org/abs/2405.14831. [10] Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, Jiaxin Pei, Julian McAuley, Yejin Choi, and Alex Pentland. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks, 2026. URL https://arxiv.org/abs/2602.16313. 
[11] Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in LLM agents via incremental multi-turn interactions. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=DT7JyQC3MR. [12] Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu"},{"citing_arxiv_id":"2605.09134","ref_index":1,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models","primary_cat":"cs.AI","submitted_at":"2026-05-09T19:31:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"With allocation $a_t$, the policy gradient becomes: $\\hat{\\nabla}_\\theta J = \\sum_{t=1}^{T} r_t \\nabla_\\theta \\log \\pi_\\theta(y_t \\mid x, y_{<t}) = R \\sum_{t=1}^{T} a_t \\nabla_\\theta \\log \\pi_\\theta(y_t \\mid x, y_{<t})$ (17). If allocation concentrates on $S$, i.e., $a_t \\approx 0$ for $t \\notin S$ and $\\sum_{t \\in S} a_t \\approx 1$: $\\hat{\\nabla}_\\theta J \\approx R \\sum_{t \\in S} a_t \\nabla_\\theta \\log \\pi_\\theta(y_t \\mid x, y_{<t})$ (18). Following the same analysis as Proposition A.1: $\\mathrm{Var}[\\hat{\\nabla}_\\theta J] = O(|S| \\cdot \\mathrm{Var}[R]) = O(k \\cdot \\mathrm{Var}[R])$ (19). Remark A.3. For code patches, edit lines typically constitute a small fraction of the total output. If a 100-token patch contains 20 tokens of actual edits (with the rest being headers, context, and formatting), a perfect allocator achieves 5× variance reduction. A.2. 
Optimal Allocation and the Role of $R_{\\text{line}}$ The theoretical analysis above assumes access to an oracle allocation."},{"citing_arxiv_id":"2605.07313","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory","primary_cat":"cs.AI","submitted_at":"2026-05-08T06:22:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03354","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis","primary_cat":"cs.AI","submitted_at":"2026-05-05T04:17:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17886","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Latent Preference Modeling for Cross-Session Personalized Tool Calling","primary_cat":"cs.CL","submitted_at":"2026-04-20T06:57:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool 
calling accuracy with 1.24% of full-history token cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.23231","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments","primary_cat":"cs.AI","submitted_at":"2026-03-24T14:04:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}