LifelongAgentBench: Evaluating LLM agents as lifelong learners. arXiv preprint arXiv:2505.11942

LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners · 2025 · arXiv 2505.11942

20 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · dataset 1 · method 1

citation-polarity summary

background 1 · use dataset 1 · use method 1

representative citing papers

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

CL-Bench is the first expert-validated benchmark for continual learning in frontier LLMs across six real-world domains, showing limited gains and that naive in-context learning outperforms dedicated memory systems.

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

M³Eval is a new cognitively-grounded benchmark that evaluates memory dimensions in multi-modal video models and reports consistent model weaknesses in disentanglement, interference, spatial-temporal grounding, and symbolic recall.

From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

MemoPilot trains memory updates for LLM agents via multi-turn GRPO on RPS and poker, achieving top Elo scores and outperforming baselines including DeepSeek-V3.2.

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

Bayesian-Agent maintains feature-conditioned categorical posteriors over skills/SOPs from verified trajectories and maps them to actions that improve benchmark scores on SOP-Bench, LifelongAgentBench, and RealFin-Bench.

Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

LifeSkill is a verifier-guided skill learning plus online internalization framework that raises average performance by 7 points over lifelong agent baselines on LifelongAgentBench.

Mem-$\pi$: Adaptive Memory through Learning When and What to Generate

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.

GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

cs.CL · 2026-04-18 · unverdicted · novelty 6.0

GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.

LLMs Corrupt Your Documents When You Delegate

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

LLMs, including frontier models, corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, and agentic tools do not mitigate the issue.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

cs.CL · 2025-11-25 · unverdicted · novelty 6.0

Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.

Always-On Agents: A Survey of Persistent Memory, State, and Governance in LLM Agents

cs.MA · 2026-06-29 · unverdicted · novelty 5.0

Survey mapping persistent state in LLM agents along six axes and proposing the AOEP-v0 protocol to evaluate governance and recovery obligations.

Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents under Low-Repetition and Implicit-Reward Environments

cs.CL · 2026-06-05 · unverdicted · novelty 5.0

Introduces FinEvolveBench and Tree-of-Experience showing structured experience management improves LLM agent performance over baselines in low-repetition implicit-reward settings.

AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

cs.AI · 2026-06-01 · unverdicted · novelty 5.0

AgentCL constructs controlled task streams with intentional reusability and introduces MemProbe to evaluate non-parametric memory designs for continual learning in language agents across coding, research, and reasoning tasks.

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

cs.AI · 2026-05-26 · unverdicted · novelty 5.0

MUSE-Autoskill introduces a skill-centric framework for self-evolving LLM agents through a unified lifecycle of skill creation, memory, management, evaluation, and refinement.

AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents

cs.AI · 2026-05-26 · unverdicted · novelty 5.0

AlphaMemo equips LLM alpha-mining agents with AST-diff motif memory, residual learning, and asymmetric veto control to improve out-of-sample factor discovery on CSI 500 and S&P 500.

MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

cs.CL · 2026-05-18 · unverdicted · novelty 5.0

MINTEval benchmark shows current memory-augmented systems average 27.9% accuracy on long-horizon interference tasks, limited by retrieval and memory construction with degradation from intervening updates.

Learning CLI Agents with Structured Action Credit under Selective Observation

cs.AI · 2026-05-08 · unverdicted · novelty 5.0

CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

cs.AI · 2026-05-07 · conditional · novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems

cs.MA · 2026-03-27 · unverdicted · novelty 5.0

LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.

Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment

cs.LG · 2026-05-31 · unverdicted · novelty 4.0

The paper calls for life cycle assessment to capture embodied hardware costs and full pipeline operational costs in AI development and deployment.

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

cs.AI · 2025-07-28 · accept · novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

citing papers explorer

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

cs.AI · 2026-05-07 · conditional · none · ref 46

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
