LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
Title resolution pending
18 Pith papers cite this work. Polarity classification is still indexing.
years
2026 18verdicts
UNVERDICTED 18representative citing papers
DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data, with evaluations of 13 frontier models revealing tool-use and composition failures
MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency across multiple embodied benchmarks.
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
Metacognitive Consolidation lets LLMs accumulate reusable meta-reasoning skills from past episodes to improve future performance across benchmarks.
SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.
SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees on safety monotonicity, policy convergence, and accountability propagation.
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut costs by 32%.
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.
SafeHarbor uses hierarchical memory with adversarial rule extraction and entropy-driven self-evolution to achieve over 93% refusal on harmful requests while reaching 63.6% benign utility on GPT-4o.
TF-TTCL lets frozen LLMs adapt online by distilling textual rules from contrastive reasoning trajectories generated via multi-agent augmentation and applying them through retrieval-based steering.
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.
citing papers explorer
-
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
-
Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations
DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data, with evaluations of 13 frontier models revealing tool-use and composition failures
-
MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents
MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency across multiple embodied benchmarks.
-
Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
-
Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning
Metacognitive Consolidation lets LLMs accumulate reusable meta-reasoning skills from past episodes to improve future performance across benchmarks.
-
SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
SkillGraph represents skills as nodes in an evolving directed graph with typed dependency edges and updates the graph from RL trajectories to boost compositional task performance.
-
SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution
SkillRAE organizes skills into a graph and compiles compact, grounded contexts for LLM agents, yielding 11.7% gains on SkillsBench over prior RAE methods.
-
Workspace Optimization: How to Train Your Agent
Workspace optimization evolves an agent's external workspace using multi-agent systems, with DreamTeam raising ARC-AGI-3 scores from 36% to 38.4% while using 31% fewer actions.
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.
-
Safe Bilevel Delegation (SBD): A Formal Framework for Runtime Delegation Safety in Multi-Agent Systems
SBD is a bilevel optimization framework that learns context-dependent safety weights for runtime task delegation in hierarchical multi-agent systems, with continuous authority transfer alpha and theoretical guarantees on safety monotonicity, policy convergence, and accountability propagation.
-
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut costs by 32%.
-
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.
-
ReflectCAP: Detailed Image Captioning with Reflective Memory
ReflectCAP distills model-specific hallucination and oversight patterns into Structured Reflection Notes that steer LVLMs toward more factual and complete image captions, reaching the Pareto frontier on factuality-coverage trade-offs.
-
Procedural Knowledge at Scale Improves Reasoning
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.
-
SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
SafeHarbor uses hierarchical memory with adversarial rule extraction and entropy-driven self-evolution to achieve over 93% refusal on harmful requests while reaching 63.6% benign utility on GPT-4o.
-
Training-Free Test-Time Contrastive Learning for Large Language Models
TF-TTCL lets frozen LLMs adapt online by distilling textual rules from contrastive reasoning trajectories generated via multi-agent augmentation and applying them through retrieval-based steering.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
ActionNex: A Virtual Outage Manager for Cloud Computing
ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.