Expel: LLM agents are experiential learners
7 Pith papers cite this work.
2026: 7 representative citing papers
citing papers explorer
- LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
  LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
- AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
  AnomalyClaw turns single-step VLM anomaly judgments into a multi-round, tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
- MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs
  MemQ integrates Q-learning with eligibility traces over provenance DAGs to assign credit in self-evolving memory agents, outperforming baselines on all six tested agent benchmarks, with the largest gains on deep multi-step tasks.
- The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
  An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks such as secret leaks.
- SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
  SkillLens organizes skills into policy, strategy, procedure, and primitive layers, retrieves them via a degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.
- SkillOS: Learning Skill Curation for Self-Evolving Agents
  SkillOS is an RL recipe that learns to curate reusable skills for self-evolving LLM agents, outperforming memory-free and memory-based baselines while generalizing across executors and domains.
- PrismAgent: Illuminating Harm in Memes via a Zero-Shot Interpretable Multi-Agent Framework
  PrismAgent deploys four specialized LLM agents in sequence to analyze meme intent, gather context, make preliminary judgments, and deliver a final harm verdict, outperforming prior zero-shot methods on three public datasets.
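Among the techniques above, MemQ's summary names a classical building block: Q-learning with eligibility traces, which spreads a temporal-difference error back over recently visited state-action pairs. The sketch below is a minimal, self-contained illustration of that standard algorithm (naive Q(lambda) with accumulating traces) on a hypothetical 3-step toy chain; it is not MemQ's implementation, and the environment, constants, and reward here are all invented for illustration.

```python
import random

random.seed(0)

# Toy setup (all values hypothetical, chosen only to illustrate the algorithm).
ALPHA, GAMMA, LAM, EPS = 0.1, 0.9, 0.8, 0.1
STATES, ACTIONS = range(4), range(2)  # state 3 is terminal

def step(s, a):
    """Deterministic toy chain: action 1 advances, action 0 stays put.
    Reward 1.0 only on reaching the terminal state."""
    s2 = s + 1 if a == 1 else s
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

for episode in range(200):
    e = {k: 0.0 for k in Q}  # eligibility traces, reset each episode
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection.
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        delta = r + GAMMA * best_next - Q[(s, a)]  # TD error
        e[(s, a)] += 1.0                           # accumulate trace
        for k in Q:
            Q[k] += ALPHA * delta * e[k]           # credit via traces
            e[k] *= GAMMA * LAM                    # decay traces
        s = s2

# After training, the advancing action should dominate in every
# non-terminal state, since staying forfeits the discounted reward.
```

The traces are what make credit assignment multi-step: a single TD error at the goal updates every state-action pair still carrying a nonzero trace, with exponentially decaying weight, rather than only the most recent one.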