Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.
hub
Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution
15 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
MemEvolve jointly evolves agent experiential knowledge and memory architectures via a modular codebase, delivering up to 17% gains on agent benchmarks with cross-task and cross-model generalization.
MetaForge proposes a self-evolving multimodal agent with decide-retrieve-adapt-forge-recycle stages jointly optimized by RL to dynamically manage and create tools, outperforming baselines on 12 benchmarks.
FluxMem evolves memory as a heterogeneous graph via three refinement stages and reports consistent state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.
BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.
DRIVE disentangles reasoning and interaction skills for web agents via dual-level modeling and scene-aware coordination, reaching 52.8% success on WebArena tasks.
MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
Agent libOS is a runtime substrate for capability-controlled self-evolving LLM agents that completed 27 deterministic tasks without unauthorized side effects while maintaining a 7% false-denial rate.
SimpleMem proposes semantic structured compression, online synthesis, and intent-aware retrieval to create efficient lifelong memory for LLM agents, reporting 26.4% F1 gains and up to 30x lower token use on LoCoMo benchmarks.
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.
citing papers explorer
-
MetaForge: A Self-Evolving Multimodal Agent that Retrieves, Adapts, and Forges Tools On Demand
MetaForge proposes a self-evolving multimodal agent with decide-retrieve-adapt-forge-recycle stages jointly optimized by RL to dynamically manage and create tools, outperforming baselines on 12 benchmarks.