SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song · 2025 · cs.AI · arXiv 2504.07079

Canonical reference. 100% of citing Pith papers cite this work as background.

48 Pith papers citing it

Background 100% of classified citations

abstract

To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reusable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural knowledge abstraction, refining skills, and skill composition. In this work, we introduce SkillWeaver, a skill-centric framework enabling agents to self-improve by autonomously synthesizing reusable skills as APIs. Given a new website, the agent autonomously discovers skills, executes them for practice, and distills practice experiences into robust APIs. Iterative exploration continually expands a library of lightweight, plug-and-play APIs, significantly enhancing the agent's capabilities. Experiments on WebArena and real-world websites demonstrate the efficacy of SkillWeaver, achieving relative success rate improvements of 31.8% and 39.8%, respectively. Additionally, APIs synthesized by strong agents substantially enhance weaker agents through transferable skills, yielding improvements of up to 54.3% on WebArena. These results demonstrate the effectiveness of honing diverse website interactions into APIs, which can be seamlessly shared among various web agents.
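
The discover, practice, and distill loop described in the abstract can be illustrated with a toy sketch. This is not SkillWeaver's implementation: every name here (`Skill`, `SkillLibrary`, `practice`, `explore`, `toy_propose`) is hypothetical, the proposer is a hard-coded stand-in for an LLM agent, and "distillation" is reduced to keeping only skills that pass every practice attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Skill:
    """A candidate skill: a name plus a callable that attempts it (hypothetical)."""
    name: str
    run: Callable[[], bool]  # returns True when a practice attempt succeeds


@dataclass
class SkillLibrary:
    """Holds skills that survived practice, keyed by name."""
    apis: Dict[str, Callable[[], bool]] = field(default_factory=dict)

    def add(self, skill: Skill) -> None:
        self.apis[skill.name] = skill.run


def practice(skill: Skill, attempts: int = 3) -> float:
    """Execute a skill several times and return its success rate."""
    successes = sum(1 for _ in range(attempts) if skill.run())
    return successes / attempts


def explore(propose: Callable[[SkillLibrary], List[Skill]],
            library: SkillLibrary,
            iterations: int = 2,
            min_success: float = 1.0) -> SkillLibrary:
    """Iteratively propose, practice, and keep skills that meet the threshold."""
    for _ in range(iterations):
        for skill in propose(library):
            if skill.name in library.apis:
                continue  # already in the library
            if practice(skill) >= min_success:
                library.add(skill)
    return library


def toy_propose(library: SkillLibrary) -> List[Skill]:
    """Stand-in for an LLM proposer: offers a composite skill once its part exists."""
    proposals = [Skill("search_product", lambda: True),
                 Skill("flaky_checkout", lambda: False)]
    if "search_product" in library.apis:
        proposals.append(Skill("search_and_open_first_result", lambda: True))
    return proposals


library = explore(toy_propose, SkillLibrary())
print(sorted(library.apis))
# -> ['search_and_open_first_result', 'search_product']
```

In this toy run the second iteration can propose a composite skill only because the first iteration added its building block, which mirrors the idea of a library that grows through repeated exploration. The unreliable skill never enters the library.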

citation-role summary

background 8

citation-polarity summary

background 8

representative citing papers

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.

Co-Evolving Skill Generation and Policy Optimization

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.

SkillHarness: Harnessing Safe Skills for Computer-Use Agents

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

SkillHarness models skill learning as a safety-constrained process with skill boundaries and selective reuse, reducing unsafe skills by 57.1% and improving stability in dynamic settings.

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

PANDO introduces an online skill-distillation method with a structured library, reflection, demotion, routing, compression, and cache-aware prompting that reaches 58.3% success on 910 VisualWebArena tasks using 58-61% fewer tokens than prior methods.

SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.

MMSkills: Towards Multimodal Skills for General Visual Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 3 refs

MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

cs.CR · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

cs.AI · 2026-04-19 · unverdicted · novelty 7.0

SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

cs.AI · 2025-09-08 · conditional · novelty 7.0

MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.

Analytic Concept-Centric Memory for Agentic Embodied Manipulation

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

Proposes a structured concept-centric memory system for embodied agents that connects object, scene, transition, and skill memories to support coarse-to-fine retrieval and improve task performance over baselines.

Joint Learning of Experiential Rules and Policies for Large Language Model Agents

cs.AI · 2026-06-25 · unverdicted · novelty 6.0

JERP jointly updates experiential rules and policies for LLM agents from shared trajectories, keeping rules aligned with the policy and yielding gains on AlfWorld and WebShop.

MetaPS: Adaptive Programmatic Strategy Selection for Market Agents

cs.AI · 2026-06-21 · unverdicted · novelty 6.0

MetaPS trains models via simulation rollouts to select from programmatic strategy libraries for market agents, yielding better performance than fixed or direct LLM baselines across model sizes.

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

cs.AI · 2026-06-16 · unverdicted · novelty 6.0

SkillMigrator reduces LLM-action counts by 8-10% on WebArena and Mind2Web by transferring web skills via layout-matched transferable interaction patterns.

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

Bayesian-Agent maintains feature-conditioned categorical posteriors over skills/SOPs from verified trajectories and maps them to actions that improve benchmark scores on SOP-Bench, Lifelong AgentBench, and RealFin-Bench.

SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

cs.AI · 2026-06-06 · unverdicted · novelty 6.0

SKILL.nb uses selective formalization and gate-conditioned execution in auditable notebooks to improve durability of agent workflows, achieving 53.7% success on WebArena-Verified with 91.7% retention across re-executions.

Inducing Reasoning Primitives from Agent Traces

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

Reasoning Primitive Induction mines ReAct traces to build a library of typed pseudo-tools that, when composed in a standard ReAct loop, outperform the original agent by 22-44 percentage points on five subtasks.

Harnessing Agent Skills: Architectural Patterns and a Reference Architecture for Skill-Mediated LLM Agents

cs.AI · 2026-05-29 · unverdicted · novelty 6.0

Catalogs ten patterns and synthesizes a four-layer reference architecture for skill harnessing in LLM agents, evaluated via cross-instantiation on eight systems.

unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning

cs.CR · 2026-05-27 · unverdicted · novelty 6.0

unix-ctf procedurally generates 656 Unix CTF tasks across 155 techniques; fine-tuning Qwen3-8B on them raises solve rate from 11.6% to 43.6% on a 15-skill holdout and yields +33 pp in Forensics on InterCode-CTF.

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

cs.CL · 2026-05-22 · unverdicted · novelty 6.0 · 2 refs

OpenSkillEval dynamically builds task instances across five application domains to evaluate 30 open skills with over 600 tests, finding that skill use depends heavily on model and framework and that many popular skills do not beat base agents.

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

EvoMemBench evaluates 15 memory methods for LLM agents and finds long-context baselines competitive with no single memory approach working consistently across settings.

DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Co-Evolving Skill Generation and Policy Optimization

cs.CL · 2026-06-07 · unverdicted · none · ref 24 · internal anchor

Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

cs.CL · 2026-06-06 · unverdicted · none · ref 19 · internal anchor

Bayesian-Agent maintains feature-conditioned categorical posteriors over skills/SOPs from verified trajectories and maps them to actions that improve benchmark scores on SOP-Bench, Lifelong AgentBench, and RealFin-Bench.

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

cs.CL · 2026-05-22 · unverdicted · none · ref 25 · 2 links · internal anchor

OpenSkillEval dynamically builds task instances across five application domains to evaluate 30 open skills with over 600 tests, finding that skill use depends heavily on model and framework and that many popular skills do not beat base agents.

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

cs.CL · 2026-05-18 · unverdicted · none · ref 44 · internal anchor

EvoMemBench evaluates 15 memory methods for LLM agents and finds long-context baselines competitive with no single memory approach working consistently across settings.

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

cs.CL · 2026-06-10 · unverdicted · none · ref 236 · internal anchor

This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
