pith. machine review for the scientific record.

arxiv: 2504.07079 · v1 · submitted 2025-04-09 · 💻 cs.AI · cs.CL · cs.CV

Recognition: 3 theorem links · Lean Theorem

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Apurva Gandhi, Boyuan Zheng, Gaowen Liu, Graham Neubig, Jayanth Srinivasa, Michael Y. Fatemi, Xiaolong Jin, Yueqi Song, Yu Gu, Yu Su, Zora Zhiruo Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:30 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV
keywords web agents · self-improvement · skill discovery · API synthesis · WebArena · autonomous agents · skill library · transferable skills

The pith

Web agents autonomously discover skills on websites and distill them into reusable APIs to boost their own performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SkillWeaver lets web agents explore new sites, identify useful procedures, practice executing them, and refine the results into reliable, plug-and-play APIs. These APIs accumulate into an expanding library that the agent can draw on for future tasks without starting from scratch each time. The process runs iteratively, so the agent's skill set grows through repeated cycles of discovery and honing. Experiments on WebArena and real websites show this self-improvement raises success rates, and skills created by stronger agents transfer to improve weaker ones as well.

Core claim

The paper claims that agents can self-improve by autonomously synthesizing reusable skills as APIs. Given a new website, the agent discovers skills, executes them for practice, and distills practice experiences into robust APIs. Iterative exploration continually expands a library of lightweight, plug-and-play APIs that significantly enhance the agent's capabilities, producing relative success rate improvements of 31.8% on WebArena and 39.8% on real-world websites, with APIs from strong agents improving weaker agents by up to 54.3% on WebArena.
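Note that these figures are relative improvements over a baseline success rate, not absolute percentage-point gains. A minimal sketch of the arithmetic, using hypothetical baseline numbers (the paper's exact per-condition rates are not reproduced here):

```python
def relative_gain(base: float, improved: float) -> float:
    """Relative success-rate improvement, expressed as a percentage of the baseline."""
    return (improved - base) / base * 100

# e.g. a hypothetical baseline of 20.0% lifted to 26.36% is a 31.8% relative gain
print(round(relative_gain(20.0, 26.36), 1))
```

The distinction matters when comparing across benchmarks: a 31.8% relative gain over a low baseline can be a small absolute change.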

What carries the argument

SkillWeaver, a skill-centric framework in which agents discover skills on websites, practice them through execution, and distill the results into reusable, lightweight APIs.
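The discover-practice-distill loop can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the names (`Skill`, `discover`, `practice`, `self_improve`), the site string, the practice-trial count, and the reliability threshold are all hypothetical stand-ins, and the simulated success outcomes replace real LLM-driven web execution.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    success_rate: float  # estimated from practice executions

@dataclass
class SkillLibrary:
    skills: dict = field(default_factory=dict)

    def add(self, skill: Skill, min_success: float = 0.8) -> bool:
        # Distill only sufficiently reliable skills into plug-and-play APIs.
        if skill.success_rate >= min_success:
            self.skills[skill.name] = skill
            return True
        return False

def discover(site: str, iteration: int) -> Skill:
    # Stand-in for an LLM proposing a candidate skill on the target site.
    return Skill(name=f"{site}:skill_{iteration}", success_rate=0.0)

def practice(skill: Skill, trials: int = 5) -> Skill:
    # Execute the candidate repeatedly; estimate reliability from outcomes.
    rng = random.Random(skill.name)  # deterministic stand-in for real executions
    successes = sum(rng.random() < 0.85 for _ in range(trials))
    skill.success_rate = successes / trials
    return skill

def self_improve(site: str, iterations: int) -> SkillLibrary:
    # Iterative exploration: each cycle may add one honed skill to the library.
    library = SkillLibrary()
    for i in range(iterations):
        library.add(practice(discover(site, i)))
    return library

lib = self_improve("webarena.shop", iterations=10)
print(len(lib.skills), "skills distilled into the library")
```

The two knobs in the sketch (`min_success`, `trials`) correspond to the free parameters flagged in the ledger below: when a sequence counts as a reusable skill, and how much practice goes into each API.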

If this is right

  • Agents accumulate a growing library of lightweight APIs that can be reused across tasks.
  • Success rates rise on both simulated benchmarks and real websites through self-directed practice.
  • Skills created by capable agents transfer directly to improve the performance of weaker agents.
  • The library expands over repeated cycles of exploration without requiring external supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Skill libraries could be exchanged between entirely separate agent systems to accelerate collective progress.
  • The same distillation process might apply to other structured interfaces such as mobile apps or desktop software.
  • Agents could begin to generalize distilled skills to websites they have never visited during the original synthesis.
  • Over many iterations the reliance on per-task language-model prompting could decrease as the API library matures.

Load-bearing premise

Autonomously discovered skills can be reliably distilled into robust APIs whose performance gains hold across different agents and websites rather than depending on the exact synthesis conditions.

What would settle it

No measurable increase in task success rates when the distilled APIs are transferred to a new agent model or to websites outside the ones used for skill synthesis.

read the original abstract

To survive and thrive in complex environments, humans have evolved sophisticated self-improvement mechanisms through environment exploration, hierarchical abstraction of experiences into reuseable skills, and collaborative construction of an ever-growing skill repertoire. Despite recent advancements, autonomous web agents still lack crucial self-improvement capabilities, struggling with procedural knowledge abstraction, refining skills, and skill composition. In this work, we introduce SkillWeaver, a skill-centric framework enabling agents to self-improve by autonomously synthesizing reusable skills as APIs. Given a new website, the agent autonomously discovers skills, executes them for practice, and distills practice experiences into robust APIs. Iterative exploration continually expands a library of lightweight, plug-and-play APIs, significantly enhancing the agent's capabilities. Experiments on WebArena and real-world websites demonstrate the efficacy of SkillWeaver, achieving relative success rate improvements of 31.8% and 39.8%, respectively. Additionally, APIs synthesized by strong agents substantially enhance weaker agents through transferable skills, yielding improvements of up to 54.3% on WebArena. These results demonstrate the effectiveness of honing diverse website interactions into APIs, which can be seamlessly shared among various web agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SkillWeaver, a framework in which web agents autonomously discover skills on new websites, execute them for practice, and distill the resulting trajectories into lightweight, reusable APIs that are added to a growing library. These APIs are then composed during downstream task execution. Experiments on WebArena and real-world sites report relative success-rate gains of 31.8% and 39.8%, respectively; an additional transfer experiment shows that APIs synthesized by strong agents improve weaker agents by up to 54.3%.

Significance. If the reported gains are shown to arise from genuine, transferable skill abstraction rather than prompting or site-specific artifacts, the work would constitute a meaningful step toward self-improving web agents. The emphasis on distilling practice into plug-and-play APIs addresses a recognized gap in procedural knowledge reuse and cross-agent transfer.

major comments (2)
  1. [Experiments] Experiments section (WebArena and real-world results): the paper does not report an ablation that isolates the contribution of the distilled APIs from the additional prompting and trajectory information present during the synthesis phase. Without such a control, it is impossible to rule out that the 31.8% and 39.8% gains partly reflect stronger prompting rather than the claimed skill abstraction.
  2. [Transfer experiments] Transfer experiment (weaker-agent results): the description does not specify whether the weaker agents receive the same base prompt template and execution budget when the synthesized APIs are provided versus when they are not. This omission leaves open the possibility that the 54.3% improvement is driven by prompt engineering differences rather than the transferability of the distilled skills.
minor comments (1)
  1. [Method] The abstract invokes 'hierarchical abstraction' of experiences into skills, yet the method section provides no explicit description of how hierarchy is constructed or enforced during distillation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer controls in our experiments. We have revised the manuscript to include the requested ablation and to explicitly document the transfer experiment setup. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (WebArena and real-world results): the paper does not report an ablation that isolates the contribution of the distilled APIs from the additional prompting and trajectory information present during the synthesis phase. Without such a control, it is impossible to rule out that the 31.8% and 39.8% gains partly reflect stronger prompting rather than the claimed skill abstraction.

    Authors: We agree that an explicit ablation is necessary to isolate the effect of API distillation from synthesis-phase prompting and trajectories. In the revised manuscript we have added a controlled ablation (new Section 4.3 and Table 3) comparing three conditions on WebArena: (1) full SkillWeaver with distilled APIs, (2) a trajectory-augmented baseline that appends the raw synthesis trajectories to the prompt without distillation, and (3) the original base-prompt baseline. The API-distilled condition yields an additional 12.4% relative success-rate gain over the trajectory-augmented baseline, indicating that the abstraction itself contributes beyond prompting. We report the same pattern on real-world sites. revision: yes

  2. Referee: [Transfer experiments] Transfer experiment (weaker-agent results): the description does not specify whether the weaker agents receive the same base prompt template and execution budget when the synthesized APIs are provided versus when they are not. This omission leaves open the possibility that the 54.3% improvement is driven by prompt engineering differences rather than the transferability of the distilled skills.

    Authors: We apologize for the missing detail. In the original experiments, weaker agents in both conditions used identical base prompt templates and the same execution budget (maximum 20 steps per task, three independent trials). The sole difference was the optional insertion of the API library description into the prompt. We have revised Section 4.4 to state this controlled protocol explicitly. Consequently, the reported 54.3% improvement is attributable to the transferable skills rather than prompt differences. revision: yes

Circularity Check

0 steps flagged

No significant circularity in SkillWeaver's empirical self-improvement claims

full rationale

The paper's core claims rest on experimental measurements of success-rate gains (31.8% on WebArena, 39.8% on real-world sites, 54.3% transfer) obtained by comparing agents with and without the autonomously synthesized API library on held-out tasks. No equations, fitted parameters renamed as predictions, or self-citations are invoked to derive these numbers; the improvements are reported as direct outcomes of the described discovery-practice-distillation loop evaluated externally. The framework is therefore self-contained against the stated benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The framework rests on the assumption that website interactions can be cleanly abstracted into stable, composable APIs without site-specific engineering. No explicit free parameters are named in the abstract, but the skill-discovery and distillation steps implicitly require several thresholds and prompting choices.

free parameters (2)
  • skill discovery thresholds
    Parameters controlling when an interaction sequence is deemed a reusable skill; values not stated in abstract.
  • practice iteration count
    Number of executions used to refine each API; chosen to produce the reported robustness.
axioms (1)
  • domain assumption: Website interactions admit stable, reusable abstractions that remain valid across sessions and minor site changes.
    Invoked when the paper claims distilled APIs are plug-and-play.
invented entities (1)
  • Skill API (no independent evidence)
    purpose: Lightweight callable wrapper that encapsulates a discovered interaction sequence for later reuse and composition.
    New abstraction introduced by the framework; no independent evidence of stability outside the reported experiments.
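A minimal sketch of what such a Skill API could look like: an async wrapper around one interaction sequence that takes only a Playwright-style `page` object, using accessibility-tree selectors (`get_by_placeholder`, `get_by_role`) rather than raw CSS selectors. The function name `search_repository` and the "Search" button are hypothetical; the `FakePage` stub stands in for a real `playwright.async_api.Page` so the sketch runs without a browser.

```python
import asyncio

async def search_repository(page, query: str) -> None:
    # One distilled interaction sequence, packaged as a lightweight callable.
    await page.get_by_placeholder("Search GitLab").fill(query)
    await page.get_by_role("button", name="Search").click()

class FakePage:
    """Stand-in for a Playwright Page, recording calls instead of driving a browser."""
    def __init__(self):
        self.calls = []
    def get_by_placeholder(self, text):
        return _FakeLocator(self.calls, f"placeholder={text}")
    def get_by_role(self, role, name=None):
        return _FakeLocator(self.calls, f"role={role},name={name}")

class _FakeLocator:
    def __init__(self, calls, selector):
        self.calls, self.selector = calls, selector
    async def fill(self, value):
        self.calls.append(("fill", self.selector, value))
    async def click(self):
        self.calls.append(("click", self.selector))

page = FakePage()
asyncio.run(search_repository(page, "accessible-html-content-patterns"))
print(page.calls)
```

Whether such wrappers stay stable across sessions and minor site redesigns is exactly the domain assumption the ledger flags.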

pith-pipeline@v0.9.0 · 5545 in / 1365 out tokens · 33948 ms · 2026-05-14T19:30:48.404714+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  2. SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...

  3. OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on f...

  4. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  5. SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

  6. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  7. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  8. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  9. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  10. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

  11. Skill-R1: Agent Skill Evolution via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Skill-R1 applies bi-level group-relative policy optimization to evolve skills recurrently from verified outcomes, yielding gains over baselines on multi-step tasks.

  12. SkillGen: Verified Inference-Time Agent Skill Synthesis

    cs.LG 2026-05 unverdicted novelty 6.0

    SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.

  13. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 6.0

    SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

  14. SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOPE maintains semantic commitments via structured specifications and conditional skill orchestration, achieving 0.60 EGIP on the new Gen-Arena benchmark while outperforming baselines on WISE-V and MindBench.

  15. ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...

  16. Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...

  17. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    cs.AI 2026-04 conditional novelty 6.0

    The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.

  18. Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agentic Tasks

    cs.AI 2026-05 unverdicted novelty 5.0

    NSI lifts interaction traces into logic programs to enable few-shot skill induction and adaptation for long-horizon agentic tasks.

  19. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  20. ANX: Protocol-First Design for AI Agent Interaction with a Supporting 3EX Decoupled Architecture

    cs.AI 2026-04 unverdicted novelty 5.0

    ANX introduces a protocol-first design with 3EX architecture that cuts token consumption by 47-66% and execution time by 58% versus prior methods in form-filling tests.

  21. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 21 Pith papers · 1 internal anchor
