SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Pith reviewed 2026-05-10 17:53 UTC · model grok-4.3
The pith
SkillClaw collects multi-user interaction trajectories and uses an autonomous evolver to refine a shared skill set, enabling cumulative improvements that propagate across all users.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillClaw treats cross-user and over-time interactions as the primary signal for improving skills. It continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system-wide while requiring no additional effort from users.
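Read mechanically, the claim is a loop. A minimal Python sketch of that loop — SkillRepo, evolution_round, and the evolver interface are hypothetical stand-ins, not APIs from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class SkillRepo:
    """Stand-in for the shared repository synchronized across users."""
    skills: dict = field(default_factory=dict)

def evolution_round(repo, trajectories, evolver):
    """One round of the claimed loop: aggregate cross-user trajectories,
    mine recurring patterns, turn each into a skill update (refinement
    or new capability), and sync the shared repository."""
    for pattern in evolver.find_recurring_patterns(trajectories):
        update = evolver.propose_update(pattern, repo.skills)
        if update is not None:
            repo.skills[update["name"]] = update  # refine or extend
    return repo  # every user now sees the improvement
```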
What carries the argument
The autonomous evolver, which aggregates heterogeneous trajectories, detects recurring patterns, and produces targeted refinements to a shared skill repository.
If this is right
- Refinements discovered in one user's workflow become available to every other user through the shared repository.
- Agents accumulate capability over time without requiring retraining or manual skill authoring.
- Performance gains appear even when each user supplies only limited interaction data and feedback.
- The same mechanism supports both refinement of existing skills and addition of entirely new capabilities as patterns emerge.
Where Pith is reading between the lines
- The approach could extend naturally to domains where agents already log trajectories, such as code editing or web navigation, without changing the core loop.
- If trajectory volume grows large, the evolver would need internal checks to prevent error amplification, an issue left open by the current design.
- Over many users the shared repository might converge on skills that generalize better than those crafted for any single user profile.
Load-bearing premise
Diverse user trajectories supply consistent, non-conflicting signals that the evolver can reliably convert into correct skill changes without creating new errors or needing human correction.
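This premise is checkable in code: before accepting an update, verify that per-user signals for a skill actually agree. A hypothetical consistency gate — the trajectory fields and thresholds are illustrative, nothing like this is specified in the abstract:

```python
from collections import defaultdict

def signals_consistent(trajectories, skill_name, min_users=3, max_disagreement=0.25):
    """Gate an automatic skill update on cross-user agreement.
    Each trajectory is assumed to expose user_id, success (bool), and
    skills_referenced; the thresholds are illustrative, not from the paper."""
    per_user = defaultdict(list)
    for t in trajectories:
        if skill_name in t.skills_referenced:
            per_user[t.user_id].append(t.success)
    if len(per_user) < min_users:
        return False  # too little cross-user evidence to act on
    rates = [sum(v) / len(v) for v in per_user.values()]
    # Success for user A but failure for user B shows up as a large spread.
    return max(rates) - min(rates) <= max_disagreement
```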
What would settle it
Measure whether skills produced by the evolver on mixed user trajectories raise or lower success rates on WildClawBench tasks compared with the original static skill set; a drop would indicate the assumption fails.
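Concretely, the settling experiment is a paired comparison of the same tasks under two skill sets. A sketch, where run_task is a hypothetical wrapper around the agent harness:

```python
def ab_compare(tasks, static_skills, evolved_skills, run_task):
    """Paired comparison on the same task list. run_task(task, skills)
    -> bool is a hypothetical wrapper around the agent harness."""
    n = len(tasks)
    static = sum(run_task(t, static_skills) for t in tasks) / n
    evolved = sum(run_task(t, evolved_skills) for t in tasks) / n
    # A negative delta is the falsifier: evolved skills lowered success.
    return {"static": static, "evolved": evolved, "delta": evolved - static}
```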
read the original abstract
Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats cross-user and over-time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system-wide while requiring no additional effort from users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that limited interaction and feedback, it significantly improves the performance of Qwen3-Max in real-world agent scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SkillClaw, a framework for collective skill evolution in multi-user LLM agent ecosystems such as OpenClaw. It continuously aggregates cross-user interaction trajectories, processes them via an autonomous evolver that identifies recurring patterns and produces skill updates (refinements or extensions), maintains the updated skills in a shared repository synchronized across users, and reports that experiments on WildClawBench demonstrate significant performance gains for Qwen3-Max in real-world agent scenarios under limited interaction and feedback.
Significance. If the experimental claims hold with proper controls and the evolver mechanism is sound, the work could meaningfully advance multi-user agent systems by enabling automatic, cumulative skill improvement and cross-user knowledge transfer without requiring additional user effort. This addresses a practical limitation of static skill repositories in deployed agents.
major comments (2)
- [Abstract] The description of the autonomous evolver states that it 'identifies recurring behavioral patterns and translates them into updates' but supplies no mechanism for detecting, filtering, or reconciling conflicting signals across heterogeneous user trajectories (e.g., a skill succeeding for user A but failing for user B due to differing contexts or tool versions). This omission is load-bearing for the central claim of reliable collective evolution, as aggregation without arbitration risks propagating new failure modes system-wide.
- [Abstract] The experimental claim of 'significant' improvement for Qwen3-Max on WildClawBench is asserted without any reported baselines, metrics, error bars, ablation studies, or statistical analysis. This prevents verification of whether the gains are attributable to the collective-evolution mechanism rather than other factors.
minor comments (1)
- [Abstract] The sentence 'experiments on WildClawBench show that limited interaction and feedback, it significantly improves...' is grammatically incomplete and should be rephrased for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the abstract to improve its completeness and support for the central claims.
read point-by-point responses
- Referee: [Abstract] The description of the autonomous evolver states that it 'identifies recurring behavioral patterns and translates them into updates' but supplies no mechanism for detecting, filtering, or reconciling conflicting signals across heterogeneous user trajectories (e.g., a skill succeeding for user A but failing for user B due to differing contexts or tool versions). This omission is load-bearing for the central claim of reliable collective evolution, as aggregation without arbitration risks propagating new failure modes system-wide.
  Authors: We agree that the abstract, as a concise summary, does not detail the evolver's handling of conflicting signals from heterogeneous trajectories. The full manuscript describes the evolver's pattern detection and update logic. We will revise the abstract to briefly outline how recurring patterns are identified and reconciled across users to support the claim of reliable collective evolution. Revision: yes.
- Referee: [Abstract] The experimental claim of 'significant' improvement for Qwen3-Max on WildClawBench is asserted without any reported baselines, metrics, error bars, ablation studies, or statistical analysis. This prevents verification of whether the gains are attributable to the collective-evolution mechanism rather than other factors.
  Authors: The referee is correct that the abstract does not include these experimental details. The manuscript's experimental section reports baseline comparisons, performance metrics, error bars, ablation studies, and statistical analysis. We will revise the abstract to summarize the key quantitative results and controls, enabling better verification of the gains. Revision: yes.
Circularity Check
No circularity in framework description or experimental claims
full rationale
The paper presents SkillClaw as a descriptive framework for aggregating user trajectories into skill updates via an autonomous evolver, with performance gains demonstrated through experiments on WildClawBench. No equations, derivations, fitted parameters, or first-principles predictions appear in the abstract or described structure that could reduce to inputs by construction. Claims of cross-user knowledge transfer rest on system design and empirical results rather than self-referential definitions or self-citation chains. The chain of claims therefore carries independent content, supplied by the experimental validation rather than by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Interactions from different users provide complementary signals about when a skill works or fails
invented entities (1)
- autonomous evolver — no independent evidence
Forward citations
Cited by 10 Pith papers
- Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
  Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
- From Context to Skills: Can Language Models Learn from Context Skillfully?
  Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
- Learning Agentic Policy from Action Guidance
  ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
- Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
  Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
- Agentic-imodels: Evolving agentic interpretability tools via autoresearch
  Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
- Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
  AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.
- AI scientists produce results without reasoning scientifically
  LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
- MMSkills: Towards Multimodal Skills for General Visual Agents
  MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
- Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
  Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to ...
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
  The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
Reference graph
Works this paper leans on
- [1] Goal: The overall task the user wanted to accomplish
- [2] Key trajectory: The step-by-step path the agent took — what it tried, in what order, and why (e.g., “read skill X → attempted approach Y → hit error Z → switched to W”)
- [3] Skill effectiveness: For each skill that was read or injected, did it help or hurt? Was it relevant to the task? Was any guidance missing or wrong?
- [4] Critical turning points: Where things went right or wrong. What caused failures? What enabled successes?
- [5] Tool usage patterns: Which tools were used effectively, which caused errors, and any recurring patterns
- [6] Outcome: Final result quality and what could have gone better. Focus on preserving the sequence of events and causal relationships. This summary will be used to decide whether skills need improvement, so be specific about what skill guidance helped, what was missing, and what was misleading. Output ONLY the plain-text summary — no JSON, no markdown fence...
- [7] improve_skill — The skill content needs targeted edits based on the session evidence (e.g., missing guidance, outdated information, unclear instructions). Produce the updated skill.
- [8] optimize_description — The skill body content is fine, but its description causes it to be matched to wrong tasks. Rewrite ONLY the description for more precise triggering. Do NOT change the body content.
- [9] create_skill — The session evidence reveals a recurring pattern, capability gap, or reusable strategy that does NOT belong in the current skill {skill_name}. A brand-new, separate skill is needed. The current skill remains unchanged. Only choose this when the pattern is clearly distinct from the current skill’s purpose and cannot be addressed by improving...
- [10] skip — The skill is working well enough, or the evidence is too weak or ambiguous to justify changes. No action needed. Editing principles (for improve_skill): • Treat the CURRENT skill as the source of truth, not as a rough draft to be rewritten. • Read the original skill first, then the session evidence. • Default to targeted edits, not rewrites. • If mu...
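Entries [7]–[10] define a four-way action space per skill. A minimal sketch of dispatching over it — the action strings come from the excerpts above, while apply_action and the llm_edit callback are assumptions:

```python
from enum import Enum

class EvolverAction(Enum):
    IMPROVE_SKILL = "improve_skill"
    OPTIMIZE_DESCRIPTION = "optimize_description"
    CREATE_SKILL = "create_skill"
    SKIP = "skip"

def apply_action(action, skill, evidence, llm_edit):
    """Dispatch one evolver decision. `skill` is a dict with 'name',
    'description', and 'body'; llm_edit(instruction, skill, evidence)
    stands in for the model call that writes new text (an assumption)."""
    if action is EvolverAction.IMPROVE_SKILL:
        # Targeted edits to the body based on session evidence.
        skill["body"] = llm_edit("improve body", skill, evidence)
    elif action is EvolverAction.OPTIMIZE_DESCRIPTION:
        # Rewrite ONLY the description; the body content stays fixed.
        skill["description"] = llm_edit("rewrite description", skill, evidence)
    elif action is EvolverAction.CREATE_SKILL:
        # The current skill remains unchanged; emit a brand-new one.
        return {"name": llm_edit("name the new skill", skill, evidence),
                "description": llm_edit("describe the new skill", skill, evidence),
                "body": llm_edit("write the new skill", skill, evidence)}
    # SKIP: evidence too weak or the skill works well; no change.
    return skill
```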
- [11] Read all session files in sessions/
- [12] Analyze the sessions: identify patterns, failures, successes, and which skills (if any) were referenced
- [13] Decide what actions to take for each skill or pattern
- [14] Execute by writing new or updated SKILL.md files in skills/. Work through these steps autonomously. Use your file-reading and writing tools to inspect session data and produce skill files. File access boundary: All your file operations MUST stay within this workspace directory. The workspace contains copies of all data you need — sessions and skills have...
- [15] Start with _summary for a quick overview of each session
- [16] Use _trajectory when you need step-by-step detail (e.g., to identify exactly which tool call failed and why, or to see how a skill was used)
- [17] Use aggregate and _avg_prm for quantitative comparison across sessions
- [18] Use _skills_referenced to group sessions by skill for Step 2. Build a mental model of: • What task was the agent trying to accomplish? • Did the agent succeed or fail? Why? • Which skills were referenced? Did they help or not? • Are there common patterns across sessions? Step 2: Analyze & Aggregate — Group sessions by the skills they referenced: • Skill gro...
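Entries [15]–[18] point the evolver at specific per-session files. A sketch of the grouping step in [18], assuming each session folder contains a _skills_referenced file holding a JSON list of skill names (the file format is a guess):

```python
import json
from pathlib import Path
from collections import defaultdict

def group_sessions_by_skill(sessions_dir):
    """Group session folders by the skills they referenced, using the
    _skills_referenced file each session is assumed to contain."""
    groups = defaultdict(list)
    for session in Path(sessions_dir).iterdir():
        ref_file = session / "_skills_referenced"
        if not ref_file.exists():
            continue
        for skill_name in json.loads(ref_file.read_text()):
            groups[skill_name].append(session.name)
    return groups
```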
- [19] Check whether skills/<skill-name>/history/ exists; if it does, list it to see all existing entries
- [20] If it exists, read every v*.md and v*_evidence.md file in that directory
- [21] If it exists, understand the full change trajectory before deciding your edit. Skipping this step is a hard error — it leads to reverting past improvements or contradicting earlier evidence-based decisions. History directory structure: skills/<skill-name>/history/ — v0_evidence.md ← why this skill was created (for create_skill); v1.md ← SKILL.md snaps...
- [22] Check skills/<skill-name>/history/ to determine the current round N. If no history exists, this is round 1
- [23] Copy the current SKILL.md content verbatim to history/v<N>.md
- [24] Write history/v<N>_evidence.md noting: • Which sessions drove this change (session IDs, task IDs, PRM scores, success/fail counts, tool errors, repeated failure patterns) • What the positive/negative signals were • What previous history entries you read and how they informed this edit • How the old version performed in the available session evidence • Whi...
- [25] Then edit SKILL.md. Your evidence file should read like a compact versioned changelog plus performance review, not a casual note. Make it easy for a future agent to answer: • Why did version v<N> need to change? • What evidence from current sessions supports the next edit? • How did prior versions appear to perform in historical sessions? • Which modifica...
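Entries [19]–[25] describe a snapshot-before-edit protocol. A sketch of one round of it in Python — the paths follow the excerpts, everything else is assumed:

```python
import re
import shutil
from pathlib import Path

def snapshot_before_edit(skill_dir, evidence_text):
    """Entries [22]-[24] as code: determine round N from history/,
    copy SKILL.md verbatim to history/v<N>.md, write the evidence
    file, and only then let the caller edit SKILL.md."""
    history = Path(skill_dir) / "history"
    history.mkdir(exist_ok=True)
    versions = [int(m.group(1)) for p in history.glob("v*.md")
                if (m := re.match(r"v(\d+)\.md$", p.name))]
    n = max(versions, default=0) + 1  # no history means round 1
    shutil.copy(Path(skill_dir) / "SKILL.md", history / f"v{n}.md")
    (history / f"v{n}_evidence.md").write_text(evidence_text)
    return n
```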
- [26] Decision summary • action type • target skill • why change is needed now
- [27] Session evidence • relevant session IDs / task IDs • representative PRM scores or aggregate metrics • recurring tool failures / observations
- [28] Historical comparison • what previous version(s) attempted • whether later evidence suggests those edits improved outcomes, regressed outcomes, or remain inconclusive
- [29] Edit plan • exact parts of the skill being changed • exact parts intentionally preserved
- [30] Open questions • uncertainty that future rounds should monitor. History persistence depends on fresh mode: • In --no-fresh mode, the server refreshes SKILL.md from storage each round but does NOT clear the history/ subdirectory. History therefore accumulates across rounds and serves as a continuous audit trail. • In --fresh mode, the workspace is rebuilt fr...
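Entries [26]–[30] enumerate the sections of the evidence file itself. A hypothetical template that assembles those five sections (headings and markdown layout are assumptions):

```python
def render_evidence(decision, session_evidence, history_notes, edit_plan, open_questions):
    """Assemble a v<N>_evidence.md body from the five sections the
    excerpts name; section order and '##' headers are assumptions."""
    sections = [
        ("Decision summary", decision),            # [26]
        ("Session evidence", session_evidence),    # [27]
        ("Historical comparison", history_notes),  # [28]
        ("Edit plan", edit_plan),                  # [29]
        ("Open questions", open_questions),        # [30]
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)
```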
discussion (0)