SkillOpt: Executive Strategy for Self-Evolving Agent Skills

11 Pith papers cite this work. Polarity classification is still indexing.

abstract

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt
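The accept-only-if-better loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the SkillOpt implementation: the names `Edit`, `apply_edit`, `train_skill`, `toy_propose`, and `toy_validate` are all hypothetical, the "textual learning-rate budget" is assumed here to be a simple cap on the number of edits per step, and the epoch-wise slow/meta update is omitted. The optimizer model and the held-out validation set are replaced by scripted toy functions so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edit:
    """One bounded edit on the skill document (hypothetical representation)."""
    op: str          # "add", "delete", or "replace"
    index: int       # line index in the skill document
    text: str = ""   # new text for "add" / "replace"


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill document with one edit applied."""
    lines = list(skill)
    if edit.op == "add":
        lines.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del lines[edit.index]
    elif edit.op == "replace":
        lines[edit.index] = edit.text
    else:
        raise ValueError(f"unknown edit op: {edit.op}")
    return lines


def train_skill(
    skill: List[str],
    propose: Callable[[List[str], List[Tuple[Edit, ...]]], List[Edit]],
    validate: Callable[[List[str]], float],
    steps: int,
    edit_budget: int,
) -> Tuple[List[str], float, List[Tuple[Edit, ...]]]:
    """Keep a candidate skill only when it strictly beats the validation score."""
    best, best_score = list(skill), validate(skill)
    rejected: List[Tuple[Edit, ...]] = []  # buffer of edit sets that did not help
    for _ in range(steps):
        # Assumption: the budget caps how many edits one step may apply.
        edits = propose(best, rejected)[:edit_budget]
        if not edits:
            break
        candidate = best
        for edit in edits:
            candidate = apply_edit(candidate, edit)
        score = validate(candidate)
        if score > best_score:          # strict improvement required
            best, best_score = candidate, score
        else:
            rejected.append(tuple(edits))
    return best, best_score, rejected


# Toy stand-ins for the optimizer model and the held-out validation score.
CANDIDATES = [
    [Edit("add", 1, "Be verbose.")],
    [Edit("add", 1, "Check units.")],
    [Edit("replace", 0, "Solve, then verify the answer.")],
]


def toy_propose(skill: List[str], rejected: List[Tuple[Edit, ...]]) -> List[Edit]:
    """Return the first scripted edit set that is neither rejected nor applied."""
    for edits in CANDIDATES:
        if tuple(edits) in rejected or all(e.text in skill for e in edits):
            continue
        return edits
    return []


def toy_validate(skill: List[str]) -> float:
    """Score a skill by how many target keywords it mentions."""
    text = "\n".join(skill)
    return sum(keyword in text for keyword in ("units", "verify"))
```

Calling `train_skill(["Solve the problem."], toy_propose, toy_validate, steps=5, edit_budget=2)` rejects the first scripted edit because it does not raise the score, then accepts the next two. Because acceptance requires strict improvement, the returned skill can never score below the starting one on the validation function, which is the property the abstract emphasizes.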


representative citing papers

SoftSkill: Behavioral Compression for Contextual Adaptation

cs.AI · 2026-06-18 · unverdicted · novelty 6.0

SoftSkill compresses agent skills into length-32 continuous prefixes via next-token training of soft deltas, yielding gains of 5.2 to 12.5 points over SkillOpt on SearchQA and LiveMath while using far fewer tokens.

A Framework for Evaluating Agentic Skills at Scale

cs.SE · 2026-06-16 · unverdicted · novelty 6.0

The authors developed an evaluation framework that generates 1000 tasks from 500 real-world agent skills, applies instruction-following and goal-completion rubrics, and benchmarks 19 proprietary and open-source model configurations.

LemonHarness Technical Report

cs.AI · 2026-06-23 · unverdicted · novelty 5.0

LemonHarness constrains LLM agent state changes to a defined workspace, supplies callable rule knowledge, and adds time awareness, yielding 84.49% and 86.52% accuracy on Terminal-Bench 2.0 with two GPT-5 backbones.

Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

cs.LG · 2026-06-18 · unverdicted · novelty 5.0

MAA formalizes alignability and comparability conditions and uses differential signals, EMA accumulation, and semantic identity merging to enable cross-batch operation-level evidence accumulation, outperforming batch-level baselines in 14 of 16 settings while matching online methods.
