SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use

Dengbo He; Jiayin Zhu; Kelong Mao; Simiu Gu; Sulong Xu; Yudong Guo; Yutao Yue

arxiv: 2607.01874 · v1 · pith:X2ZV5PK5new · submitted 2026-07-02 · 💻 cs.AI · cs.CL

Jiayin Zhu, Kelong Mao, Yudong Guo, Dengbo He, Sulong Xu, Simiu Gu, Yutao Yue

Pith reviewed 2026-07-03 14:16 UTC · model grok-4.3

keywords agentic skills · self-evolving rubrics · process evaluation · LLM agents · trajectory supervision · skill selection · skill composition · process vs outcome

The pith

SkillCoach derives self-evolving rubrics from rollouts to evaluate agent skill-use on process dimensions distinct from final success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SkillCoach creates rubrics that assess how agents select, follow, compose, and reflect on skills during task execution. These rubrics come from actual agent rollouts and evolve over time. The approach treats final task success as a separate signal from the quality of the skill-use process. This distinction helps reveal failures that would otherwise go unnoticed when only checking the end result. The rubrics also guide the selection of better training examples for improving agent performance.

Core claim

SkillCoach derives skill-grounded process rubrics from real rollouts and evaluates trajectories along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It keeps the external verifier as a separate outcome signal, allowing process quality to be distinguished from accidental task success. The evolved rubrics further serve as process supervision for selecting high-quality training trajectories.

What carries the argument

Self-evolving rubrics derived from rollouts that score trajectories on four process dimensions and supply process supervision signals separate from outcome verification.

If this is right

Evolved rubrics substantially improve evaluation quality over final accuracy alone.
They expose failures hidden by final accuracy.
They provide stronger supervision signals than outcome-only filtering for selecting training trajectories.
Process quality can be tracked independently of whether the task succeeds by chance.
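
The contrast between outcome-only filtering and process-aware filtering can be sketched as follows. This is a minimal illustration, not the paper's procedure: the `Trajectory` fields, the `min_process` threshold, and the choice to require both signals are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    verifier_passed: bool  # external outcome signal, kept separate from process
    process_score: float   # rubric-based process score in [0, 1] (hypothetical scale)


def outcome_only_filter(trajs):
    """Keep every trajectory whose final verifier passed."""
    return [t for t in trajs if t.verifier_passed]


def process_aware_filter(trajs, min_process=0.8):
    """Keep trajectories that pass the verifier AND meet a process threshold.

    Requiring both signals is an assumption made for illustration.
    """
    return [t for t in trajs if t.verifier_passed and t.process_score >= min_process]


rollouts = [
    Trajectory("t1", True, 0.95),   # clean success
    Trajectory("t1", True, 0.30),   # passed, but by trial and error
    Trajectory("t2", False, 0.90),  # good process, failed outcome
]

print(len(outcome_only_filter(rollouts)))   # 2
print(len(process_aware_filter(rollouts)))  # 1
```

Outcome-only filtering keeps the trial-and-error success; the process-aware variant drops it, which is the kind of difference the supervision claim depends on.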

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of process and outcome signals could apply to other agent evaluation settings that currently rely only on end results.
Self-evolution of rubrics might reduce the need for manual rubric design when new skills or domains appear.
Using process rubrics for filtering could produce training data that leads to agents less prone to trial-and-error behavior in multi-skill environments.

Load-bearing premise

Rubrics automatically derived from rollouts capture meaningful distinctions in the four process dimensions without systematic bias from the initial rollout distribution or the evolution process.

What would settle it

A side-by-side rating of the same trajectories by human experts where the evolved rubrics show no higher agreement with the experts than final accuracy alone does.
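
One way such a comparison might be scored is pairwise ranking agreement with the expert ratings. The sketch below is an assumption about how the test could be run, not something the paper describes; the metric, the half credit for ties, and all numbers are made up for illustration.

```python
from itertools import combinations


def pairwise_agreement(scores, human):
    """Fraction of trajectory pairs ordered the same way as the human ratings.

    Pairs the humans rate equally are skipped; scorer ties earn half credit.
    """
    total, agree = 0, 0.0
    for i, j in combinations(range(len(human)), 2):
        if human[i] == human[j]:
            continue
        total += 1
        human_prefers_i = human[i] > human[j]
        if scores[i] == scores[j]:
            agree += 0.5
        elif (scores[i] > scores[j]) == human_prefers_i:
            agree += 1.0
    return agree / total if total else float("nan")


# Made-up numbers for illustration only.
human_ratings = [5, 2, 4, 1]          # expert process ratings
rubric_scores = [0.9, 0.4, 0.7, 0.2]  # evolved-rubric process scores
final_accuracy = [1, 1, 0, 0]         # verifier pass/fail

rubric_agreement = pairwise_agreement(rubric_scores, human_ratings)
accuracy_agreement = pairwise_agreement(final_accuracy, human_ratings)
```

On real expert ratings, the paper's claim would be undermined if `rubric_agreement` came out no higher than `accuracy_agreement`.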

Original abstract

Skills are becoming a reusable operational layer for LLM agents, encoding SOPs, domain rules, tool workflows, scripts, and validation routines. In realistic skill repositories, overlapping skills make reliable skill-use difficult. Final verifier success is too coarse for both evaluation and training, since an agent may pass through trial and error while selecting distractor skills, skipping required steps, composing workflows incorrectly or omitting final checks. We introduce SkillCoach, a self-evolving rubric framework for evaluating and enhancing agentic skill-use. SkillCoach derives skill-grounded process rubrics from real rollouts and evaluates trajectories along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It keeps the external verifier as a separate outcome signal, allowing process quality to be distinguished from accidental task success. The evolved rubrics further serve as process supervision for selecting high-quality training trajectories. Experiments show that evolved rubrics substantially improve evaluation quality, expose failures hidden by final accuracy, and provide stronger supervision signals than outcome-only filtering for enhancing agentic skill-use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SkillCoach gives a concrete way to add process-level rubrics to skill-using agents while keeping outcome verification separate, but the self-evolution step still needs evidence that it does not just echo the initial rollout distribution.

The main takeaway is that this paper supplies a workable method for turning agent rollouts into evolving rubrics that score four process dimensions—skill selection, following, composition, and reflection—without folding the final task verifier into the score. That separation is the useful part.

The framework is new in tying self-evolution directly to those four dimensions and then feeding the rubrics back as process supervision for trajectory selection. The abstract makes clear that the external verifier stays independent, which avoids the usual collapse of process and outcome. That design choice is worth crediting.

The experiments are described only at a high level: evolved rubrics improve evaluation quality and give stronger training signals than outcome-only filtering. Without numbers, ablations, or details on how the evolution is anchored, it is impossible to judge effect size or whether the rubrics actually surface failures that final accuracy misses. The stress-test concern about bias from the starting rollout distribution is reasonable here; if the paper does not show controls or external validation for the rubric criteria, the improvements could be re-weighting of the same data rather than genuine discovery.

The work is aimed at people building and training LLM agents that draw from overlapping skill repositories. It is concrete enough and addresses a real bottleneck, so it deserves a serious referee even though the current evidence is thin. I would send it out for review.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SkillCoach, a self-evolving rubric framework for evaluating and enhancing agentic skill-use in LLM agents. It derives skill-grounded process rubrics from real rollouts and evaluates trajectories along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. The external verifier is kept separate as an outcome signal. Experiments demonstrate that the evolved rubrics improve evaluation quality, expose failures hidden by final accuracy, and provide stronger supervision signals than outcome-only filtering.

Significance. If the results hold, SkillCoach could offer a valuable method for process-level evaluation and supervision in agentic systems, addressing the limitations of coarse final verifiers in environments with overlapping skills. The explicit separation of process and outcome signals is a positive design choice that allows distinguishing genuine skill-use quality from accidental success.

major comments (2)

[Experiments] The central claim that evolved rubrics capture meaningful distinctions in the four process dimensions without systematic bias from the initial rollout distribution requires stronger validation. The manuscript does not provide ablations or analyses showing that the rubric evolution is independent of the base agent's exploration policy or skill repository characteristics, which is load-bearing for the claim that improvements reflect discovery of hidden process failures rather than re-weighting of the original distribution (Experiments section).
[Method] Details on the self-evolution mechanism for rubrics are insufficient to assess whether criteria remain anchored to independent process quality or propagate skews from the initial trajectories used for both generation and evaluation (Method section).

minor comments (1)

[Abstract] The abstract could more explicitly state the base models, datasets, or skill repositories used in the experiments to provide context for the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments highlight important areas for strengthening the validation of our claims and the clarity of the method. We address each below and commit to revisions that directly respond to the concerns.

Referee: [Experiments] The central claim that evolved rubrics capture meaningful distinctions in the four process dimensions without systematic bias from the initial rollout distribution requires stronger validation. The manuscript does not provide ablations or analyses showing that the rubric evolution is independent of the base agent's exploration policy or skill repository characteristics, which is load-bearing for the claim that improvements reflect discovery of hidden process failures rather than re-weighting of the original distribution (Experiments section).

Authors: We agree that demonstrating independence from the initial rollout distribution is important for the central claim. The current experiments compare evolved rubrics against outcome-only baselines and show improved detection of process failures, but they do not include the requested cross-policy or cross-repository ablations. In the revision we will add these analyses: we will re-run rubric evolution using trajectories from two additional base agents with different exploration policies and from a second skill repository, then measure whether the resulting rubrics yield consistent process-quality rankings and supervision gains. This will directly test whether improvements arise from re-weighting the original distribution or from discovery of generalizable process criteria.

Revision: yes

Referee: [Method] Details on the self-evolution mechanism for rubrics are insufficient to assess whether criteria remain anchored to independent process quality or propagate skews from the initial trajectories used for both generation and evaluation (Method section).

Authors: We acknowledge that the current Method section provides only a high-level description of the self-evolution loop. In the revision we will expand this section with (1) a detailed algorithm box showing the exact steps of rubric generation, scoring, and iterative refinement, (2) explicit discussion of how the external verifier remains an independent outcome signal that is never used to modify rubric criteria, and (3) concrete examples illustrating how a rubric criterion is updated only when multiple trajectories exhibit the same process pattern, thereby reducing the risk of propagating single-trajectory skews. These additions will allow readers to evaluate whether the mechanism stays anchored to process quality.

Revision: yes
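
The "multiple trajectories" condition in the simulated rebuttal can be read as a support threshold. The sketch below shows that reading only; the pattern names, the `min_support` value, and the function itself are hypothetical, and the paper's actual update rule is not given here.

```python
from collections import defaultdict


def supported_patterns(observations, min_support=2):
    """Return patterns seen in at least `min_support` distinct trajectories.

    `observations` is an iterable of (trajectory_id, pattern) pairs.
    """
    seen = defaultdict(set)
    for traj_id, pattern in observations:
        seen[pattern].add(traj_id)
    return sorted(p for p, ids in seen.items() if len(ids) >= min_support)


observations = [
    ("traj-1", "skipped_final_check"),
    ("traj-2", "skipped_final_check"),
    ("traj-2", "skipped_final_check"),  # repeats within one trajectory count once
    ("traj-3", "read_distractor_skill"),
]

print(supported_patterns(observations))  # ['skipped_final_check']
```

The verifier result is deliberately absent from the inputs, mirroring the statement that the verifier is never used to modify rubric criteria.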

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained with external verification

The paper derives rubrics from rollouts, evolves them, and applies them to distinguish process quality from outcome success while keeping the external verifier separate. Experiments then measure improvements in evaluation quality and supervision signals against baselines. No quoted step reduces a central claim (e.g., 'improved evaluation quality') to a fitted parameter or self-citation by construction; the four process dimensions are evaluated via the evolved rubrics but validated externally rather than tautologically. This is the normal case of an independent experimental pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes that rollout-derived rubrics can be meaningfully evolved and applied without additional human annotation.


Rubric and judge prompt excerpts

Process dimensions

skill_selection (GATE): did the agent select the required gold skill(s) and avoid distractor skills? In a no-gold setting, did it correctly refuse to use a skill? If this fails, downstream dimensions are discounted.
skill_following: did the agent actually perform the skill’s KEY STEPS (not just name the skill)? Steps marked "not needed" for this instance do not count against coverage.
skill_composition_order: for multi-skill / multi-step tasks, are the step ORDER and the passing of intermediate artifacts between skills correct? If the task has a single gold skill this dimension is not_applicable.
result_reflection: before finishing, did the agent do an EXPLICIT, visible self-check / verification / reflection of its result? Only visible behavior counts; never assume hidden reasoning.
verifier: the task’s hard verifier result. This is an external outcome signal produced by a rule runner, not an LLM judgment, and it is not part of the process meta sc...
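
A minimal sketch of how the four dimensions might be combined while the verifier stays outside the aggregate. The excerpt only says that downstream dimensions are discounted when the selection gate fails and that composition can be not_applicable; the plain average, `GATE_DISCOUNT`, and `gate_threshold` below are assumptions.

```python
from typing import Optional

GATE_DISCOUNT = 0.5  # hypothetical; the source does not state the discount


def aggregate_process_score(
    selection: float,
    following: float,
    composition: Optional[float],  # None when the dimension is not_applicable
    reflection: float,
    gate_threshold: float = 0.5,   # hypothetical pass mark for the gate
) -> float:
    """Average the applicable dimensions, discounting downstream ones if the gate fails."""
    downstream = [following, reflection]
    if composition is not None:
        downstream.append(composition)
    if selection < gate_threshold:
        downstream = [d * GATE_DISCOUNT for d in downstream]
    scores = [selection] + downstream
    return sum(scores) / len(scores)


def evaluate(process_scores: dict, verifier_passed: bool) -> dict:
    """Report process and outcome side by side; the verifier never enters the score."""
    return {
        "process_score": aggregate_process_score(**process_scores),
        "verifier_passed": verifier_passed,
    }


lucky = evaluate(
    {"selection": 0.0, "following": 1.0, "composition": None, "reflection": 1.0},
    verifier_passed=True,
)
```

Here `lucky` passes the verifier yet scores low on process, the case that final accuracy alone would hide.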

Key-step extraction prompt (excerpt)

Return at least 4 key steps unless the task is truly trivial.
At least one key step must come from the gold skill content.
At least one key step must be tied to the final artifact or verifier requirement.
Every critical step must include positive_evidence and negative_evidence.
Each description must describe an action that can be checked against tool calls, messages, files, commands, or artifacts.
Do not infer hidden reasoning. Only visible trajectory evidence counts.
Return ONLY the JSON object.

USER PROMPT TEMPLATE
Extract evidence-checkable key steps for task ‘<task_id>‘.
<TASK_INSTRUCTION> <task instruction redacted> </TASK_INSTRUCTION>
<FULL_GOLD_SKILL_PACK::<skill_name>> <full gold SKILL.md content redacted> </FULL_GOLD_SKILL_PACK::<skill_nam...
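
The structural rules in this excerpt lend themselves to a mechanical check on the extractor's output. The validator below is a sketch under assumptions: `positive_evidence` and `negative_evidence` appear in the excerpt, while `id`, `critical`, `from_gold_skill`, and `tied_to_final_artifact` are hypothetical field names. The rule that descriptions must be checkable against visible evidence is not something code of this kind can verify.

```python
def validate_key_steps(steps, trivial_task=False):
    """Return a list of rule violations for an extracted key-step list."""
    problems = []
    if len(steps) < 4 and not trivial_task:
        problems.append("fewer than 4 key steps")
    if not any(s.get("from_gold_skill") for s in steps):
        problems.append("no key step from the gold skill content")
    if not any(s.get("tied_to_final_artifact") for s in steps):
        problems.append("no key step tied to the final artifact or verifier")
    for s in steps:
        has_evidence = s.get("positive_evidence") and s.get("negative_evidence")
        if s.get("critical") and not has_evidence:
            problems.append(f"critical step {s.get('id')} lacks evidence fields")
    return problems


good_steps = [
    {"id": "K1", "from_gold_skill": True, "critical": True,
     "positive_evidence": "SKILL.md read", "negative_evidence": "skill only named"},
    {"id": "K2", "from_gold_skill": True, "critical": False},
    {"id": "K3", "critical": False},
    {"id": "K4", "tied_to_final_artifact": True, "critical": True,
     "positive_evidence": "report file written", "negative_evidence": "no report file"},
]
bad_steps = [{"id": "K1", "critical": True, "positive_evidence": "ran audit"}]

print(validate_key_steps(good_steps))      # []
print(len(validate_key_steps(bad_steps)))  # 4
```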

R0 rubric generation prompt (excerpt)

skill_selection must name the actual gold skills and distractor skills.
skill_selection must distinguish real skill use (SKILL.md was read) from merely mentioning a skill name.
skill_following.key_steps must be exactly the provided EXTRACTED_KEY_STEPS.
skill_following criteria and score_rules must refer to key step IDs.
If there is only one gold skill, set skill_composition_order.applicable to false.
If there are multiple gold skills or ordered substeps, fill expected_order, dependencies, and handoff_requirements.
result_reflection only counts visible self-checking behavior.
verifier is not judged by an LLM; it comes from the hard benchmark verifier.
Sample real rollouts are only for trajectory format and common mistakes. They are not labels.
Return ONLY the JSON rubric.

USER PROMPT TEMPLATE
Generate the R0 rubric for task ‘<task_id>‘.
<TASK_INSTRUCTION> <task instruction redacted> </TASK_INSTRUCTION>
<TASK_PACKAGE> <task package JSON redacted> </TASK_PACKAGE>
<GOLD_SKILL::<skill_name>> <full gold SKI...
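
To make the shape implied by these rules concrete, here is a hedged sketch of an initial rubric skeleton. The top-level dimension names and `applicable`, `expected_order`, `dependencies`, `handoff_requirements`, and `key_steps` come from the excerpt; the inner keys `gold_skills`, `distractor_skills`, and `visible_only` are hypothetical. Using the gold-skill list as the expected order is an assumption, and the "ordered substeps" case is not modelled.

```python
def rubric_skeleton(gold_skills, distractor_skills, key_steps):
    """Build an empty rubric dict that follows the structural rules above."""
    multi_skill = len(gold_skills) > 1
    return {
        "skill_selection": {
            "gold_skills": list(gold_skills),
            "distractor_skills": list(distractor_skills),
        },
        # key_steps are passed through unchanged, as the rules require
        "skill_following": {"key_steps": list(key_steps)},
        "skill_composition_order": {
            "applicable": multi_skill,
            "expected_order": list(gold_skills) if multi_skill else [],
            "dependencies": [],
            "handoff_requirements": [],
        },
        "result_reflection": {"visible_only": True},
        # no verifier entry: the verifier result comes from the benchmark, not the rubric
    }


single = rubric_skeleton(["skill-a"], ["skill-x"], ["K1", "K2"])
multi = rubric_skeleton(["skill-a", "skill-b"], ["skill-x"], ["K1", "K2"])
```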

Skill-selection judge prompt (excerpt)

correct means all required gold skills were selected and no harmful distractor was used.
partial means a gold skill was read or invoked, but distractor evidence also appears.
wrong means the agent mainly selected a distractor or used the wrong skill path.
missing means no gold skill evidence is found.
false_trigger is true when the agent uses a skill in a no-gold setting or forces an irrelevant skill.
Every positive judgment must cite event_index evidence.

INPUT TEMPLATE
<EVENT_INDEXED_TIMELINE> <compact timeline with event_index retained> </EVENT_INDEXED_TIMELINE>
<SKILL_EVENTS> <skill event JSON, if present> </SKILL_EVENTS>
<GOLD_SKILLS> <gold skill names> </GOLD_SKILLS>
<DISTRACTOR_SKILLS> <distractor skill names> </DISTRACTOR_SKILLS>

Prompt 4: Skil...
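
In the paper these labels are assigned by a judge working from trajectory evidence. As a rough rule-based approximation, the label definitions can be written as set logic over the skills that were actually used. Several choices below are assumptions: every distractor is treated as harmful, a partial set of gold skills maps to "partial", distractor-only use maps to "wrong", and the "forces an irrelevant skill" case of false_trigger is not modelled.

```python
def selection_label(used, gold, distractors):
    """Map skill-use evidence to (label, false_trigger).

    `used` holds skills with real use evidence (e.g. SKILL.md was read),
    not skills that were merely mentioned.
    """
    used, gold, distractors = set(used), set(gold), set(distractors)
    if not gold:
        # no-gold setting: using no skill at all is the correct behavior
        return ("wrong", True) if used else ("correct", False)
    gold_used = used & gold
    distractor_used = used & distractors
    if not gold_used:
        return ("wrong" if distractor_used else "missing", False)
    if gold_used == gold and not distractor_used:
        return ("correct", False)
    return ("partial", False)


print(selection_label({"audit"}, {"audit"}, {"lint"}))          # ('correct', False)
print(selection_label({"audit", "lint"}, {"audit"}, {"lint"}))  # ('partial', False)
print(selection_label({"lint"}, set(), {"lint"}))               # ('wrong', True)
```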

Skill-following judge prompt (excerpt)

Do not output critical_step_coverage.
The code will compute score and coverage later.
completed and partial require at least one event_index evidence item.
missing may have empty evidence.
not_needed must cite the key step’s optional_condition.
If there is no gold skill invocation evidence, do not mark critical skill-specific steps as completed.
Return ONLY that JSON object.

USER PROMPT TEMPLATE
Judge dimension ‘skill_following‘ for a trajectory on task ‘<task_id>‘.
<DIMENSION_RUBRIC> <skill_following rubric JSON, including key_steps> </DIMENSION_RUBRIC>
<GOLD_SKILLS> <gold skill names> </GOLD_S...
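
Since the excerpt says coverage is computed in code rather than by the judge, a post-processing step along the following lines is plausible. This is a sketch, not the paper's formula: `PARTIAL_CREDIT`, the downgrade of unsupported judgments to missing, and the `None` return when no step counts are assumptions, and every step passed in is treated as critical.

```python
PARTIAL_CREDIT = 0.5  # hypothetical weight for a partially completed step


def critical_step_coverage(judgments):
    """Compute coverage from per-step judgments.

    Each judgment is a dict with 'status' (completed, partial, missing,
    not_needed) and 'evidence' (a list of event_index values).
    """
    earned, counted = 0.0, 0
    for judgment in judgments:
        status = judgment["status"]
        if status == "not_needed":
            continue  # does not count against coverage
        counted += 1
        if status in ("completed", "partial") and not judgment.get("evidence"):
            status = "missing"  # positive judgments need event_index evidence
        if status == "completed":
            earned += 1.0
        elif status == "partial":
            earned += PARTIAL_CREDIT
    return earned / counted if counted else None


judgments = [
    {"status": "completed", "evidence": [3]},
    {"status": "partial", "evidence": [7]},
    {"status": "missing", "evidence": []},
    {"status": "not_needed", "evidence": []},
    {"status": "completed", "evidence": []},  # unsupported, treated as missing
]

print(critical_step_coverage(judgments))  # 0.375
```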

Composition-order judge prompt (excerpt)

First infer observed_order from the trajectory.
Compare observed_order with expected_order.
Check whether each dependency’s artifact was produced before it was consumed.
Check whether handoff_requirements are satisfied.
If there is only one gold skill and no ordered dependencies, return score 1.0 and order_correct=null.
Cite event_index evidence for every error.
Return ONLY that JSON object.

USER PROMPT TEMPLATE
Judge dimension ‘skill_composition_order‘ for a trajectory on task ‘<task_id>‘.
<DIMENSION_RUBRIC> <composition rubric JSON with expected_order, dependencies, and handoff requirements> </DIMENSION_RUBRIC>
<GOLD_SKILLS> <gold skill names> </GOLD_SKILLS>
<DISTRACTO...
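
The ordering and dependency checks above can be approximated in code once an observed order has been inferred. The sketch below assumes dependencies are given as (producer, consumer) step pairs and scores the result as the fraction of checks passed; both are assumptions, and handoff_requirements and event_index citations are not modelled. Step names are hypothetical.

```python
def check_composition(expected_order, observed_order, dependencies=()):
    """Compare observed and expected order and check produce-before-consume."""
    if len(expected_order) <= 1 and not dependencies:
        # single gold skill, no ordered dependencies
        return {"score": 1.0, "order_correct": None, "errors": []}

    first_seen = {}
    for index, step in enumerate(observed_order):
        first_seen.setdefault(step, index)

    expected = list(expected_order)
    observed = sorted((s for s in expected if s in first_seen), key=first_seen.get)
    order_correct = observed == expected

    errors = []
    if not order_correct:
        errors.append("observed order differs from expected order")
    for producer, consumer in dependencies:
        if producer not in first_seen or consumer not in first_seen:
            errors.append(f"{producer} -> {consumer}: step missing")
        elif first_seen[producer] > first_seen[consumer]:
            errors.append(f"{producer} -> {consumer}: consumed before produced")

    checks = 1 + len(dependencies)
    score = (checks - len(errors)) / checks
    return {"score": score, "order_correct": order_correct, "errors": errors}


result = check_composition(
    expected_order=["parse", "transform", "validate"],
    observed_order=["parse", "validate", "transform"],
    dependencies=[("parse", "transform"), ("transform", "validate")],
)
print(result["order_correct"], len(result["errors"]))  # False 2
```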

[1] [1]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Yimin Liu, Wenbo Chen, Bingran You, Zonglin Di, Yifeng He, Shenghan Zheng, Kyoung Whan Choe, Jiankai Sun, Shuyi Wang, Chujun Tao, Binxu Li, Xuandong Zhao, Hejia Geng, Xiaojun Wu, Junwei Zhou, Xiaokun Chen, Hanwen Xing, Yubo Li, Qunhong Zeng, Di Wang, Yuanli Wang, Roey Ben Chaim, Penghao Jiang, Haotian Shen, Luyang Kong, Xinyi Liu, Runhui Wang,...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng, Yuyi Ao, Leonardo F. R. Ribeiro, Markus Dreyer, Sean Ammirati, and Chenyan Xiong. Skilllearnbench: Benchmarking continual learning methods for agent skill generation on real-world tasks. 2026. URLhttps://arxiv.org/abs/2604.20087

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Ziao Zhang, Kou Shi, Shiting Huang, Avery Nie, Yu Zeng, Yiming Zhao, Zhen Fang, Qishen Su, Haibo Qiu, Wei Yang, QingnanRen, ShunZou, WenxuanHuang, LinChen, ZehuiChen, andFengZhao. SkillFlow: Benchmarking lifelong skill discovery and evolution for autonomous agents. arXiv preprint arXiv:2604.17308, 2026. URL https://arxiv.org/abs/2604.17308

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward.arXiv preprint arXiv:2602.12430, 2026. URLhttps://arxiv.org/abs/2602.12430

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. EvoSkill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026. URLhttps://arxiv.org/abs/2603. 02766

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, and Chong Luo. SkillOpt: Executive strategy for self-evolving agent skills.arXiv preprint arXiv:2605.23904, 2026. URLhttps://arxiv.org/abs/2605.23904

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Agent-as-a-Judge: Evaluate agents with agents

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, and Jürgen Schmidhuber. Agent-as-a-Judge: Evaluate agents with agents. arXiv preprint arXiv:2410.10934, 2024. URL https://arxiv.org/abs/2410.10934

work page arXiv 2024

[8] [8]

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, Xin Cong, and Yankai Lin. AgentProcessBench: Diagnosing step-level process quality in tool-using agents. arXiv preprint arXiv:2603.14465, 2026. URLhttps://arxiv.org/abs/2603.14465

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

ToolPRMBench: Evaluating and advancing process reward models for tool-using agents.arXiv preprint arXiv:2601.12294, 2026

Dawei Li, Yuguang Yao, Zhen Tan, Huan Liu, and Ruocheng Guo. ToolPRMBench: Evaluating and advancing process reward models for tool-using agents.arXiv preprint arXiv:2601.12294, 2026. URLhttps://arxiv.org/ abs/2601.12294

work page arXiv 2026

[10] [10]

AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

Liang Ding. AdaRubric: Task-adaptive rubrics for reliable LLM agent evaluation and reward learning.arXiv preprint arXiv:2603.21362, 2026. URLhttps://arxiv.org/abs/2603.21362. 13

work page internal anchor Pith review Pith/arXiv arXiv 2026

[11] [11]

Autorubric: Unifying Rubric-based LLM Evaluation

Delip Rao and Chris Callison-Burch. Autorubric: Unifying rubric-based LLM evaluation. arXiv preprint arXiv:2603.00077, 2026. URLhttps://arxiv.org/abs/2603.00077

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

CUA- Skill: Develop skills for computer using agent.arXiv preprint arXiv:2601.21123, 2026

Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Leon Xu, Suzhen Zheng, Hao Fan, Pashmina Cameron, Justin Wagle, and Kazuhito Koishida. CUA- Skill: Develop skills for computer using agent.arXiv preprint arXiv:2601.21123, 2026. URLhttps://arxiv.org/ abs/2601.21123

work page arXiv 2026

[13] M. Danish Lim, I. Danial Bin Sharudin, Wen Han Chen, Cedric Lim, and Laura Wynter. Declarative skills for AI agents in knowledge-grounded tool-use workflows. arXiv preprint arXiv:2606.06923, 2026. URL https://arxiv.org/abs/2606.06923

[14] Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, and Shiyu Chang. How well do agentic skills work in the wild: Benchmarking LLM skill usage in realistic settings, 2026. URL https://arxiv.org/abs/2604.04323

[15] Yuchen Ma, Yue Huang, Han Bao, Haomin Zhuang, Swadheen Shukla, Michel Galley, Xiangliang Zhang, and Stefan Feuerriegel. SkillGen: Verified inference-time agent skill synthesis. arXiv preprint arXiv:2605.10999, 2026. URL https://arxiv.org/abs/2605.10999

[16] Srishti Gautam, Arjun Radhakrishna, and Sumit Gulwani. SkillAxe: Sharpening LLM-authored agent skills through evaluation-guided self-refinement. arXiv preprint arXiv:2606.10546, 2026. URL https://arxiv.org/abs/2606.10546

[17] Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue Liu, Xiaoxiao Li, and Philip S. Yu. CoEvoSkills: Self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687, 2026. URL https://arxiv.org/abs/2604.01687

[18] Shuaike Shen, Wenduo Cheng, Mingqian Ma, Alistair Turcan, Martin Jinye Zhang, and Jian Ma. SKILLFOUNDRY: Building self-evolving agent skill libraries from heterogeneous scientific resources. arXiv preprint arXiv:2604.03964, 2026. URL https://arxiv.org/abs/2604.03964

[19] Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, and Tieying Zhang. MUSE-Autoskill: Self-evolving agents via skill creation, memory, management, and evaluation. arXiv preprint arXiv:2605.27366, 2026. URL https://arxiv.org/abs/2605.27366

[20] Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102, 2025. URL https://arxiv.org/abs/2512.17102

[21] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026. URL https://arxiv.org/abs/2602.08234

[22] Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. MemSkill: Learning and evolving memory skills for self-evolving agents. arXiv preprint arXiv:2602.02474, 2026. URL https://arxiv.org/abs/2602.02474

[23] Xiaolin Zhou, Jinbo Liu, Li Li, Ryan A. Rossi, and Xiyang Hu. Counterfactual trace auditing of LLM agent skills. arXiv preprint arXiv:2605.11946, 2026. URL https://arxiv.org/abs/2605.11946

Category Task Skill Library Data Gold Distr. Inst. Training Tasks Software Engineering software-dependency-audit 3 5 3 fix-security-bug 1 5 1 fix-erlang-ssh-cve 6 5 ...

[24] skill_selection (GATE): did the agent select the required gold skill(s) and avoid distractor skills? In a no-gold setting, did it correctly refuse to use a skill? If this fails, downstream dimensions are discounted.

[25] skill_following: did the agent actually perform the skill’s KEY STEPS (not just name the skill)? Steps marked "not needed" for this instance do not count against coverage.

[26] skill_composition_order: for multi-skill / multi-step tasks, are the step ORDER and the passing of intermediate artifacts between skills correct? If the task has a single gold skill this dimension is not_applicable.

[27] result_reflection: before finishing, did the agent do an EXPLICIT, visible self-check / verification / reflection of its result? Only visible behavior counts; never assume hidden reasoning.

verifier: the task’s hard verifier result. This is an external outcome signal produced by a rule runner, not an LLM judgment, and it is not part of the process meta sc...
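One way to read the gate and the separate verifier signal is sketched below. The discount factor of 0.5, the gate threshold, the equal-weight mean over dimensions, and all class and function names are assumptions made for illustration. The text above states only that downstream dimensions are discounted when skill_selection fails, that a single-gold-skill task makes skill_composition_order not applicable, and that the verifier result stays outside the process score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessScores:
    """Per-dimension scores in [0, 1]; None marks a not-applicable dimension."""
    skill_selection: float
    skill_following: float
    skill_composition_order: Optional[float]  # None when there is one gold skill
    result_reflection: float


def process_score(s: ProcessScores, gate_threshold: float = 0.5,
                  discount: float = 0.5) -> float:
    """Average the applicable dimensions, discounting those downstream of a failed gate.

    gate_threshold and discount are hypothetical values, not taken from the paper.
    """
    downstream = [s.skill_following, s.skill_composition_order, s.result_reflection]
    applicable = [d for d in downstream if d is not None]
    if s.skill_selection < gate_threshold:
        # Gate failed: scale down every downstream dimension.
        applicable = [d * discount for d in applicable]
    dims = [s.skill_selection] + applicable
    return sum(dims) / len(dims)


def evaluate(s: ProcessScores, verifier_passed: bool) -> dict:
    """Report process quality and the hard verifier outcome as separate fields."""
    return {"process": process_score(s), "verifier": verifier_passed}
```

Keeping the two fields apart lets a trajectory that passes the verifier with a low process score be flagged as an accidental success instead of being averaged away.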

[28] Return at least 4 key steps unless the task is truly trivial.

[29] At least one key step must come from the gold skill content.

[30] At least one key step must be tied to the final artifact or verifier requirement.

[31] Every critical step must include positive_evidence and negative_evidence.

[32] Each description must describe an action that can be checked against tool calls, messages, files, commands, or artifacts.

[33] Do not infer hidden reasoning. Only visible trajectory evidence counts. Return ONLY the JSON object.

USER PROMPT TEMPLATE

Extract evidence-checkable key steps for task ‘<task_id>’.
<TASK_INSTRUCTION> <task instruction redacted> </TASK_INSTRUCTION>
<FULL_GOLD_SKILL_PACK::<skill_name>> <full gold SKILL.md content redacted> </FULL_GOLD_SKILL_PACK::<skill_nam...
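The key-step rules above lend themselves to a mechanical check on the extractor's JSON output. The sketch below is a minimal illustration, not the authors' validator: only positive_evidence and negative_evidence are field names taken from the prompt, while source, tied_to_final_artifact, critical, and description are hypothetical names chosen for this example.

```python
from typing import Any, Dict, List


def validate_key_steps(steps: List[Dict[str, Any]], trivial: bool = False) -> List[str]:
    """Return a list of rule violations; an empty list means the steps pass.

    Field names other than positive_evidence / negative_evidence are hypothetical.
    """
    problems = []
    if len(steps) < 4 and not trivial:
        problems.append("fewer than 4 key steps")
    if not any(s.get("source") == "gold_skill" for s in steps):
        problems.append("no key step comes from the gold skill content")
    if not any(s.get("tied_to_final_artifact") for s in steps):
        problems.append("no key step is tied to the final artifact or verifier")
    for i, s in enumerate(steps):
        has_both = s.get("positive_evidence") and s.get("negative_evidence")
        if s.get("critical") and not has_both:
            problems.append(f"critical step {i} lacks positive or negative evidence")
        if not s.get("description"):
            problems.append(f"step {i} has no checkable description")
    return problems
```

A check like this can only confirm that a description exists; whether the described action is truly observable in tool calls or artifacts still needs a judge or a human reviewer.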

[34] skill_selection must name the actual gold skills and distractor skills.

[35] skill_selection must distinguish real skill use (SKILL.md was read) from merely mentioning a skill name.

[36] skill_following.key_steps must be exactly the provided EXTRACTED_KEY_STEPS.

[37] skill_following criteria and score_rules must refer to key step IDs.

[38] If there is only one gold skill, set skill_composition_order.applicable to false.

[39] If there are multiple gold skills or ordered substeps, fill expected_order, dependencies, and handoff_requirements.

[40] result_reflection only counts visible self-checking behavior.

[41] verifier is not judged by an LLM; it comes from the hard benchmark verifier.

[42] Sample real rollouts are only for trajectory format and common mistakes. They are not labels. Return ONLY the JSON rubric.

USER PROMPT TEMPLATE

Generate the R0 rubric for task ‘<task_id>’.
<TASK_INSTRUCTION> <task instruction redacted> </TASK_INSTRUCTION>
<TASK_PACKAGE> <task package JSON redacted> </TASK_PACKAGE>
<GOLD_SKILL::<skill_name>> <full gold SKI...
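Several of the R0 rubric constraints listed above are structural and can be checked before a rubric is accepted. The sketch below assumes the rubric is a nested dictionary keyed by dimension name; that layout and the function name are assumptions, while key_steps, applicable, expected_order, dependencies, and handoff_requirements are the field names used in the prompt. It does not model the "ordered substeps" case for a single gold skill.

```python
from typing import Any, Dict, List


def check_r0_rubric(rubric: Dict[str, Any], gold_skills: List[str],
                    extracted_key_steps: List[Dict[str, Any]]) -> List[str]:
    """Return violated structural constraints for a draft rubric (empty if none)."""
    problems = []
    following = rubric.get("skill_following", {})
    if following.get("key_steps") != extracted_key_steps:
        problems.append("skill_following.key_steps differs from EXTRACTED_KEY_STEPS")
    comp = rubric.get("skill_composition_order", {})
    if len(gold_skills) <= 1:
        # Simplification: ordered substeps within one skill are not modeled here.
        if comp.get("applicable") is not False:
            problems.append("single gold skill: applicable must be false")
    else:
        for field in ("expected_order", "dependencies", "handoff_requirements"):
            if field not in comp:
                problems.append(f"multiple gold skills: {field} is missing")
    return problems
```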

[43] correct means all required gold skills were selected and no harmful distractor was used.

[44] partial means a gold skill was read or invoked, but distractor evidence also appears.

[45] wrong means the agent mainly selected a distractor or used the wrong skill path.

[46] missing means no gold skill evidence is found.

[47] false_trigger is true when the agent uses a skill in a no-gold setting or forces an irrelevant skill.
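The label definitions above can be sketched as a set comparison between the skills actually used and the gold and distractor sets. In the prompt these labels are assigned by an LLM judge from trajectory evidence, so the function below is only an illustration of the decision logic. Its tie-breaking choices are assumptions: distractor-only use is labeled wrong rather than missing, incomplete gold coverage without distractors is labeled partial, and false_trigger is modeled only for the no-gold setting.

```python
from typing import Dict, Set


def selection_label(used: Set[str], gold: Set[str],
                    distractors: Set[str]) -> Dict[str, object]:
    """Map observed skill use to a label and a false_trigger flag.

    `used` holds skills with real use evidence (e.g. SKILL.md was read),
    not skills that were only mentioned by name.
    """
    if not gold:
        # No-gold setting: any skill use counts as a false trigger.
        # The label returned in this setting is an assumption of the sketch.
        return {"label": "correct" if not used else "wrong",
                "false_trigger": bool(used)}
    used_gold = used & gold
    used_distractors = used & distractors
    if not used_gold:
        label = "wrong" if used_distractors else "missing"
    elif used_distractors or used_gold != gold:
        label = "partial"
    else:
        label = "correct"
    return {"label": label, "false_trigger": False}
```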

[48] Every positive judgment must cite event_index evidence.

INPUT TEMPLATE

<EVENT_INDEXED_TIMELINE> <compact timeline with event_index retained> </EVENT_INDEXED_TIMELINE>
<SKILL_EVENTS> <skill event JSON, if present> </SKILL_EVENTS>
<GOLD_SKILLS> <gold skill names> </GOLD_SKILLS>
<DISTRACTOR_SKILLS> <distractor skill names> </DISTRACTOR_SKILLS>

Prompt 4: Skil...

[49] Do not output critical_step_coverage.

[50] The code will compute score and coverage later.

[51] completed and partial require at least one event_index evidence item.

[52] missing may have empty evidence.

[53] not_needed must cite the key step’s optional_condition.
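Because the judge reports only per-step statuses and the score is computed in code, the post-processing might look like the sketch below. The status names, event_index evidence, and optional_condition come from the rules above; the half credit for partial, the dictionary layout (id, status, evidence), and the treatment of an all-not_needed task as full coverage are assumptions for illustration.

```python
from typing import Any, Dict, List

# Credit per status; the 0.5 for "partial" is an assumed value.
STEP_CREDIT = {"completed": 1.0, "partial": 0.5, "missing": 0.0}


def validate_step(step: Dict[str, Any]) -> List[str]:
    """Check the evidence requirement attached to each status."""
    status, evidence = step["status"], step.get("evidence", [])
    if status in ("completed", "partial") and not evidence:
        return [f"{step['id']}: {status} needs at least one event_index"]
    if status == "not_needed" and not step.get("optional_condition"):
        return [f"{step['id']}: not_needed must cite optional_condition"]
    return []


def coverage(steps: List[Dict[str, Any]]) -> float:
    """Mean credit over steps that are needed; not_needed steps are excluded."""
    needed = [s for s in steps if s["status"] != "not_needed"]
    if not needed:
        return 1.0  # assumption: nothing required counts as full coverage
    return sum(STEP_CREDIT[s["status"]] for s in needed) / len(needed)
```

Excluding not_needed steps from the denominator matches the rule that such steps do not count against coverage.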

[54] If there is no gold skill invocation evidence, do not mark critical skill-specific steps as completed. Return ONLY that JSON object.

USER PROMPT TEMPLATE

Judge dimension ‘skill_following’ for a trajectory on task ‘<task_id>’.
<DIMENSION_RUBRIC> <skill_following rubric JSON, including key_steps> </DIMENSION_RUBRIC>
<GOLD_SKILLS> <gold skill names> </GOLD_S...

[55] First infer observed_order from the trajectory.

[56] Compare observed_order with expected_order.

[57] Check whether each dependency’s artifact was produced before it was consumed.

[58] Check whether handoff_requirements are satisfied.

[59] If there is only one gold skill and no ordered dependencies, return score 1.0 and order_correct=null.
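The ordering procedure above can be illustrated with a small check over step names. In the prompt the judge is an LLM that infers observed_order from the trajectory; the sketch below takes that order as given. It treats expected_order as a required subsequence, represents each dependency as a (producer, consumer) pair, and uses a simple error-fraction score, all of which are assumptions. handoff_requirements are not modeled.

```python
from typing import Dict, List, Tuple


def is_subsequence(expected: List[str], observed: List[str]) -> bool:
    """True if expected appears in observed in the same relative order."""
    it = iter(observed)
    return all(step in it for step in expected)


def check_composition(expected_order: List[str], observed_order: List[str],
                      dependencies: List[Tuple[str, str]],
                      gold_skills: List[str]) -> Dict[str, object]:
    """Compare observed and expected order and check producer-before-consumer."""
    if len(gold_skills) <= 1 and not dependencies:
        # Single gold skill, nothing to order: order_correct is null (None).
        return {"score": 1.0, "order_correct": None, "errors": []}
    errors = []
    order_correct = is_subsequence(expected_order, observed_order)
    if not order_correct:
        errors.append("observed_order does not follow expected_order")
    first_seen: Dict[str, int] = {}
    for i, step in enumerate(observed_order):
        first_seen.setdefault(step, i)
    for producer, consumer in dependencies:
        p, c = first_seen.get(producer), first_seen.get(consumer)
        if p is None or (c is not None and c < p):
            errors.append(f"artifact of {producer} not produced before {consumer}")
    checks = 1 + len(dependencies)
    score = 1.0 - len(errors) / checks  # hypothetical scoring rule
    return {"score": score, "order_correct": order_correct, "errors": errors}
```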

[60] Cite event_index evidence for every error. Return ONLY that JSON object.

USER PROMPT TEMPLATE

Judge dimension ‘skill_composition_order’ for a trajectory on task ‘<task_id>’.
<DIMENSION_RUBRIC> <composition rubric JSON with expected_order, dependencies, and handoff requirements> </DIMENSION_RUBRIC>
<GOLD_SKILLS> <gold skill names> </GOLD_SKILLS>
<DISTRACTO...