More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries

Hongwen Song; Song Wei

arXiv: 2605.24050 · v2 · pith:3GZ56ZTY · submitted 2026-05-21 · 💻 cs.SE · cs.AI · stat.AP

Pith reviewed 2026-06-30 15:40 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · stat.AP

keywords skill libraries · LLM agents · performance degradation · skill shadowing · context overhead · skill selection · pass rate drop

The pith

Skill shadowing from wrong selections, not context overhead, drives most performance loss as LLM skill libraries grow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the observed decline in LLM agent success rates when skill libraries expand from small sets of helpful skills to libraries of 202 skills, with drops reaching 21 percent. It decomposes this pass rate drop by conditioning on the skills an agent actually invokes, separating skill shadowing—where agents choose incorrect skills more often in larger libraries—from context overhead, where correct selections suffer due to longer prompts. Empirical estimates and derived upper bounds show that skill shadowing grows with library size and accounts for the majority of the degradation, while context overhead remains small and statistically indistinguishable from zero. This distinction matters for building reliable skill libraries that let non-experts handle domain tasks through natural language instructions.

Core claim

By conditioning on the skill or skills invoked during each trajectory, the pass rate drop between a small library of known-helpful skills and the full library can be separated into skill shadowing, which increases with library size, and context overhead, which does not. Empirical measurements and their upper bounds both indicate that skill shadowing contributes substantially to the observed degradation while context overhead contributes negligibly, establishing skill selection failure as the primary bottleneck when libraries expand.

What carries the argument

Decomposition of the pass rate drop by conditioning on invoked skills, which isolates skill shadowing (increased rate of incorrect selections) from context overhead (execution degradation with correct selections) and supplies upper bounds on each.
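
One way to make this decomposition concrete is a standard law-of-total-probability split. The notation below is an assumption for illustration (a binary "correct skill invoked" indicator and the symbols q, a, b); the paper's own definitions and bounds are not reproduced here and may differ, in particular in how the third term is assigned.

```latex
% Illustrative notation (assumed, not taken from the paper):
%   L \in \{S, F\}: small helpful library vs. full library
%   q_L = P(correct skill invoked | L)
%   a_L = P(pass | correct skill invoked, L)
%   b_L = P(pass | wrong or no skill invoked, L)
\begin{align*}
P_L &= q_L\, a_L + (1 - q_L)\, b_L, \\
\Delta &= P_S - P_F \\
  &= \underbrace{(q_S - q_F)(a_S - b_S)}_{\text{selection shift (shadowing-like)}}
   + \underbrace{q_F\,(a_S - a_F)}_{\text{execution change, correct selection (overhead-like)}}
   + \underbrace{(1 - q_F)(b_S - b_F)}_{\text{execution change, incorrect selection}} .
\end{align*}
```

Expanding the three terms recovers q_S a_S + (1 - q_S) b_S - q_F a_F - (1 - q_F) b_F, so the split is an exact identity; only the first term depends on how selection probability changes with library size.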

If this is right

Skill selection accuracy must be improved to prevent further degradation as libraries scale.
Context length alone does not explain the performance losses seen in expanded libraries.
Upper bounds on the two effects confirm that shadowing grows while overhead stays near zero.
The asymmetry between the effects holds across the tested library sizes up to 202 skills.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Library designers may benefit from mechanisms that rank or filter skills before presenting them to the agent.
Hierarchical or retrieval-based skill access could reduce shadowing without shrinking the total library.
Testing whether agents with explicit verification steps before execution show reduced shadowing would extend the decomposition.
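
The ranking-or-filtering idea above can be sketched as a pre-selection step that shows the agent only a shortlist of skills. This is a minimal sketch under stated assumptions: the `shortlist` function, the Jaccard token-overlap score, and the example skill names are all hypothetical, and nothing here is a method from the paper.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist(task, skills, k=3):
    """Return the k skill names whose name+description best overlap the task.

    `skills` maps a (hypothetical) skill name to its description.
    Scoring is plain Jaccard overlap; ties are broken by name.
    """
    task_tokens = tokens(task)

    def score(item):
        name, description = item
        skill_tokens = tokens(name + " " + description)
        union = task_tokens | skill_tokens
        return len(task_tokens & skill_tokens) / len(union) if union else 0.0

    ranked = sorted(skills.items(), key=lambda item: (-score(item), item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical three-skill library for illustration only.
example_skills = {
    "pdf-extract": "extract tables and text from pdf files",
    "csv-clean": "clean and normalize csv data",
    "git-bisect": "find the commit that introduced a bug",
}
top = shortlist("Extract the tables from this PDF report", example_skills, k=1)
```

A lexical score is the simplest possible choice; an embedding-based retriever would fill the same slot. Whether any such filter actually reduces shadowing is an open question that the decomposition could be used to test.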

Load-bearing premise

The overall pass rate drop can be validly split into skill shadowing and context overhead effects by conditioning on which skills the agent actually invokes during a trajectory.

What would settle it

An experiment that forces agents to invoke only correct skills across both small and large libraries and still measures a large pass rate drop attributable to context length would falsify the claim that shadowing dominates.

Figures

Figures reproduced from arXiv: 2605.24050 by Hongwen Song, Song Wei.

Original abstract

Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through natural language without knowing which skills exist or how they work. However, performance degrades as libraries grow, by up to 21% when scaling from a small set of helpful skills to a 202-skill library. In this work, we formulate this performance degradation as the pass rate drop between loading a library of known-helpful skills and the full library. Moreover, we propose to decompose the pass rate drop by conditioning on skill invocation (which skills the agent selects during a trajectory) into two effects: skill shadowing, where the agent selects wrong skills more often as the library expands, and context overhead, where the enlarged context degrades execution even when selection is correct. We derive upper bounds on both effects to characterize the magnitude of their impact on the pass rate drop. Our empirical estimates of the effects and their upper bounds both show that the skill shadowing effect grows with library size and significantly contributes to the performance degradation, whereas the context overhead effect remains small and indistinguishable from zero. This observed asymmetry establishes that skill selection failure, not the enlarged context, is the primary bottleneck when expanding skill libraries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Skill shadowing from poor selection, not context length, drives most of the performance drop when LLM agents scale their skill libraries.

The main result is that performance falls as skill libraries grow because agents invoke the wrong skills more often, while the extra context from a larger library has almost no measurable effect on execution once the right skill is picked.

The paper sets up the drop as the pass-rate difference between a small helpful library and the full 202-skill set. It then splits the drop by conditioning on whether the agent actually invokes a correct skill during a run. This lets the authors separate skill shadowing (wrong choices) from context overhead (worse execution even with the right skill). They give upper bounds on each component and report empirical estimates showing that shadowing grows with library size and accounts for most of the loss, while overhead stays near zero. That decomposition is the concrete new piece; it turns a known scaling complaint into two measurable effects with a clear winner.
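
To see how such a split could be computed from logged runs, here is a minimal sketch. It assumes each trajectory carries two hypothetical boolean labels (`correct_skill`, `passed`) and uses a generic total-probability split; the field names, the three-term form, and the example counts are illustrative assumptions, not the paper's estimator or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """One agent run; both fields are hypothetical labels."""
    correct_skill: bool  # did the agent invoke the known-helpful skill(s)?
    passed: bool         # did the run pass the task checker?


def _rates(runs):
    """Return (q, a, b): P(correct), P(pass | correct), P(pass | incorrect)."""
    runs = list(runs)
    hit = [r for r in runs if r.correct_skill]
    miss = [r for r in runs if not r.correct_skill]
    q = len(hit) / len(runs)
    # An empty stratum gets rate 0.0; the identity below still holds,
    # but the individual terms are then not meaningful.
    a = sum(r.passed for r in hit) / len(hit) if hit else 0.0
    b = sum(r.passed for r in miss) / len(miss) if miss else 0.0
    return q, a, b


def decompose(small, full):
    """Split pass-rate drop (small minus full) into three additive terms."""
    q_s, a_s, b_s = _rates(small)
    q_f, a_f, b_f = _rates(full)
    drop = (q_s * a_s + (1 - q_s) * b_s) - (q_f * a_f + (1 - q_f) * b_f)
    return {
        "drop": drop,
        "shadowing": (q_s - q_f) * (a_s - b_s),   # selection got worse
        "overhead": q_f * (a_s - a_f),            # execution, correct skill
        "residual": (1 - q_f) * (b_s - b_f),      # execution, wrong skill
    }


# Made-up counts: 10 runs per library.
small_runs = ([Trajectory(True, True)] * 8 + [Trajectory(True, False)]
              + [Trajectory(False, False)])
full_runs = ([Trajectory(True, True)] * 5 + [Trajectory(True, False)]
             + [Trajectory(False, True)] + [Trajectory(False, False)] * 3)
parts = decompose(small_runs, full_runs)
```

The three terms always sum to the drop by construction, so the split itself cannot be wrong; what can be wrong is reading the `overhead` term causally, because the correct-selection stratum contains different tasks in the two libraries.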

The approach is straightforward and the asymmetry in the estimates is useful for anyone working on tool-using agents. It correctly focuses attention on selection mechanisms rather than context compression tricks.

The conditioning step does raise the selection-bias issue flagged in the stress test. Tasks where the agent still picks correctly in the large library may be easier or have stronger cues, so the conditional comparison could understate context overhead on the original task distribution. The abstract does not show whether the bounds correct for this, so that needs checking in the full methods. If the bias is small in their data it is minor; if not, it weakens the claim that overhead is negligible.

This is for researchers building or evaluating skill-augmented LLM agents. Anyone who cares about practical scaling limits will find the distinction worth testing. The core claim is internally consistent and points to a real bottleneck, so the paper deserves a serious referee to verify the experiments and the bias question.

Referee Report

1 major / 0 minor

Summary. The paper claims that LLM agent performance degrades with larger skill libraries (up to 21% pass-rate drop from small helpful sets to a 202-skill library). It decomposes this drop, via conditioning on the skills invoked during trajectories, into skill shadowing (higher rate of incorrect skill selection) and context overhead (execution degradation even with correct selection). Upper bounds are derived for both effects; empirical estimates indicate skill shadowing grows with library size and drives most of the degradation, while context overhead remains small and statistically indistinguishable from zero.
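
A claim of "statistically indistinguishable from zero" is typically backed by an interval estimate that contains zero. The sketch below assumes a percentile bootstrap over per-run pass/fail outcomes; the function name, the resampling unit, and the defaults are assumptions, and the paper's actual procedure is not shown in the material reviewed here.

```python
import random


def bootstrap_diff_ci(small, full, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for pass_rate(small) - pass_rate(full).

    `small` and `full` are lists of 0/1 (or bool) outcomes, one per run.
    Each group is resampled independently with replacement.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        s = rng.choices(small, k=len(small))
        f = rng.choices(full, k=len(full))
        diffs.append(sum(s) / len(s) - sum(f) / len(f))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Degenerate example: every small-library run passes, every full-library
# run fails, so every resampled difference is exactly 1.0.
ci_extreme = bootstrap_diff_ci([1] * 20, [0] * 20)
```

If the resulting interval for an effect contains zero, the data do not distinguish that effect from zero at the chosen level; this is weaker than showing the effect is absent, which is why the upper bounds matter.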

Significance. If the decomposition is valid, the result identifies skill selection failure as the dominant scaling bottleneck for skill-library agents and suggests that context-length management is secondary. The provision of both theoretical upper bounds and empirical estimates is a methodological strength that could inform targeted improvements in agent architectures.

major comments (1)

[Abstract (decomposition paragraph) and the section deriving the bounds] The central decomposition (conditioning on correct skill invocation to isolate context overhead) is load-bearing for the claim that context overhead is negligible. Because invocation success is a post-treatment variable whose probability decreases with library size, the conditional comparison is performed on a selected subset of tasks (those retaining high invocation probability in the large library). These tasks are likely easier or have stronger skill cues, which can downward-bias the measured context-overhead effect and thereby overstate the relative contribution of skill shadowing. The upper-bound derivations do not appear to adjust for this selection mechanism.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying a potential selection bias arising from conditioning on post-treatment invocation success. We address the concern directly below.

Referee: [Abstract (decomposition paragraph) and the section deriving the bounds] The central decomposition (conditioning on correct skill invocation to isolate context overhead) is load-bearing for the claim that context overhead is negligible. Because invocation success is a post-treatment variable whose probability decreases with library size, the conditional comparison is performed on a selected subset of tasks (those retaining high invocation probability in the large library). These tasks are likely easier or have stronger skill cues, which can downward-bias the measured context-overhead effect and thereby overstate the relative contribution of skill shadowing. The upper-bound derivations do not appear to adjust for this selection mechanism.

Authors: We acknowledge that conditioning on successful invocation (a post-treatment variable) selects a non-random subset of tasks whose invocation probability remains high even in the large library; these tasks may indeed be easier or possess stronger cues, which can downward-bias the conditional estimate of context overhead. Our upper bounds on both effects, however, are derived from the unconditional pass-rate differences between the small and full libraries and do not rely on the conditional comparison; they therefore remain valid even after accounting for selection. The empirical conditional estimates are presented only as a descriptive decomposition, not as the sole basis for the negligibility claim. We will revise the relevant sections to explicitly discuss this selection mechanism, its direction of bias, and the role of the unconditional upper bounds in bounding the effects.

revision: partial
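
The selection mechanism at issue in this exchange can be illustrated with a small deterministic calculation. Every number below is invented for illustration: task difficulty on a uniform grid, execution pass rates that fall with difficulty, and a full-library selection probability that also falls with difficulty. It is a toy model of the referee's concern, not an analysis of the paper's data.

```python
def selection_bias_demo(n=1000):
    """Compare the true average overhead with the naive conditional estimate."""
    # Hypothetical task difficulties, evenly spaced in (0, 1).
    ds = [(i + 0.5) / n for i in range(n)]

    # Assumed P(pass | correct skill invoked) per task.
    a_small = [1 - 0.5 * d for d in ds]
    a_full = [1 - 0.6 * d for d in ds]   # true per-task overhead is 0.1 * d

    # Assumed P(correct skill invoked) per task.
    q_small = [1.0 for _ in ds]          # small library: always correct
    q_full = [1 - d for d in ds]         # full library: hard tasks get shadowed

    # True overhead, averaged over the original task distribution.
    true_overhead = sum(s - f for s, f in zip(a_small, a_full)) / n

    # Naive estimate: compare pass rates among correct-selection runs only,
    # which reweights the full-library stratum toward easy tasks.
    cond_small = sum(a * q for a, q in zip(a_small, q_small)) / sum(q_small)
    cond_full = sum(a * q for a, q in zip(a_full, q_full)) / sum(q_full)
    naive_overhead = cond_small - cond_full

    return true_overhead, naive_overhead


true_overhead, naive_overhead = selection_bias_demo()
```

In this toy setup the true average overhead is about 0.05, while the conditional comparison returns about -0.05: the surviving full-library runs are concentrated on easy tasks, so the naive estimate is pulled below the true value. The size and even the presence of such bias in real data depends on how strongly selection probability and execution difficulty are correlated.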

Circularity Check

0 steps flagged

No significant circularity; decomposition is definitional with independent empirical estimates

The paper defines the two effects explicitly via conditioning on invocation and then reports separate empirical estimates plus upper bounds computed from observed trajectories. No equations, fitted parameters, or self-citations are shown that reduce the central claims (growth of shadowing vs. negligible overhead) to inputs by construction. The estimates remain data-driven quantities rather than tautological renamings or self-referential fits. This is the normal non-circular case for an empirical decomposition paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Based on abstract only; the decomposition itself rests on a domain assumption about separability of selection and context effects, with two new conceptual effects introduced without independent evidence outside the paper.

axioms (1)

domain assumption: The observed pass rate drop between a helpful-skill library and the full library can be decomposed by conditioning on the skill(s) invoked during a trajectory.
This separability assumption is required to isolate skill shadowing from context overhead and derive separate upper bounds.

invented entities (2)

skill shadowing effect (no independent evidence)
purpose: Quantifies increased rate of incorrect skill selection as library size grows
Newly defined component of the decomposition; no external falsifiable handle provided in abstract.

context overhead effect (no independent evidence)
purpose: Quantifies execution degradation from enlarged context even with correct skill selection
Newly defined component of the decomposition; no external falsifiable handle provided in abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior
cs.AI · 2026-06 · unverdicted · novelty 6.0

Empirical study finds Progressive Disclosure raises distinct resources touched (1.18 to 3.85) and uptake events (1.33 to 3.92) per trajectory, adds 17 passing trials out of 410 (+4.1%), with gains task-dependent.

Reference graph

Works this paper leans on

7 extracted references · 6 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] Tiantian Gan and Qiyao Sun. RAG-MCP: Mitigating prompt bloat in LLM tool selection via retrieval-augmented generation. arXiv preprint arXiv:2505.03275.

[2] Yudong Gao, Zongjie Li, Zimo Ji, Pingchuan Ma, Shuai Wang, et al. SkillReducer: Optimizing LLM agent skills for token efficiency. arXiv preprint arXiv:2603.29919.

[3] Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. SoK: Agentic skills -- beyond tool use in LLM agents. arXiv preprint arXiv:2602.20867.

[4] Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670.

[5] Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, and Shiyu Chang. How well do agentic skills work in the wild: Benchmarking LLM skill usage in realistic settings. arXiv preprint arXiv:2604.04323.

[6] Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430.
