More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries
Pith reviewed 2026-06-30 15:40 UTC · model grok-4.3
The pith
Skill shadowing from wrong selections, not context overhead, drives most performance loss as LLM skill libraries grow.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By conditioning on the skill or skills invoked during each trajectory, the pass rate drop between a small library of known-helpful skills and the full library can be separated into skill shadowing, which increases with library size, and context overhead, which does not. Empirical measurements and their upper bounds both indicate that skill shadowing contributes substantially to the observed degradation while context overhead contributes negligibly, establishing skill selection failure as the primary bottleneck when libraries expand.
What carries the argument
Decomposition of the pass rate drop by conditioning on invoked skills, which isolates skill shadowing (increased rate of incorrect selections) from context overhead (execution degradation with correct selections) and supplies upper bounds on each.
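One way such a decomposition could be computed from logged trajectories is sketched below. This is an illustrative reading, not the paper's own estimator: the record format (a pair of invoked skill names and a pass flag), the function names `summarize` and `decompose_drop`, and the choice to weight the selection shift by full-library pass rates are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def summarize(trajectories):
    """Return (invocation shares, conditional pass rates) keyed by invoked skill set.

    Each trajectory is a hypothetical record: (iterable of skill names, passed flag).
    """
    counts = Counter()
    passes = defaultdict(int)
    for invoked, passed in trajectories:
        key = frozenset(invoked)
        counts[key] += 1
        passes[key] += int(passed)
    n = sum(counts.values())
    share = {k: c / n for k, c in counts.items()}
    rate = {k: passes[k] / counts[k] for k in counts}
    return share, rate


def decompose_drop(small, full):
    """Split the pass rate drop (small minus full) into two additive terms.

    shadowing: change in which skills get invoked, weighted by full-library pass rates.
    overhead:  change in pass rate for the same invoked skills, weighted by
               small-library invocation shares.
    """
    p_s, q_s = summarize(small)
    p_f, q_f = summarize(full)
    shadowing = 0.0
    overhead = 0.0
    for k in set(p_s) | set(p_f):
        ps, pf = p_s.get(k, 0.0), p_f.get(k, 0.0)
        # A conditional pass rate is undefined when a skill set never occurs in
        # one condition; borrowing the other condition's rate keeps the sum exact.
        qs = q_s.get(k, q_f.get(k, 0.0))
        qf = q_f.get(k, q_s.get(k, 0.0))
        shadowing += (ps - pf) * qf
        overhead += ps * (qs - qf)
    total = sum(p_s[k] * q_s[k] for k in p_s) - sum(p_f[k] * q_f[k] for k in p_f)
    return {"total_drop": total, "shadowing": shadowing, "overhead": overhead}
```

The two terms always sum to the total drop, so the split is an accounting identity; whether each term deserves a causal reading is a separate question.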
If this is right
- Skill selection accuracy must be improved to prevent further degradation as libraries scale.
- Context length alone does not explain the performance losses seen in expanded libraries.
- Both the empirical estimates and the upper bounds show shadowing growing with library size while overhead stays near zero.
- The asymmetry between the effects holds across the tested library sizes up to 202 skills.
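A claim such as "indistinguishable from zero" is usually backed by an interval estimate. The sketch below shows a plain percentile bootstrap over per-task effect estimates. It is a generic illustration, not the paper's procedure: the function name `bootstrap_ci`, the resample count, and the idea of resampling per-task values are assumptions.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and record the resampled mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the interval for the overhead term covers zero while the interval for the shadowing term sits above zero, that would match the asymmetry described above.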
Where Pith is reading between the lines
- Library designers may benefit from mechanisms that rank or filter skills before presenting them to the agent.
- Hierarchical or retrieval-based skill access could reduce shadowing without shrinking the total library.
- Testing whether agents with explicit verification steps before execution show reduced shadowing would extend the decomposition.
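The ranking-or-filtering idea in the first two points can be sketched as a shortlist step that runs before the agent sees any skill descriptions. Everything here is hypothetical: the function names, the skill dictionary format, and the use of Jaccard token overlap as the score are stand-ins for whatever retriever a real system would use.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_skills(task, skills, k=5):
    """Return the names of the k skills whose text overlaps most with the task.

    `skills` maps a skill name to its description (a hypothetical format).
    """
    task_tokens = tokens(task)

    def score(item):
        name, description = item
        skill_tokens = tokens(name + " " + description)
        union = task_tokens | skill_tokens
        # Jaccard similarity between task tokens and skill tokens.
        return len(task_tokens & skill_tokens) / len(union) if union else 0.0

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(skills.items(), key=lambda item: (-score(item), item[0]))
    return [name for name, _ in ranked[:k]]
```

Only the shortlisted skills would then be placed in the agent's context, which keeps the full library available without exposing every competing description at once.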
Load-bearing premise
The overall pass rate drop can be validly split into skill shadowing and context overhead effects by conditioning on which skills the agent actually invokes during a trajectory.
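To make the premise concrete, here is one standard way to write such a split. The notation is assumed for illustration and need not match the paper's: $p_L(s)$ is the probability that skill set $s$ is invoked under library $L$, and $q_L(s)$ is the pass rate given that $s$ was invoked.

```latex
\begin{aligned}
\Delta
  &= \sum_{s} p_{\mathrm{small}}(s)\, q_{\mathrm{small}}(s)
   - \sum_{s} p_{\mathrm{full}}(s)\, q_{\mathrm{full}}(s) \\
  &= \underbrace{\sum_{s} \bigl[p_{\mathrm{small}}(s) - p_{\mathrm{full}}(s)\bigr]\, q_{\mathrm{full}}(s)}_{\text{selection shift (shadowing)}}
   + \underbrace{\sum_{s} p_{\mathrm{small}}(s)\,\bigl[q_{\mathrm{small}}(s) - q_{\mathrm{full}}(s)\bigr]}_{\text{execution shift (context overhead)}}
\end{aligned}
```

The identity holds by algebra alone, and an equally valid version weights the selection shift by $q_{\mathrm{small}}(s)$ and the execution shift by $p_{\mathrm{full}}(s)$. The premise therefore concerns interpretation rather than arithmetic: the split is informative only if $q_L(s)$ compares like with like across the two libraries.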
What would settle it
An experiment that forces agents to invoke only correct skills across both small and large libraries and still measures a large pass rate drop attributable to context length would falsify the claim that shadowing dominates.
Original abstract
Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through natural language without knowing which skills exist or how they work. However, performance degrades as libraries grow, by up to 21% when scaling from a small set of helpful skills to a 202-skill library. In this work, we formulate this performance degradation as the pass rate drop between loading a library of known-helpful skills and the full library. Moreover, we propose to decompose the pass rate drop by conditioning on the skill(s) invocation (which skills the agent selects during a trajectory) into two effects: skill shadowing, where the agent selects wrong skills more often as the library expands, and context overhead, where the enlarged context degrades execution even when selection is correct. We derive upper bounds on both effects to characterize their magnitudes of impacts to the pass rate drop. Our empirical estimates of the effects and their upper bounds both show that the skill shadowing effect grows with library size and significantly contributes to the performance degradation, whereas the context overhead effect remains small and indistinguishable from zero. This observed asymmetry establishes that the skill selection failure, not the enlarged context, is the primary bottleneck when expanding the skill libraries.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM agent performance degrades with larger skill libraries (up to 21% pass-rate drop from small helpful sets to a 202-skill library). It decomposes this drop, via conditioning on the skills invoked during trajectories, into skill shadowing (higher rate of incorrect skill selection) and context overhead (execution degradation even with correct selection). Upper bounds are derived for both effects; empirical estimates indicate skill shadowing grows with library size and drives most of the degradation, while context overhead remains small and statistically indistinguishable from zero.
Significance. If the decomposition is valid, the result identifies skill selection failure as the dominant scaling bottleneck for skill-library agents and suggests that context-length management is secondary. The provision of both theoretical upper bounds and empirical estimates is a methodological strength that could inform targeted improvements in agent architectures.
major comments (1)
- [Abstract (decomposition paragraph) and the section deriving the bounds] The central decomposition (conditioning on correct skill invocation to isolate context overhead) is load-bearing for the claim that context overhead is negligible. Because invocation success is a post-treatment variable whose probability decreases with library size, the conditional comparison is performed on a selected subset of tasks (those retaining high invocation probability in the large library). These tasks are likely easier or have stronger skill cues, which can downward-bias the measured context-overhead effect and thereby overstate the relative contribution of skill shadowing. The upper-bound derivations do not appear to adjust for this selection mechanism.
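The selection concern can be illustrated with a small exact calculation. All numbers below are invented for the example and are not taken from the paper; the two task types, their weights, and the function names are assumptions. Strong-cue tasks suffer no overhead and almost always keep the correct skill in the full library, while weak-cue tasks suffer real overhead but rarely keep it.

```python
def conditional_pass_rate(task_types, library):
    """P(pass | correct skill invoked), pooled over task types."""
    num = sum(
        t["weight"] * t["p_correct"][library] * t["p_pass_given_correct"][library]
        for t in task_types
    )
    den = sum(t["weight"] * t["p_correct"][library] for t in task_types)
    return num / den


def true_overhead(task_types):
    """Weight-averaged within-task drop in pass rate given correct invocation."""
    return sum(
        t["weight"]
        * (t["p_pass_given_correct"]["small"] - t["p_pass_given_correct"]["full"])
        for t in task_types
    )


# Hypothetical population: values chosen only to make the bias visible.
TASK_TYPES = [
    {"name": "strong-cue", "weight": 0.5,
     "p_correct": {"small": 1.0, "full": 0.9},
     "p_pass_given_correct": {"small": 0.9, "full": 0.9}},
    {"name": "weak-cue", "weight": 0.5,
     "p_correct": {"small": 1.0, "full": 0.3},
     "p_pass_given_correct": {"small": 0.6, "full": 0.4}},
]

# Naive estimate: difference of pooled conditional pass rates.
naive = (conditional_pass_rate(TASK_TYPES, "small")
         - conditional_pass_rate(TASK_TYPES, "full"))
print(f"naive conditional estimate: {naive:.3f}")
print(f"within-task overhead:       {true_overhead(TASK_TYPES):.3f}")
```

In this toy population the pooled conditional comparison comes out slightly negative even though the average within-task overhead is positive, because the full-library "correct invocation" group is dominated by strong-cue tasks. That is the direction of bias the comment describes.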
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying a potential selection bias arising from conditioning on post-treatment invocation success. We address the concern directly below.
- Referee: [Abstract (decomposition paragraph) and the section deriving the bounds] The central decomposition (conditioning on correct skill invocation to isolate context overhead) is load-bearing for the claim that context overhead is negligible. Because invocation success is a post-treatment variable whose probability decreases with library size, the conditional comparison is performed on a selected subset of tasks (those retaining high invocation probability in the large library). These tasks are likely easier or have stronger skill cues, which can downward-bias the measured context-overhead effect and thereby overstate the relative contribution of skill shadowing. The upper-bound derivations do not appear to adjust for this selection mechanism.
Authors: We acknowledge that conditioning on successful invocation (a post-treatment variable) selects a non-random subset of tasks whose invocation probability remains high even in the large library; these tasks may indeed be easier or possess stronger cues, which can downward-bias the conditional estimate of context overhead. Our upper bounds on both effects, however, are derived from the unconditional pass-rate differences between the small and full libraries and do not rely on the conditional comparison; they therefore remain valid even after accounting for selection. The empirical conditional estimates are presented only as a descriptive decomposition, not as the sole basis for the negligibility claim. We will revise the relevant sections to explicitly discuss this selection mechanism, its direction of bias, and the role of the unconditional upper bounds in bounding the effects.
Revision: partial
Circularity Check
No significant circularity; decomposition is definitional with independent empirical estimates
The paper defines the two effects explicitly via conditioning on invocation and then reports separate empirical estimates plus upper bounds computed from observed trajectories. No equations, fitted parameters, or self-citations are shown that reduce the central claims (growth of shadowing vs. negligible overhead) to inputs by construction. The estimates remain data-driven quantities rather than tautological renamings or self-referential fits. This is the normal non-circular case for an empirical decomposition paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The observed pass rate drop between a helpful-skill library and the full library can be decomposed by conditioning on the skill(s) invoked during a trajectory.
invented entities (2)
- skill shadowing effect: no independent evidence
- context overhead effect: no independent evidence
Forward citations
Cited by 1 Pith paper
- SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior
Empirical study finds Progressive Disclosure raises distinct resources touched (1.18 to 3.85) and uptake events (1.33 to 3.92) per trajectory, adds 17 passing trials out of 410 (+4.1%), with gains task-dependent.