SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
Pith reviewed 2026-05-12 01:08 UTC · model grok-4.3
The pith
SkillLens organizes skills into a four-layer graph so LLM agents can reuse matching parts at any detail level and rewrite only what mismatches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SkillLens builds a four-layer skill graph of policies, strategies, procedures, and primitives, retrieves relevant seeds, expands them via a degree-corrected random walk, and applies a verifier that decides whether each visited unit should be accepted, decomposed, rewritten, or skipped. This mixed-granularity process reuses compatible subskills while adapting only locally mismatched components; under sparse-mismatch assumptions the adaptation cost is sublinear, and evolutionary refinement of the skills and verifier monotonically improves the validation objective.
What carries the argument
Four-layer hierarchical skill graph with degree-corrected random-walk expansion and a verifier that routes each unit to accept, decompose, rewrite, or skip.
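A minimal Python sketch of that retrieve-expand-route loop, to fix ideas. Everything concrete here is an assumption layered on the abstract: the `SkillUnit` fields, the walk's hub correction, the score thresholds, and the `verifier`/`rewrite_fn` interfaces are illustrative, not the paper's implementation.

```python
import random
from dataclasses import dataclass, field

LAYERS = ("policy", "strategy", "procedure", "primitive")

@dataclass
class SkillUnit:
    uid: str
    layer: str                                     # one of LAYERS
    text: str                                      # content injected into the prompt
    children: list = field(default_factory=list)   # edges to finer-grained units

def degree_corrected_walk(seeds, steps=20, gamma=0.5, rng=random):
    """Expand retrieved seeds over the skill graph. High-degree hubs are
    down-weighted by degree**(-gamma) so generic skills do not dominate;
    the exact correction SkillLens uses is not given in the abstract."""
    visited, frontier = [], list(seeds)
    for _ in range(steps):
        if not frontier:
            break
        node = frontier.pop(0)
        visited.append(node)
        if node.children:
            weights = [max(len(c.children), 1) ** -gamma for c in node.children]
            frontier.append(rng.choices(node.children, weights=weights, k=1)[0])
    return visited

def route(unit, task, verifier):
    """Verifier routes each visited unit to one of four actions. The
    0.8/0.5/0.2 cutoffs stand in for the 'verifier decision thresholds'
    listed as free parameters below; the values are placeholders."""
    score = verifier(unit, task)      # assumed: higher = better task match
    if score > 0.8:
        return "accept"               # reuse verbatim
    if score > 0.5 and unit.children:
        return "decompose"            # recurse into finer-grained children
    if score > 0.2:
        return "rewrite"              # locally adapt the mismatched unit
    return "skip"                     # irrelevant context; do not inject

def assemble(seeds, task, verifier, rewrite_fn):
    """Build prompt context at mixed granularity."""
    parts = []
    for unit in degree_corrected_walk(seeds):
        action = route(unit, task, verifier)
        if action == "accept":
            parts.append(unit.text)
        elif action == "decompose":
            parts.extend(assemble(unit.children, task, verifier, rewrite_fn))
        elif action == "rewrite":
            parts.append(rewrite_fn(unit, task))   # the only LLM-costly path
    return parts
```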
If this is right
- Bug-localization Acc@1 rises by as much as 6.31 percentage points over strong skill baselines.
- Agent success rate on ALFWorld climbs from 45.00 percent to 51.31 percent.
- Adaptation cost remains sublinear whenever mismatches stay sparse.
- Evolutionary updates keep raising the validation objective until a local optimum is reached.
Where Pith is reading between the lines
- The same mixed-granularity routing could be applied to any procedural knowledge base where partial reuse is cheaper than wholesale replacement.
- If the sparse-mismatch regime holds across many domains, skill libraries could grow much larger without proportional increases in per-task cost.
- Verifier accuracy becomes the practical bottleneck once mismatch density rises; measuring its error rate on held-out tasks would quantify the safe operating range.
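That bottleneck is measurable. A sketch of the held-out measurement, assuming a hypothetical decision log with oracle routing labels; none of this instrumentation is described in the paper:

```python
from collections import defaultdict

def verifier_error_profile(log):
    """Estimate verifier error rate as a function of mismatch density.
    `log` is an assumed record format, one tuple per held-out unit:
    (mismatch_density, verifier_action, oracle_action), where oracle
    labels come from human annotation or downstream task success."""
    errors, totals = defaultdict(int), defaultdict(int)
    for density, predicted, oracle in log:
        bin_id = min(int(density * 10), 9)   # ten bins over [0, 1]
        totals[bin_id] += 1
        errors[bin_id] += int(predicted != oracle)
    return {b / 10: errors[b] / totals[b] for b in sorted(totals)}
```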
Load-bearing premise
Mismatches between stored skills and new tasks are sparse, so most components can be reused without dense rewriting.
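A back-of-the-envelope cost model makes the premise concrete. With per-unit verification cost c_v, per-unit rewrite cost c_r, and mismatch density rho (symbols assumed here, not taken from the paper), verification is cheap and rewriting expensive, so total adaptation cost is dominated by the rho fraction of units that mismatch:

```python
def adaptation_cost(n_visited, rho, c_verify=1.0, c_rewrite=20.0):
    """Expected cost of verify-everything, rewrite-only-mismatches
    over the units visited by the walk (illustrative unit costs)."""
    return n_visited * c_verify + rho * n_visited * c_rewrite

def full_rewrite_cost(n_visited, c_rewrite=20.0):
    return n_visited * c_rewrite

# Mixed-granularity wins whenever c_v + rho * c_r < c_r,
# i.e. rho < 1 - c_v / c_r (0.95 with the toy costs above).
# When rho is small, the rewrite term rho * n * c_r dominates, so
# cost scales with the few mismatched units rather than skill size.
```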
What would settle it
Run SkillLens on tasks engineered so that nearly every retrieved skill component mismatches; if total cost then exceeds the cost of full rewriting while performance stays flat or drops, the sparse-mismatch premise fails.
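Phrased as an experiment harness, that test is a density sweep. The callables `perturb`, `run_mixed`, and `run_full` are hypothetical stand-ins for the task perturbation and the two systems under comparison:

```python
def stress_test(tasks, perturb, run_mixed, run_full,
                densities=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Sweep engineered mismatch density rho. The sparse-mismatch premise
    fails if mixed-granularity cost crosses full-rewrite cost while
    success stays flat or drops. Runners return (cost, success) pairs."""
    results = []
    for rho in densities:
        totals = [0.0, 0.0, 0.0, 0.0]   # mixed cost/acc, full cost/acc
        for task in tasks:
            hard = perturb(task, target_density=rho)
            cost_m, ok_m = run_mixed(hard)
            cost_f, ok_f = run_full(hard)
            totals[0] += cost_m; totals[1] += ok_m
            totals[2] += cost_f; totals[3] += ok_f
        n = len(tasks)
        results.append({"rho": rho,
                        "mixed_cost": totals[0] / n, "mixed_acc": totals[1] / n,
                        "full_cost": totals[2] / n, "full_acc": totals[3] / n})
    return results
```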
Original abstract
Skill libraries have become a practical way for LLM agents to reuse procedural experience across tasks. However, existing systems typically treat skills as flat, single-resolution prompt blocks. This creates a tension between relevance and cost: injecting coarse skills can introduce irrelevant or misleading context, while rewriting entire skills is expensive and often unnecessary. We propose SkillLens, a hierarchical skill-evolution framework that organizes skills into a four-layer graph of policies, strategies, procedures, and primitives, and retrieves them at mixed granularity. Given a task, SkillLens first retrieves semantically relevant skill seeds, expands them through degree-corrected random walk over the skill graph, and then uses a verifier to decide whether each visited unit should be accepted, decomposed, rewritten, or skipped. This enables the agent to reuse compatible subskills directly while adapting only locally mismatched components. To improve the system over time, SkillLens further refines multi-granularity skills and verifier in order to improve its routing decisions. We provide theoretical analysis showing that mixed-granularity adaptation incurs sublinear cost under sparse mismatch assumptions and that the evolutionary update rule monotonically improves the validation objective until a local optimum. Across MuLocbench and ALFWorld, SkillLens consistently improves over strong skill-based baselines, achieving up to a 6.31 percentage-point Acc@1 gain for bug localization and raising agent success rate from 45.00% to 51.31%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SkillLens, a hierarchical skill-evolution framework for LLM agents. Skills are organized into a four-layer graph (policies, strategies, procedures, primitives). Given a task, the system retrieves semantically relevant seeds, expands them via degree-corrected random walk, and applies a verifier to decide acceptance, decomposition, rewriting, or skipping of components. It further employs an evolutionary update to refine skills and the verifier over time. The paper claims that mixed-granularity adaptation incurs sublinear cost under sparse mismatch assumptions, that the evolutionary rule monotonically improves the validation objective until a local optimum, and that the approach yields empirical gains (up to 6.31 pp Acc@1 on bug localization; agent success rate rising from 45.00% to 51.31%) over skill-based baselines on MuLocbench and ALFWorld.
Significance. If the sparse-mismatch assumption and verifier reliability hold, SkillLens would provide a principled mechanism for cost-efficient, local skill adaptation in LLM agents, reducing the need for full rewrites while maintaining relevance. The empirical gains and the combination of hierarchical retrieval with evolutionary refinement could influence the design of scalable long-term memory systems for agents.
major comments (2)
- [Abstract / Theoretical analysis] The sublinear cost guarantee is load-bearing for the cost-efficiency claim and rests on the sparse-mismatch assumption (most skill components already match the new task). No measurements of mismatch density, the fraction of visited units requiring decomposition, rewriting, or skipping, are reported on MuLocbench or ALFWorld. Without such grounding, the degree-corrected random walk plus per-unit verifier overhead may exceed the cost of full rewriting when mismatches are dense, directly undermining the cost-efficiency claim.
- [Theoretical analysis] The claim that the evolutionary update rule monotonically improves the validation objective is asserted without a definition of the validation objective, details of verifier training, or a demonstration that the objective is independent of the system's own outputs. This leaves the monotonicity result vulnerable to circularity.
minor comments (2)
- [Abstract] The four layers are named, but their precise definitions and inter-layer relations are not summarized; a one-sentence clarification would improve readability.
- [Experiments] It is unclear whether baselines receive equivalent total token budgets or verifier calls; explicit resource-matched comparisons would strengthen the empirical claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract / Theoretical analysis] The sublinear cost guarantee is load-bearing for the cost-efficiency claim and rests on the sparse-mismatch assumption (most skill components already match the new task). No measurements of mismatch density, the fraction of visited units requiring decomposition, rewriting, or skipping, are reported on MuLocbench or ALFWorld. Without such grounding, the degree-corrected random walk plus per-unit verifier overhead may exceed the cost of full rewriting when mismatches are dense, directly undermining the cost-efficiency claim.
Authors: We agree that empirical measurements of mismatch density would provide stronger grounding for the sublinear cost claim. The manuscript derives the sublinear cost result theoretically under the sparse-mismatch assumption but does not report the observed fraction of units requiring adaptation. In the revision we will add a dedicated analysis subsection that computes and reports mismatch densities (the fraction of visited units triggering decomposition, rewriting, or skipping) directly from the experimental logs on both MuLocbench and ALFWorld, together with a brief discussion of cost behavior under denser mismatch regimes.
revision: yes
- Referee: [Theoretical analysis] The claim that the evolutionary update rule monotonically improves the validation objective is asserted without a definition of the validation objective, details of verifier training, or a demonstration that the objective is independent of the system's own outputs. This leaves the monotonicity result vulnerable to circularity.
Authors: We thank the referee for highlighting the need for greater precision in the theoretical section. The current manuscript states the monotonicity result but does not supply an explicit definition of the validation objective or verifier training details. In the revision we will (1) define the validation objective as the expected success rate on a fixed, held-out validation task set that is independent of the current skill library and verifier outputs, (2) describe the verifier training procedure (periodic supervised updates on oracle-labeled adaptation decisions collected from prior executions), and (3) expand the proof sketch to show that each evolutionary step selects refinements that strictly increase this externally defined objective until a local optimum is reached, thereby eliminating the circularity.
revision: yes
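The promised fix amounts to a hill-climbing guarantee: score candidate refinements on a frozen held-out set and keep only strict improvements, which makes the validation curve monotone by construction. A minimal sketch, assuming `propose` and `evaluate` interfaces the paper does not spell out (`evaluate` must be deterministic on the fixed validation tasks for the guarantee to hold):

```python
def evolve(library, verifier, propose, evaluate, patience=10):
    """Accept a candidate (library, verifier) pair only if it strictly
    improves the frozen validation objective. The accepted-score sequence
    is then monotone increasing and halts at a local optimum of the
    proposal operator."""
    best = evaluate(library, verifier)   # e.g. success rate on held-out tasks
    fails = 0
    while fails < patience:
        cand_lib, cand_ver = propose(library, verifier)
        score = evaluate(cand_lib, cand_ver)
        if score > best:                 # strict improvement required
            library, verifier, best = cand_lib, cand_ver, score
            fails = 0
        else:
            fails += 1                   # stop after `patience` rejections
    return library, verifier, best
```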
Circularity Check
No circularity: theoretical claims rest on external assumptions and independent benchmarks
full rationale
The paper states a theoretical analysis deriving sublinear cost under the sparse mismatch assumption and monotonic improvement of the evolutionary update rule on a validation objective. These are presented as conditional results rather than tautological identities. Empirical performance gains are measured against external baselines on MuLocbench and ALFWorld, not derived from the system's fitted outputs. No equations, self-citations, or definitions in the provided abstract reduce a claimed prediction or first-principles result to its own inputs by construction. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- verifier decision thresholds
axioms (1)
- Sparse-mismatch assumption (domain assumption): most visited skill units already match the target task
invented entities (2)
- four-layer skill graph (policies, strategies, procedures, primitives): no independent evidence
- verifier module: no independent evidence
Reference graph
Works this paper leans on
- [1] Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. EvoSkill: Automated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766, 2026. URL https://arxiv.org/abs/2603.02766
- [2] Moumita Asad, Rafed Muhammad Yasir, Armin Geramirad, and Sam Malek. GenLoc: Leveraging large language models for information retrieval-based bug localization. arXiv preprint arXiv:2508.00253, 2025. URL https://arxiv.org/abs/2508.00253
- [3] Minghao Chen, Yihang Li, Yanting Yang, Shiyu Yu, Binbin Lin, and Xiaofei He. AutoManual: Constructing instruction manuals by LLM agents via interactive environmental learning. In Advances in Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2405.16247
- [4] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025. URL https://arxiv.org/abs/2504.19413
- [5] Pengfei Du. Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers. arXiv preprint arXiv:2603.07670, 2026. URL https://arxiv.org/abs/2603.07670
- [6] Kutluhan Erol, James Hendler, and Dana S. Nau. HTN planning: Complexity and expressivity. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1123–1128, 1994.
- [7] Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, and Ningyu Zhang. Memp: Exploring agent procedural memory. arXiv preprint arXiv:2508.06433, 2025. URL https://arxiv.org/abs/2508.06433
- [8] Erann Gat. On three-layer architectures. In David Kortenkamp, R. Peter Bonasso, and Robin Murphy, editors, Artificial Intelligence and Mobile Robots, pages 195–210. AAAI Press, 1998.
- [9] Anthony Z. Liu, Jongwook Choi, Sungryull Sohn, Yao Fu, Jaekyeom Kim, Dong-Ki Kim, Xinhe Wang, Jaewon Yoo, and Honglak Lee. SkillAct: Using skill abstractions improves LLM agents. In ICML 2024 Workshop on LLMs and Cognition, 2024. URL https://openreview.net/forum?id=6LG3cIRrF4
- [10] Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, and Shiyu Chang. How well do agentic skills work in the wild: Benchmarking LLM skill usage in realistic settings. arXiv preprint arXiv:2604.04323, 2026. URL https://arxiv.org/abs/2604.04323
- [11] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems, 2023.
- [12] Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, and Jun Wang. ProcMEM: Learning reusable procedural memory from experience via non-parametric PPO for LLM agents. arXiv preprint arXiv:2602.01869, 2026. URL https://arxiv.org/abs/2602.01869
- [13] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978. doi: 10.1007/BF01588971
- [15] Mohammad Masudur Rahman and Chanchal K. Roy. Improving IR-based bug localization with context-aware query reformulation. In ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 495–506, 2018. doi: 10.1145/3236024.3236064
- [16] Ripon K. Saha, Matthew Lease, Sarfraz Khurshid, and Dewayne E. Perry. Improving bug localization using structured information retrieval. In IEEE/ACM International Conference on Automated Software Engineering, pages 345–355, 2013. doi: 10.1109/ASE.2013.6693093
- [17] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2302.04761
- [18] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2303.11366
- [19] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021.
- [20] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999. doi: 10.1016/S0004-3702(99)00052-1
- [21] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 3540–3549, 2017. URL https://proceedings.mlr.press/v70/vezhnevets17a.html
- [22] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024. URL https://arxiv.org/abs/2305.16291
- [23] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. In International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2402.01030
- [24] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024. URL https://arxiv.org/abs/2407.01489
- [25] Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430, 2026. URL https://arxiv.org/abs/2602.12430
- [26] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025. URL https://arxiv.org/abs/2502.12110
- [27] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2405.15793
- [28] Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. AutoSkill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026. URL https://arxiv.org/abs/2603.01145
- [29] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X
- [30] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Conference on Empirical Methods in Natural Language Processing, 2023.
- [32] Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue Liu, Xiaoxiao Li, and Philip S. Yu. EvoSkills: Self-evolving agent skills via co-evolutionary verification. arXiv preprint arXiv:2604.01687, 2026. URL https://arxiv.org/abs/2604.01687
- [33] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program improvement. In ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024. URL https://arxiv.org/abs/2404.05427
- [34] Zejun Zhang, Jian Wang, Qingyun Yang, Yifan Pan, Yi Tang, Yi Li, Zhenchang Xing, Tian Zhang, Xuandong Li, and Guoan Zhang. A benchmark for localizing code and non-code issues in software projects. arXiv preprint arXiv:2509.25242, 2025. doi: 10.48550/arXiv.2509.25242
- [35] Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501, 2024. URL https://arxiv.org/abs/2404.13501
- [36] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024. doi: 10.1609/aaai.v38i17.29936
- [37] YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuan Zhu, Baohua Dong, and Hangcheng Zhu. SkillRouter: Retrieve-and-rerank skill selection for LLM agents at scale. arXiv preprint arXiv:2603.22455, 2026. URL https://arxiv.org/abs/2603.22455
- [38] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. arXiv preprint arXiv:2305.10250, 2023. URL https://arxiv.org/abs/2305.10250
- [39] Jian Zhou, Hongyu Zhang, and David Lo. BugLocator: Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports. In International Conference on Software Engineering, pages 14–24, 2012. doi: 10.1109/ICSE.2012.6227210
Worked example (ALFWorld)
Retrieved skill steps:
- Search likely locations such as the countertop, cabinet, and drawer
- Take the apple once it is found
- Go to the refrigerator
- Open the refrigerator
- Put the apple in the refrigerator
- Close the refrigerator if the environment requires it
- Open the refrigerator again and take the cooled apple
- Go to the dining table
- Put the cooled apple on the dining table
Gap analysis
Successful transfer. The retrieved skill correctly provides the high-level structure for object-state-and-placement tasks: find the object, transform its state, and place it at the requested destination.
Local mismatch. The retrieved procedure contains a cleaning operation, but the current task require...