pith. machine review for the scientific record.

arxiv: 2605.08386 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: no theorem link

SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

Bowen Zhu, Hasibul Haque, Liang Zhao, Yongliang Miao, Ziyang Yu


Pith reviewed 2026-05-12 01:08 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · skill reuse · hierarchical skills · multi-granularity · cost-efficient adaptation · skill evolution

The pith

SkillLens organizes skills into a four-layer graph so LLM agents can reuse matching parts at any detail level and rewrite only what mismatches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing skill libraries for LLM agents force a choice between coarse blocks that add noisy context and the high expense of rewriting everything from scratch. SkillLens resolves this by building a four-layer graph, with policies at the top and strategies, procedures, and primitives beneath, then retrieving seeds by semantic match, expanding them with a degree-corrected random walk, and letting a verifier accept, decompose, rewrite, or skip each piece. The result is direct reuse of compatible subskills while changing only the locally mismatched ones. Theory shows the cost stays sublinear when mismatches are sparse, and that repeated evolution steadily raises the validation score until it reaches a local peak. On MuLocbench and ALFWorld the method lifts accuracy and success rates over flat-skill baselines.

Core claim

SkillLens builds a four-layer skill graph of policies, strategies, procedures, and primitives. Given a task, it retrieves relevant seeds, expands them via a degree-corrected random walk, and applies a verifier that decides whether each visited unit should be accepted, decomposed, rewritten, or skipped. This mixed-granularity process reuses compatible subskills while adapting only locally mismatched components, incurring sublinear cost under sparse-mismatch assumptions and monotonically improving the validation objective through evolutionary refinement of the skills and verifier.
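The four-layer structure can be pictured with a small sketch. The layer names come from the paper; the node fields, edge representation, and traversal helper below are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

# Layers ordered coarse -> fine, as named in the paper.
LAYERS = ("policy", "strategy", "procedure", "primitive")

@dataclass
class SkillNode:
    node_id: str
    layer: str                                     # one of LAYERS
    text: str                                      # skill content / prompt block
    children: list = field(default_factory=list)   # edges to the next finer layer

    def __post_init__(self):
        assert self.layer in LAYERS

def descendants(node):
    """All nodes reachable below `node`, depth-first."""
    out = []
    for child in node.children:
        out.append(child)
        out.extend(descendants(child))
    return out

# Example: one policy -> strategy -> procedure -> primitive chain.
prim = SkillNode("p1", "primitive", "open(refrigerator)")
proc = SkillNode("pr1", "procedure", "retrieve object from container", [prim])
strat = SkillNode("s1", "strategy", "search likely locations, then act", [proc])
pol = SkillNode("top", "policy", "object-state-and-placement tasks", [strat])

assert [n.layer for n in descendants(pol)] == ["strategy", "procedure", "primitive"]
```

Mixed-granularity retrieval then means a seed can enter at any of the four layers, and expansion walks edges both across and down this hierarchy.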

What carries the argument

Four-layer hierarchical skill graph with degree-corrected random-walk expansion and a verifier that routes each unit to accept, decompose, rewrite, or skip.
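The expand-and-route loop can be sketched as follows. The degree-corrected step and the rule-based verifier here are stand-in assumptions: the paper's walk runs over the skill graph and its verifier is a learned component, neither of which is specified in the abstract.

```python
import random

def degree_corrected_step(graph, node, alpha=0.5, rng=random):
    """Move to a neighbor with probability proportional to degree**(-alpha),
    one common degree correction that damps the pull of hub nodes."""
    nbrs = graph[node]
    if not nbrs:
        return None
    weights = [max(len(graph[n]), 1) ** -alpha for n in nbrs]
    return rng.choices(nbrs, weights=weights, k=1)[0]

ACTIONS = ("accept", "decompose", "rewrite", "skip")

def route(verifier, unit, task):
    """The verifier maps each visited unit to one of the four actions."""
    action = verifier(unit, task)
    assert action in ACTIONS
    return action

# Toy usage: a three-node graph and a membership-based verifier.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]}
toy_verifier = lambda unit, task: "accept" if unit in task else "rewrite"
assert degree_corrected_step(graph, "A", rng=random.Random(0)) in {"B", "C"}
assert route(toy_verifier, "open fridge", ["open fridge", "take apple"]) == "accept"
```

The design point the paper leans on is that the verifier runs per visited unit, so its accuracy bounds how much of the walk's frontier is safely reusable.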

If this is right

  • Bug-localization Acc@1 rises by as much as 6.31 percentage points over strong skill baselines.
  • Agent success rate on ALFWorld climbs from 45.00 percent to 51.31 percent.
  • Adaptation cost remains sublinear whenever mismatches stay sparse.
  • Evolutionary updates keep raising the validation objective until a local optimum is reached.
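The cost claim in the bullets above has a simple intuition: every visited unit pays a cheap verifier check, but only mismatched units pay an expensive rewrite. A toy cost model (assumed for illustration, not taken from the paper) makes the break-even visible.

```python
def adaptation_cost(n_units, mismatch_rate, c_verify=1.0, c_rewrite=20.0):
    """Toy model: all visited units are verified; only mismatched ones
    are rewritten. Cost constants are arbitrary illustrative values."""
    mismatched = n_units * mismatch_rate
    return c_verify * n_units + c_rewrite * mismatched

def full_rewrite_cost(n_units, c_rewrite=20.0):
    """Baseline that rewrites every unit without checking."""
    return c_rewrite * n_units

# Sparse mismatch (5%): local adaptation is far cheaper than full rewriting.
assert adaptation_cost(100, 0.05) < full_rewrite_cost(100)

# Dense mismatch (95%): verifier overhead erases the advantage entirely.
assert adaptation_cost(100, 0.95) >= full_rewrite_cost(100)
```

Under this model the advantage scales with (1 - mismatch rate), which is why the sparse-mismatch premise is load-bearing: once most units mismatch, the per-unit verification is pure overhead.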

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mixed-granularity routing could be applied to any procedural knowledge base where partial reuse is cheaper than wholesale replacement.
  • If the sparse-mismatch regime holds across many domains, skill libraries could grow much larger without proportional increases in per-task cost.
  • Verifier accuracy becomes the practical bottleneck once mismatch density rises; measuring its error rate on held-out tasks would quantify the safe operating range.

Load-bearing premise

Mismatches between stored skills and new tasks are sparse, so most components can be reused without dense rewriting.

What would settle it

Run SkillLens on tasks engineered so that nearly every retrieved skill component mismatches; if total cost then exceeds the cost of full rewriting while performance stays flat or drops, the sparse-mismatch premise fails.

Figures

Figures reproduced from arXiv: 2605.08386 by Bowen Zhu, Hasibul Haque, Liang Zhao, Yongliang Miao, Ziyang Yu.

Figure 1. Evolution computation cost of three re-writing… (full image at source)
Figure 2. Computation cost under different initial skill ratios, ranging from 25% to 100%; axes plot skill ratio (0.25 to 1.00) against total time (min), retrieval time (min), and tokens (M, ×50). (full image at source)
Original abstract

Skill libraries have become a practical way for LLM agents to reuse procedural experience across tasks. However, existing systems typically treat skills as flat, single-resolution prompt blocks. This creates a tension between relevance and cost: injecting coarse skills can introduce irrelevant or misleading context, while rewriting entire skills is expensive and often unnecessary. We propose SkillLens, a hierarchical skill-evolution framework that organizes skills into a four-layer graph of policies, strategies, procedures, and primitives, and retrieves them at mixed granularity. Given a task, SkillLens first retrieves semantically relevant skill seeds, expands them through degree-corrected random walk over the skill graph, and then uses a verifier to decide whether each visited unit should be accepted, decomposed, rewritten, or skipped. This enables the agent to reuse compatible subskills directly while adapting only locally mismatched components. To improve the system over time, SkillLens further refines multi-granularity skills and verifier in order to improve its routing decisions. We provide theoretical analysis showing that mixed-granularity adaptation incurs sublinear cost under sparse mismatch assumptions and that the evolutionary update rule monotonically improves the validation objective until a local optimum. Across MuLocbench and ALFWorld, SkillLens consistently improves over strong skill-based baselines, achieving up to a 6.31 percentage-point Acc@1 gain for bug localization and raising agent success rate from 45.00% to 51.31%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SkillLens, a hierarchical skill-evolution framework for LLM agents. Skills are organized into a four-layer graph (policies, strategies, procedures, primitives). Given a task, the system retrieves semantically relevant seeds, expands them via degree-corrected random walk, and applies a verifier to decide acceptance, decomposition, rewriting, or skipping of components. It further employs an evolutionary update to refine skills and the verifier over time. The paper claims that mixed-granularity adaptation incurs sublinear cost under sparse mismatch assumptions, that the evolutionary rule monotonically improves the validation objective until a local optimum, and that the approach yields empirical gains (up to 6.31 pp Acc@1 on bug localization; agent success rate rising from 45.00% to 51.31%) over skill-based baselines on MuLocbench and ALFWorld.

Significance. If the sparse-mismatch assumption and verifier reliability hold, SkillLens would provide a principled mechanism for cost-efficient, local skill adaptation in LLM agents, reducing the need for full rewrites while maintaining relevance. The empirical gains and the combination of hierarchical retrieval with evolutionary refinement could influence the design of scalable long-term memory systems for agents.

major comments (2)
  1. [Abstract / Theoretical analysis] Abstract and theoretical analysis section: The sublinear cost guarantee is load-bearing for the cost-efficiency claim and rests on the sparse mismatch assumption (most skill components already match the new task). No measurements of mismatch density (fraction of visited units requiring decomposition/rewrite/skip) are reported on MuLocbench or ALFWorld. Without such grounding, the degree-corrected random walk plus per-unit verifier overhead may exceed the cost of full rewriting when mismatches are dense, directly undermining the reported success-rate improvements.
  2. [Theoretical analysis] Theoretical analysis section: The claim that the evolutionary update rule monotonically improves the validation objective is asserted without a definition of the validation objective, details on verifier training, or demonstration that the objective is independent of the system's own outputs. This leaves the monotonicity result vulnerable to circularity.
minor comments (2)
  1. [Abstract] Abstract: The four layers are named but their precise definitions and inter-layer relations are not summarized; a one-sentence clarification would improve readability.
  2. [Experiments] Experiments section: The description of baselines and whether they receive equivalent total token budgets or verifier calls is unclear; explicit resource-matched comparisons would strengthen the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Theoretical analysis] Abstract and theoretical analysis section: The sublinear cost guarantee is load-bearing for the cost-efficiency claim and rests on the sparse mismatch assumption (most skill components already match the new task). No measurements of mismatch density (fraction of visited units requiring decomposition/rewrite/skip) are reported on MuLocbench or ALFWorld. Without such grounding, the degree-corrected random walk plus per-unit verifier overhead may exceed the cost of full rewriting when mismatches are dense, directly undermining the reported success-rate improvements.

    Authors: We agree that empirical measurements of mismatch density would provide stronger grounding for the sublinear cost claim. The manuscript derives the sublinear cost result theoretically under the sparse-mismatch assumption but does not report the observed fraction of units requiring adaptation. In the revision we will add a dedicated analysis subsection that computes and reports mismatch densities (fraction of visited units triggering decomposition/rewrite/skip) directly from the experimental logs on both MuLocbench and ALFWorld. We will also include a brief discussion of cost behavior under denser mismatch regimes. revision: yes
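The promised measurement is mechanically simple, which is one reason the referee's request is reasonable. A sketch of computing mismatch density from execution logs, with an assumed record shape of `(task_id, unit_id, action)`:

```python
from collections import Counter

# Actions that indicate the stored unit did NOT match the task as-is.
ADAPT_ACTIONS = {"decompose", "rewrite", "skip"}

def mismatch_density(records):
    """Fraction of visited units that required any adaptation action.
    `records` is an iterable of (task_id, unit_id, action) tuples."""
    actions = Counter(action for _, _, action in records)
    visited = sum(actions.values())
    adapted = sum(actions[a] for a in ADAPT_ACTIONS)
    return adapted / visited if visited else 0.0

# Toy log: two of four visited units needed adaptation.
log = [
    ("t1", "u1", "accept"), ("t1", "u2", "rewrite"),
    ("t1", "u3", "accept"), ("t2", "u4", "skip"),
]
assert mismatch_density(log) == 0.5
```

Reporting this number per benchmark would directly test whether the observed regime sits inside the sparse-mismatch assumption the cost theorem requires.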

  2. Referee: [Theoretical analysis] Theoretical analysis section: The claim that the evolutionary update rule monotonically improves the validation objective is asserted without a definition of the validation objective, details on verifier training, or demonstration that the objective is independent of the system's own outputs. This leaves the monotonicity result vulnerable to circularity.

    Authors: We thank the referee for highlighting the need for greater precision in the theoretical section. The current manuscript states the monotonicity result but does not supply an explicit definition of the validation objective or verifier training details. In the revision we will (1) define the validation objective as the expected success rate on a fixed, held-out validation task set that is independent of the current skill library and verifier outputs, (2) describe the verifier training procedure (periodic supervised updates on oracle-labeled adaptation decisions collected from prior executions), and (3) expand the proof sketch to show that each evolutionary step selects refinements that strictly increase this externally defined objective until a local optimum is reached, thereby eliminating any circularity. revision: yes
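The monotonicity argument the rebuttal sketches is essentially a hill-climbing acceptance rule: a refinement is kept only if it strictly improves an externally defined validation score. A minimal sketch, assuming candidate (library, verifier) pairs and a fixed held-out scoring function:

```python
def evolve(library, verifier, candidates, val_score):
    """Greedy evolutionary step: accept a candidate refinement only if it
    strictly improves the held-out validation score. Monotonicity holds by
    construction, provided val_score is independent of the system's outputs."""
    best = val_score(library, verifier)
    for cand_lib, cand_ver in candidates:
        score = val_score(cand_lib, cand_ver)
        if score > best:
            library, verifier, best = cand_lib, cand_ver, score
    return library, verifier, best

# Toy usage with a lookup-table score over hypothetical configurations.
scores = {("a", "v"): 0.4, ("b", "v"): 0.5, ("b", "w"): 0.45}
lib, ver, s = evolve("a", "v", [("b", "v"), ("b", "w")], lambda l, v: scores[(l, v)])
assert (lib, ver, s) == ("b", "v", 0.5)
```

This also makes the referee's circularity concern concrete: if `val_score` were computed on tasks labeled by the system's own verifier, the "improvement" could be self-confirming, which is exactly what the proposed held-out definition rules out.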

Circularity Check

0 steps flagged

No circularity: theoretical claims rest on external assumptions and independent benchmarks

full rationale

The paper states a theoretical analysis deriving sublinear cost under the sparse mismatch assumption and monotonic improvement of the evolutionary update rule on a validation objective. These are presented as conditional results rather than tautological identities. Empirical performance gains are measured against external baselines on MuLocbench and ALFWorld, not derived from the system's fitted outputs. No equations, self-citations, or definitions in the provided abstract reduce a claimed prediction or first-principles result to its own inputs by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The framework rests on a new four-layer skill graph, a verifier component, and the sparse-mismatch domain assumption; no free parameters are explicitly named but thresholds inside the verifier and walk are implied.

free parameters (1)
  • verifier decision thresholds
    Used to choose accept, decompose, rewrite, or skip; values not reported in abstract.
axioms (1)
  • domain assumption Sparse mismatch assumption: most visited skill units already match the target task
    Invoked to guarantee sublinear adaptation cost.
invented entities (2)
  • four-layer skill graph (policies, strategies, procedures, primitives) no independent evidence
    purpose: Organize skills for mixed-granularity retrieval
    New hierarchical structure introduced by the paper
  • verifier module no independent evidence
    purpose: Decide per-unit action during expansion
    New learned component for local adaptation

pith-pipeline@v0.9.0 · 5559 in / 1580 out tokens · 49883 ms · 2026-05-12T01:08:23.816011+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 13 internal anchors
