{"total":12,"items":[{"citing_arxiv_id":"2606.00510","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning","primary_cat":"cs.CL","submitted_at":"2026-05-30T04:00:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23899","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills","primary_cat":"cs.AI","submitted_at":"2026-05-22T17:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18401","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution","primary_cat":"cs.CL","submitted_at":"2026-05-18T13:44:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SkillsVote is a governance system for agent skills that profiles corpora, recommends via search, and gates updates on successful reusable outcomes, yielding benchmark gains without model 
changes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[7] Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, and Hai Zhao. Remember me, refine me: A dynamic procedural memory framework for experience-driven agent evolution.arXiv preprint arXiv:2512.10696, 2025. [8] Le Chen, Erhu Feng, Yubin Xia, and Haibo Chen. Skvm: Revisiting language vm for skills across heterogenous llms and harnesses.arXiv preprint arXiv:2604.03088, 2026. [9] Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, et al. Skillcraft: Can llm agents learn to use tools skillfully? arXiv preprint arXiv:2603.00718, 2026. [10] Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al."},{"citing_arxiv_id":"2605.11169","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents","primary_cat":"cs.AI","submitted_at":"2026-05-11T19:28:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09359","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill-R1: Agent Skill Evolution via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-10T06:19:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Skill-R1 
applies bi-level group-relative policy optimization to evolve skills recurrently from verified outcomes, yielding gains over baselines on multi-step tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09315","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation","primary_cat":"cs.AI","submitted_at":"2026-05-10T04:20:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution under GPT-5.1.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LLM agents are rapidly shifting from static, manually engineered systems toward self-evolving entities that continuously refine themselves through interaction with their environments [1]. Recent frameworks enable agents to autonomously optimize their reasoning workflows [ 2-6], construct reusable skills and tools [7-12], update their underlying model parameters [13-15], and accumulate persistent memory [ 14, 16, 17] without human intervention. Together, these advances suggest a compelling long-term vision: agents that can continually expand their competence over their lifetime through continual self-directed adaptation. 
Yet this long-term vision rests on an assumption that has received little systematic scrutiny: when an agent adapts to new tasks within its task scope, it retains competence on tasks it has already mastered."},{"citing_arxiv_id":"2605.09192","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evidence Over Plans: Online Trajectory Verification for Skill Distillation","primary_cat":"cs.AI","submitted_at":"2026-05-09T22:15:13+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"learning tool policies to structuring skills. For instance, Toolformer [13] and ToolLLM [11] focus on learning to invoke tools. Meanwhile, CREATOR [10] and LATM [2] show that stronger models can synthesize reusable artifacts for weaker ones. Skill-oriented frameworks go further by packaging capabilities into modular, self-contained units with clear invocation protocols. SkillCraft [3] studies tool-usage skills, SkillNet [6] organizes skills as reusable units, and SkillX [17] automatically constructs multi-level skill knowledge bases. In our study, SPARK highlights that the transferred artifact is a natural-language SKILL.md document rather than the executable code. 
3 Method We study whether distilled skills are grounded in environment-verified evidence or merely reflect"},{"citing_arxiv_id":"2605.08468","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T20:39:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Table 7 states what the current evidence supports. The strongest empirical claim is limited to the measured hard RL setting. The current evidence does not isolate each PYTHALAB-MERA component because no full ablation separating episodic memory, LinUCB retrieval selection, TD credit, skill reuse, and optional decoding control was completed. Prior work motivates such ablations [16,17,18,19], but literature does not replace project-level ablation evidence. PYTHALAB-MERA: Memory and Acceptance Control 21 Table 7. Claim gate for the current manuscript. Claim Current status and allowed wording Local validation-conditioned workflow is implemented. Supported by source, configuration, compile check, and partial tests. 
Allowed wording: the artifact"},{"citing_arxiv_id":"2605.07358","ref_index":45,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications","primary_cat":"cs.IR","submitted_at":"2026-05-08T07:10:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"Reflexion [19], ExpeL [23], BoT [24], ReasoningBank [25], AWM [26], Trace2Skill [27], SayCan [28], DEPS [29], Generative Agents [30], GITM [31], RAP [32], Retroformer [33], MemGPT [34], TiM [35], Self-Discover [36], TextGrad [37], FINCON [38], M+ [39], Learned Memory Bank [40], Nemori [41], Intrinsic Memory [42], SkillForge [43] Code-Backed Voyager [12], SkillCraft [44], PolySkill [45], ASI [46], CUA-Skill [47], MetaGPT [6], Eureka [48], DS-Agent [49], LDB [50], CodeAct [51], SWE-agent [52], ToolCoder [53], PSN [54] Hybrid-Based JARVIS-1 [55], Synapse [56], SkillWeaver [57], AgentSkillOS [58], TPTU [59], talker-reasoner [60], DAMCS [61], GraphSkill [62], Alita [63] Skill Acquisition (§IV) Human-Derived SkillNet [64], AgentSkillOS [58], Agentic Skills [65], SkillOS [66], Agent Hospital [67]"},{"citing_arxiv_id":"2605.07339","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tools as Continuous Flow for Evolving Agentic Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-08T06:44:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlowAgent models tool chaining as continuous latent trajectory generation with conditional flow matching to deliver global planning, formal utility bounds, and better robustness on long-horizon tasks, plus a new plan-level 
benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"accumulation in long-horizon reasoning tasks. • Empirical Superiority: Extensive experiments demonstrate that FlowAgent significantly outperforms baseline methods, exhibiting superior robustness in long-horizon tasks and adaptability to unseen toolsets. 2 Related Work LLM-based Tool Reasoning. The efficacy of LLM agents relies heavily on their ability to interact with external tools [19, 20]. Early paradigms attempted to mitigate this through explicit prompt 2 (a) Input (b) Conditional Flow Planner (c) Constructing Plan Supervision (d) Discrete Decoding Execution (e) Training Objectives Context State Initial User Query <User>: Is flying from ORD to LAX cheaper than flying to SFO?"},{"citing_arxiv_id":"2604.17503","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology","primary_cat":"cs.AI","submitted_at":"2026-04-19T15:46:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"evolution of agentic skills through reinforcement learning and closed-loop 
analysis [1,37,41,43,51]. Concurrently, frequent skill iteration demands auditable verification to ensure lifecycle safety [17,18]. As skill banks expand, researchers have introduced ontological networks, complete routing mechanisms, and advanced compositional benchmarks [6,27,54]. Despite this rapid progress, existing frameworks still treat the skill bank and the collaboration structure of multi-agent systems as largely decoupled. Skills are updated in response to task outcomes, but no signal is propagated to restructure how agents collaborate, and visual features play no role in skill retrieval or evolution. SkillGraph addresses both gaps"},{"citing_arxiv_id":"2604.17308","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2026-04-19T07:51:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"error correction in tool calling [15], as well as the inefficiency patterns that emerge during complex tool-integrated reasoning [32]. Other work studies whether explicit skill usage improves performance in realistic software engineering settings [13]. Recent coding-agent benchmarks emphasize realistic, long-horizon tasks under shared Harbor-based execution setups for reproducibility and comparability [6, 7, 9, 22, 36]. 
4.2 Skills as Procedural Knowledge for Agents Recent studies treat skills as reusable procedural knowledge bridging models and workflows, including large-scale skill management, skill-aware benchmarking, and trajectory distillation into reusable skills [20, 24, 29]. However, these works mainly focus on infrastructure or downstream performance,"}],"limit":50,"offset":0}