{"total":12,"items":[{"citing_arxiv_id":"2606.01139","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision","primary_cat":"cs.AI","submitted_at":"2026-05-31T10:19:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkillRevise iteratively refines initial LLM-generated agent skills using execution traces to diagnose defects and apply repairs, raising success rates from 36.05% to 61.63% on SkillsBench across three benchmarks and five LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27492","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems","primary_cat":"cs.SE","submitted_at":"2026-05-26T16:28:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAMP evaluates 15 models on production-like serial workflows and reports completion rates collapsing from 100% to 20% with none finishing the full pipeline and costs varying by three orders of magnitude.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23899","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills","primary_cat":"cs.AI","submitted_at":"2026-05-22T17:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A systematic study across five domains finds model-generated skills yield average gains but non-uniform negative transfer, with a meta-skill improving extraction 
quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20456","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development","primary_cat":"cs.SE","submitted_at":"2026-05-19T20:10:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Agentic Agile-V uses Agile-V as backbone and a Specify-Constrain-Orchestrate-Prove-Evolve-Verify loop to convert AI agent conversations into traceable engineering artifacts with acceptance evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11946","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Counterfactual Trace Auditing of LLM Agent Skills","primary_cat":"cs.AI","submitted_at":"2026-05-12T10:56:18+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"skills can change how an agent searches, edits, tests, and reasons. Yet the evaluation toolkit for skills remains narrow. General coding agents are usually evaluated through aggregate task success [1, 2]. Skill evaluations often reduce the comparison to one scalar: the change in unit test pass rate ∆P between the with skill and without skill conditions on the same task. SWE-Skills-Bench [3], the only public benchmark we are aware of that releases paired traces for this setting, reports results in this form and also informally lists selected failure modes (surface anchoring, hallucination, concept bleed) when pass rate alone is not explanatory. 
This setup is useful, but it treats a skill as a black box intervention and discards most of the behavioral evidence in the trace."},{"citing_arxiv_id":"2605.11665","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Nautilus: From One Prompt to Plug-and-Play Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T07:26:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026. [94] Tingxu Han, Yi Zhang, Wei Song, Chunrong Fang, Zhenyu Chen, Youcheng Sun, and Lijie Hu. Swe-skills-bench: Do agent skills actually help in real-world software engineering?, 2026. URL https://arxiv.org/abs/2603.15401. [95] Yixin Lin, Austin S. Wang, Eric Undersander, and Akshara Rai. Polymetis: a real-time PyTorch controller manager for robotics. https://facebookresearch.github.io/fairo/polymetis/, 2021. [96] Sabela Ramos, Sertan Girgin, Léonard Hussenot, Manu Orsini, Piotr Stanczyk, Olivier Pietquin, Matthieu Geist, and Olivier Bachem. 
RLDS: an ecosystem to generate, share and use datasets in"},{"citing_arxiv_id":"2605.10990","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries","primary_cat":"cs.SE","submitted_at":"2026-05-09T11:41:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Probing every observed value creates false alarms, while probing only declared manifests misses assumptions expressed in prose. SKILLGUARD addresses this gap by validating role-bearing environment contracts rather than all observed values. Agent and software-maintenance benchmarks. Agent benchmarks such as AgentBench [17], SWE-bench [14], and SWE-Skills-Bench [12] evaluate task-solving, software engineering, and skill quality. These benchmarks are valuable, but they do not isolate the maintenance failure studied here: a skill that was once valid can degrade when the external environment changes. 
DRIFTBENCH complements them by pairing previously valid skills with generated drifts, LLM-free real-world drifts,"},{"citing_arxiv_id":"2605.05726","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-07T06:18:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26278","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks","primary_cat":"cs.NI","submitted_at":"2026-04-29T04:20:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SWE-Bench 5G is the first benchmark for AI agents fixing bugs in 5G core network software, showing high diagnosis rates but low resolution that improves conditionally with specification context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17308","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2026-04-19T07:51:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill 
usage.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A growing line of work evaluates agent capabilities in controlled environments with tool use and multi-step reasoning [21, 27, 33]. Some benchmarks further focus on reliability aspects such as error correction in tool calling [15], as well as the inefficiency patterns that emerge during complex tool-integrated reasoning [32]. Other work studies whether explicit skill usage improves performance in realistic software engineering settings [13]. Recent coding-agent benchmarks emphasize realistic, long-horizon tasks under shared Harbor-based execution setups for reproducibility and comparability [6, 7, 9, 22, 36]. 4.2 Skills as Procedural Knowledge for Agents Recent studies treat skills as reusable procedural knowledge bridging models and workflows, including large-scale skill management, skill-aware benchmarking, and trajectory distillation into reusable"},{"citing_arxiv_id":"2604.09297","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering","primary_cat":"cs.SE","submitted_at":"2026-04-10T13:08:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillMOO applies LLM-proposed edits and NSGA-II Pareto optimization to skill bundles for SE agents, ranking top in pass rate on most SkillsBench tasks while cutting costs up to 31.7%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03088","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkVM: Revisiting Language VM for Skills across Heterogenous LLMs and 
Harnesses","primary_cat":"cs.SE","submitted_at":"2026-04-03T15:11:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkVM uses capability profiling and compiler-style techniques to make skills portable across LLMs and harnesses, raising task completion rates while cutting token use by up to 40% and delivering up to 3.2x speedup.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Through SkVM, skills are compiled into forms tailored to different models, allowing each model to better understand and execute skills. Furthermore, SkVM provides a unified runtime environment that systematically manages skill loading, parsing, and concurrent execution. We structure SkVM around classical compilation techniques: interpreted execution [25], ahead-of-time (AOT) compilation [48, 49], and just-in-time (JIT) optimization [3, 12, 15, 24, 52]. Currently, agents handle skills using only interpreted execution, feeding raw skill text directly to the model. This approach hampers skill portability and execution efficiency. In contrast, SkVM further applies AOT and JIT compilation to optimize"}],"limit":50,"offset":0}