Swe-skills-bench: Do agent skills actually help in real-world software engineering?
8 Pith papers cite this work. Polarity classification is still indexing.
[Charts: citations by year (2026: 8); citation-role summary (background: 1); citation-polarity summary (background: 1).]
Citing papers explorer
-
Counterfactual Trace Auditing of LLM Agent Skills
The CTA framework detects 522 skill-influence patterns in LLM agent traces across 49 tasks where the average pass rate shifts only +0.3%, exposing behavioral effects, such as template copying and excess planning, that pass-rate evaluation misses.
-
Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.
-
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
-
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
-
SkVM: Revisiting Language VM for Skills across Heterogeneous LLMs and Harnesses
SkVM uses capability profiling and compiler-style techniques to make skills portable across LLMs and harnesses, raising task completion rates while cutting token use by up to 40% and delivering up to 3.2x speedup.
-
SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks
SWE-Bench 5G is the first benchmark for AI agents fixing bugs in 5G core network software, showing high diagnosis rates but low resolution rates, which improve when specification context is provided.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering
SkillMOO automatically evolves skill bundles for LLM coding agents via LLM-proposed edits and NSGA-II, achieving up to 131% higher pass rates and 32% lower costs on three SkillsBench tasks.
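For readers unfamiliar with the metric behind the SkillRet result, NDCG@10 scores a ranked retrieval list by discounting relevant hits logarithmically with rank and normalizing against the ideal ordering. This is a minimal standard-definition sketch, not SkillRet's evaluation code; the example relevance list is invented for illustration:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal reordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A retriever that places the single relevant skill at rank 3:
print(ndcg_at_k([0, 0, 1, 0, 0], k=10))  # 0.5 = (1/log2(4)) / (1/log2(2))
```

A 13-point NDCG@10 gain thus roughly corresponds to pushing relevant skills several ranks higher on average.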
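The SkillMOO summary reports joint gains in pass rate and cost via NSGA-II, whose core idea is selecting the Pareto-non-dominated set under multiple objectives. Below is a minimal sketch of that dominance check for two hypothetical objectives (pass rate up, cost down); the `Candidate` type and the example bundle values are assumptions for illustration, not SkillMOO's actual data:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    pass_rate: float  # higher is better
    cost: float       # lower is better

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    return (a.pass_rate >= b.pass_rate and a.cost <= b.cost
            and (a.pass_rate > b.pass_rate or a.cost < b.cost))

def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

bundles = [Candidate(0.62, 1.30), Candidate(0.55, 0.90), Candidate(0.50, 1.10)]
# Candidate(0.50, 1.10) is dominated by Candidate(0.55, 0.90); the other two survive.
print(pareto_front(bundles))
```

NSGA-II layers fast non-dominated sorting and crowding-distance selection on top of this check to evolve the population across generations.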