SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents
Pith reviewed 2026-05-12 01:45 UTC · model grok-4.3
The pith
SkCC compiles LLM agent skills into a portable intermediate representation that works across frameworks with built-in security.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that introducing classical compilation techniques, centered on a strongly-typed IR called SkIR, allows skill semantics to be separated from framework-specific formatting. This enables a four-phase pipeline that reduces adaptation complexity from O(m × n) to O(m + n) across m skills and n frameworks, while a static Optimizer blocks security vulnerabilities before deployment.
What carries the argument
SkIR, the strongly-typed intermediate representation that captures the full semantics of skills independently of framework-specific prompt formats.
If this is right
- Adaptation effort scales linearly with the number of skills and frameworks rather than quadratically.
- Security vulnerabilities are detected and blocked proactively with a 94.8% trigger rate.
- Pass rates improve consistently across tested frameworks such as Claude Code and Kimi CLI.
- Compilation latency stays under 10 ms, with runtime token reductions of 10-46%.
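The proactive blocking described in the second bullet is, in compiler terms, a static analysis pass over the IR. The paper's Optimizer rules are not given in the excerpt; a minimal hypothetical check might reject skills whose steps exercise capabilities they never declared:

```python
def undeclared_capabilities(declared: set[str], used: list[str]) -> set[str]:
    """Static pass: capabilities a skill's steps exercise but never declare."""
    return {c for c in used if c not in declared}

# a skill declaring only filesystem reads, but whose steps hit the network
violations = undeclared_capabilities({"read_fs"}, ["read_fs", "network"])
assert violations == {"network"}  # such a skill would be blocked before deployment
```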
Where Pith is reading between the lines
- Skill authors could create once and reach all frameworks without additional work.
- It opens the possibility for a unified skill ecosystem or marketplace.
- Similar compilation approaches might address portability in other prompt engineering domains.
Load-bearing premise
One strongly-typed IR is sufficient to represent the semantics of any Markdown-written skill without losing necessary details or expressiveness.
What would settle it
A test where SkCC-compiled skills perform no better than originals or still need per-framework tweaks on new frameworks would disprove the portability benefit.
read the original abstract
LLM agents increasingly rely on reusable skills (e.g., `SKILL.md`) to execute complex tasks, yet these artifacts lack portability: agent frameworks are highly sensitive to prompt formatting, leading to a large performance variation for the same skill. Nevertheless, most skills are authored once as format-agnostic Markdown, necessitating costly per-framework rewrites and also leaving security largely unaddressed, with widespread vulnerabilities in practice. To address this, we present SkCC, a compiler for LLM agents that introduces classical compilation design into agent skill development. SkCC centers on SkIR, a strongly-typed intermediate representation that decouples skill semantics from framework-specific formatting, thus enabling portable deployment across agent frameworks. Atop of this IR, a static Optimizer enforces security constraints, blocking vulnerabilities before deployment. Implemented as a four-phase pipeline, SkCC effectively reduces adaptation complexity from $O(m \times n)$ to $O(m + n)$ across $m$ skills and $n$ frameworks. Experiments on SkillsBench demonstrate that SkCC delivers consistent and substantial gains over original counterparts, with pass rate increases from 21.1% to 33.3% on Claude Code and from 35.1% to 48.7% on Kimi CLI. Further, the design achieves sub-10ms compilation latency, 94.8% proactive security trigger rate, and 10-46% runtime token savings across frameworks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SkCC, a compiler for LLM agent skills that introduces a strongly-typed intermediate representation (SkIR) to decouple skill semantics from framework-specific formatting. It describes a four-phase pipeline that reduces adaptation complexity from O(m × n) to O(m + n) across m skills and n frameworks, includes a static optimizer to enforce security constraints before deployment, and reports experimental results on SkillsBench showing pass-rate gains (21.1% to 33.3% on Claude Code; 35.1% to 48.7% on Kimi CLI), sub-10 ms compilation latency, 94.8% proactive security trigger rate, and 10-46% runtime token savings.
Significance. If the central claims hold, the work would be significant for standardizing skill development in LLM agents, offering a practical path to portability and proactive security via classical compilation techniques. The four-phase pipeline design and cross-framework empirical evaluation on SkillsBench provide concrete evidence of reduced engineering effort and measurable efficiency gains, which could influence how reusable skills are authored and deployed in agent systems.
major comments (2)
- [Abstract and SkIR/pipeline description] The O(m + n) complexity reduction and portability guarantee rest on the assumption that SkIR can represent the complete semantics and intent of arbitrary natural-language Markdown skills without loss or the need for framework-specific extensions. No formal grammar, type definitions, or coverage analysis for edge cases (e.g., implicit context, conditional logic, or ambiguous instructions) is supplied in the SkIR definition or pipeline description, leaving open the possibility that expressiveness gaps would reintroduce per-framework adaptations.
- [Experimental evaluation] The reported performance and security metrics (pass-rate increases, 94.8% trigger rate, token savings) are load-bearing for the practicality claims, yet the experimental section provides no details on SkillsBench composition, data splits, baseline implementations, number of skills/frameworks tested, or statistical tests. This prevents verification that the gains are consistent and attributable to the compiler rather than experimental artifacts.
minor comments (2)
- [Abstract] The abstract introduces the four-phase pipeline but does not name the phases; a one-sentence enumeration would improve immediate clarity without lengthening the abstract.
- [Introduction] Acronyms SkCC and SkIR should be expanded on first use in the main body even if defined in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will strengthen the manuscript while defending the core contributions on the basis of the presented design and results.
read point-by-point responses
Referee: The O(m + n) complexity reduction and portability guarantee rest on the assumption that SkIR can represent the complete semantics and intent of arbitrary natural-language Markdown skills without loss or the need for framework-specific extensions. No formal grammar, type definitions, or coverage analysis for edge cases (e.g., implicit context, conditional logic, or ambiguous instructions) is supplied in the SkIR definition or pipeline description, leaving open the possibility that expressiveness gaps would reintroduce per-framework adaptations.
Authors: We acknowledge that the current manuscript presents SkIR at a descriptive level with illustrative examples rather than a complete formal grammar or exhaustive coverage analysis. The strongly-typed nature of SkIR is intended to capture core semantics (control flow, data dependencies, and security-relevant operations) through its type system, and the four-phase pipeline is designed to preserve these semantics during lowering. The empirical results on SkillsBench demonstrate that the evaluated skills compiled and executed portably without requiring framework-specific extensions, supporting the O(m + n) claim in practice. To address the concern directly, the revised manuscript will include an explicit formal grammar for SkIR, complete type definitions, and a dedicated subsection analyzing coverage of edge cases such as conditional logic and ambiguous instructions, drawing on the skills present in the benchmark. revision: yes
Referee: The reported performance and security metrics (pass-rate increases, 94.8% trigger rate, token savings) are load-bearing for the practicality claims, yet the experimental section provides no details on SkillsBench composition, data splits, baseline implementations, number of skills/frameworks tested, or statistical tests. This prevents verification that the gains are consistent and attributable to the compiler rather than experimental artifacts.
Authors: We agree that the experimental section must be expanded for reproducibility and to allow readers to verify that observed gains are attributable to SkCC. The manuscript currently reports aggregate pass-rate, latency, security-trigger, and token-saving figures but does not detail the benchmark composition, evaluation protocol, baseline implementations, exact numbers of skills and frameworks, or statistical tests. In the revision we will add: (i) a full description of SkillsBench including skill count, framework coverage, and task categories; (ii) the evaluation protocol and any data splits used; (iii) precise baseline implementations; (iv) the number of trials per configuration; and (v) statistical significance tests (e.g., paired t-tests with p-values) confirming that the reported improvements are consistent and not artifacts. revision: yes
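For item (v), one concrete choice the revision could make is an exact sign test over paired per-task outcomes, counting tasks that flip from fail to pass (and vice versa) between baseline and SkCC. A minimal sketch, not the paper's stated protocol:

```python
from math import comb

def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact sign test on discordant task pairs under Binomial(n, 0.5)."""
    n = wins + losses
    k = min(wins, losses)
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)

# e.g. SkCC flips 9 tasks to pass against 1 regression versus the baseline
p = sign_test_p(9, 1)
```

With 9 flips to pass against 1 regression, the two-sided p is about 0.021, below the usual 0.05 threshold; a balanced flip count gives p = 1.0.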
Circularity Check
No circularity: design claim and empirical results are independent
full rationale
The abstract and provided text introduce SkCC as a four-phase compiler pipeline centered on a new strongly-typed SkIR that decouples Markdown skill semantics from framework formatting. The O(m+n) complexity reduction is a direct consequence of successful decoupling (standard compiler benefit) rather than a fitted or self-referential quantity. Reported gains (pass-rate improvements, sub-10ms latency, 94.8% security trigger rate, token savings) are presented as experimental measurements on SkillsBench, not as predictions derived from parameters fitted to the same data or from self-citations. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear in the supplied material that would collapse any central claim back to its inputs by construction. The design's correctness therefore rests on external validation rather than definitional equivalence.
Axiom & Free-Parameter Ledger
invented entities (2)
- SkIR: no independent evidence
- SkCC Optimizer: no independent evidence
Reference graph
Works this paper leans on
- [1] Michael Wooldridge and Nicholas R. Jennings. Intelligent Agents: Theory and Practice. The Knowledge Engineering Review, 10(2):115–152, 1995. doi: 10.1017/S0269888900008122
- [2] Shunyu Yao, Dian Yu, et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS), 2023. doi: 10.48550/arXiv.2305.10601
- [3] Lei Wang, Chen Ma, et al. A Survey on Large Language Model Based Autonomous Agents. Frontiers of Computer Science, 18(6):186345, 2024. doi: 10.1007/s11704-024-40231-1
- [4] Anthropic. Claude Code Overview, 2026. URL https://code.claude.com/docs/en/overview
- [5] OpenAI. Codex Documentation, 2026. URL https://developers.openai.com/codex
- [6] Google. Gemini CLI Documentation, 2026. URL https://google-gemini.github.io/gemini-cli/docs/
- [7] Kimi. Kimi CLI Documentation, 2026. URL https://moonshotai.github.io/kimi-cli/en/guides/getting-started.html
- [8] Agent Skills. SKILL.md Specification and Progressive Disclosure Mechanism, 2026. URL https://deepwiki.com/agentskills/agentskills/2.2-skill.md-specification
- [9] Renjun Xu, Yang Yan, et al. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward, 2026
- [10] Anthropic. Anthropic Skills: Public repository for Agent Skills, 2026. URL https://github.com/anthropics/skills
- [11] Affan-m. Everything Claude Code: The agent harness performance optimization system, 2026. URL https://github.com/affaan-m/everything-claude-code
- [12] getSentry. Sentry Skills: Agent Skills used by the Sentry team for development, 2026. URL https://github.com/getsentry/skills
- [13] Jia He, Mukund Rungta, et al. Does Prompt Formatting Have Any Impact on LLM Performance?, 2024
- [14] Anthropic. Claude API Docs: Prompting Best Practices — Structure Prompts with XML Tags. URL https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices
- [16] OpenAI. Structured Outputs and Format Tax Elimination, 2025. URL https://platform.openai.com/docs/guides/structured-outputs
- [17] Improving Agents. Which Nested Data Format Do LLMs Understand Best? JSON vs. YAML vs. XML vs. Markdown, 2025. URL https://www.improvingagents.com/blog/best-nested-data-format/
- [18] Luca Beurer-Kellner, Alexey Kudrinskii, et al. Snyk Finds Prompt Injection in 36%, 1467 Malicious Payloads in a ToxicSkills Study of Agent Skills Supply Chain Compromise, 2026. URL https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/
- [19] Agent Skills. SKILL.md Explained: How to Structure Your Product for AI Agents — Add Guardrails and Common Pitfalls, 2026. URL https://www.gitbook.com/blog/skill-md
- [20] Anthropic. Model Context Protocol (MCP) Specification, 2025. URL https://modelcontextprotocol.io/docs/
- [21] Alfred V. Aho, Ravi Sethi, et al. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986. ISBN 0-201-10088-6
- [22] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, San Francisco, CA, 1997. ISBN 1-55860-320-4
- [23] John Strong, Joseph Wegstein, et al. The Problem of Programming Communication with Changing Machines: A Proposed Solution. Communications of the ACM, 1(8):12–18, 1958. doi: 10.1145/368892.368915
- [24] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In International Symposium on Code Generation and Optimization (CGO), 2004. doi: 10.1109/CGO.2004.1281665
- [25] Chris Lattner, Mehdi Amini, et al. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In International Symposium on Code Generation and Optimization (CGO), 2021. doi: 10.1109/CGO51591.2021.9370308
- [26] Laszlo Szekeres, Mathias Payer, et al. SoK: Eternal War in Memory. In IEEE Symposium on Security and Privacy (S&P), 2013. doi: 10.1109/SP.2013.13
- [27] Reddit r/ClaudeAI. Anthropic's Official Take on XML-Structured Prompting as the Core Strategy, 2026. URL https://www.reddit.com/r/ClaudeAI/comments/1psxuv7/
- [28] Roy Philip. JSON vs. XML: A Data-Driven Analysis of LLM Parsing Efficiency, 2025. URL https://royphilip.xyz/blog/json-vs-xml-llm-showdown
- [29] Steve Kinney. Prompt Engineering Across the OpenAI, Anthropic, and Gemini APIs, 2026. URL https://stevekinney.com/writing/prompt-engineering-frontier-llms
- [30] Yuanye Liu, Jiahang Xu, et al. Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization, 2025
- [31] Renxi Wang, Xudong Han, et al. ToolGen: Unified Tool Retrieval and Calling via Generation. In International Conference on Learning Representations (ICLR), 2025. doi: 10.48550/arXiv.2410.03439
- [32] Weihang Su, Jianming Long, et al. Skill Retrieval Augmentation for Agentic AI, 2026
- [33] Darren Edge, Ha Trinh, et al. From Local to Global: A Graph RAG Approach to Query-Focused Summarization, 2024
- [34] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. doi: 10.18653/v1/D19-1410
- [35] Yujian Liu, Jiabao Ji, et al. How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings, 2026
- [36] Benjamin Mikek, Danylo Vashchilenko, et al. Agentic Code Optimization via Compiler-LLM Cooperation, 2026
- [37] Sehoon Kim, Suhong Moon, et al. An LLM Compiler for Parallel Function Calling. In International Conference on Machine Learning (ICML), 2024. doi: 10.48550/arXiv.2312.04511
- [38] Le Chen, Erhu Feng, et al. SkVM: Revisiting Language VM for Skills Across Heterogeneous LLMs and Harnesses, 2026
- [39] A. B. V. Kumar. Deep Dive SKILL.md (Part 1/2): Negative Boundaries and Triggering Accuracy. URL https://abvijaykumar.medium.com/deep-dive-skill-md-part-1-2-09fc9a536996
- [41] Hao Wang, Niels Mündler, et al. SecPI: Secure code generation with reasoning models via security reasoning internalization, 2026
- [42] Xiangyi Li, Wenbo Chen, et al. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, 2026
- [43] NextLevelBuilder. UI/UX Pro Max Skill: An Agent Skill for UI/UX design tasks, 2026. URL https://github.com/nextlevelbuilder/ui-ux-pro-max-skill
- [44] Harbor Framework Team. Harbor: A Framework for Evaluating and Optimizing Agents and Models in Container Environments, 2026. URL https://github.com/harbor-framework/harbor
- [45] Xingyao Wang, Simon Rosenberg, et al. The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents. In Conference on Machine Learning and Systems (MLSys), 2026. doi: 10.48550/arXiv.2511.03690