pith. machine review for the scientific record. sign in

arxiv: 2605.03353 · v2 · submitted 2026-05-05 · 💻 cs.CR · cs.AI

Recognition: no theorem link

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:45 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agentsskill compilationintermediate representationportabilitysecuritycross-frameworkcompilation pipeline
0
0 comments X

The pith

SkCC compiles LLM agent skills into a portable intermediate representation that works across frameworks with built-in security.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agent skills written in Markdown vary widely in performance across different frameworks because of prompt formatting differences. The paper presents SkCC as a compiler that translates these skills into SkIR, a strongly-typed intermediate representation, to decouple the skill logic from any specific framework. This allows one skill to be adapted to many frameworks efficiently. A static optimizer adds security enforcement at compile time. Experiments show improved success rates, low compilation time, and token savings.

Core claim

The central discovery is that introducing classical compilation techniques, centered on a strongly-typed IR called SkIR, allows skill semantics to be separated from framework formatting. This enables a four-phase pipeline that reduces adaptation complexity from O(m × n) to O(m + n), while the Optimizer blocks vulnerabilities before deployment.

What carries the argument

SkIR, the strongly-typed intermediate representation that captures the full semantics of skills independently of framework-specific prompt formats.

If this is right

  • Adaptation effort scales linearly with the number of skills and frameworks rather than quadratically.
  • Security vulnerabilities are detected and blocked proactively with a 94.8% trigger rate.
  • Consistent pass rate improvements across tested frameworks like Claude Code and Kimi CLI.
  • Sub-10ms compilation latency and runtime token reductions of 10-46%.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Skill authors could create once and reach all frameworks without additional work.
  • It opens the possibility for a unified skill ecosystem or marketplace.
  • Similar compilation approaches might address portability in other prompt engineering domains.

Load-bearing premise

One strongly-typed IR is sufficient to represent the semantics of any Markdown-written skill without losing necessary details or expressiveness.

What would settle it

A test where SkCC-compiled skills perform no better than originals or still need per-framework tweaks on new frameworks would disprove the portability benefit.

Figures

Figures reproduced from arXiv: 2605.03353 by Xianwei Zhang, Yipeng Ouyang, Yi Xiao, Yuhao Gu.

Figure 1
Figure 1. Figure 1: Left: Agent workflow with SKCC integration. Skills are authored once as SKILL.md, com￾piled to framework-native formats, and loaded via progressive routing manifests at agent initialization. Right: Adaptation complexity reduction from O(m × n) to O(m + n). Traditional per-framework rewriting requires m × n manual adaptations. SKCC decouples skills and frameworks through a unified IR, requiring only m skill… view at source ↗
Figure 2
Figure 2. Figure 2: SKCC’s four-phase compilation pipeline. A unified SKILL.md source is parsed into a raw AST (Syntax Parser), transformed into a strongly-typed SKIR (IR Builder), validated and optimized by compile-time security analysis (Security Optimizer), and emitted into framework-native formats (Target Emitters). A representative capability of the IR level is nested data detection: when a skill declares schemas with ne… view at source ↗
Figure 3
Figure 3. Figure 3: Pass rate and mean reward comparison of Baseline vs. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Average relative pass rate improve￾ment across methods. 4.2.2 Security: Injection Trigger and Compilation Interception [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Cross-framework token and time efficiency heatmap. SKCC skills show con￾sistent reductions in total tokens and execu￾tion time across all frameworks. Claude token counts are reported in hundreds due to API measurement differences. Static Expansion vs. Dynamic Efficiency. Compilation introduces static structural overhead from XML tags, Anti-Skill constraints, and format hardening (ranging from +4% on Kimi t… view at source ↗
Figure 6
Figure 6. Figure 6: Ablation study radar chart: the same Kimi-compiled format produces divergent effects [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
read the original abstract

LLM agents increasingly rely on reusable skills (e.g., `SKILL.md`) to execute complex tasks, yet these artifacts lack portability: agent frameworks are highly sensitive to prompt formatting, leading to a large performance variation for the same skill. Nevertheless, most skills are authored once as format-agnostic Markdown, necessitating costly per-framework rewrites and also leaving security largely unaddressed, with widespread vulnerabilities in practice. To address this, we present SkCC, a compiler for LLM agents that introduces classical compilation design into agent skill development. SkCC centers on SkIR, a strongly-typed intermediate representation that decouples skill semantics from framework-specific formatting, thus enabling portable deployment across agent frameworks. Atop of this IR, a static Optimizer enforces security constraints, blocking vulnerabilities before deployment. Implemented as a four-phase pipeline, SkCC effectively reduces adaptation complexity from $O(m \times n)$ to $O(m + n)$ across $m$ skills and $n$ frameworks. Experiments on SkillsBench demonstrate that SkCC delivers consistent and substantial gains over original counterparts, with pass rate increases from 21.1% to 33.3% on Claude Code and from 35.1% to 48.7% on Kimi CLI. Further, the design achieves sub-10ms compilation latency, 94.8% proactive security trigger rate, and 10-46% runtime token savings across frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents SkCC, a compiler for LLM agent skills that introduces a strongly-typed intermediate representation (SkIR) to decouple skill semantics from framework-specific formatting. It describes a four-phase pipeline that reduces adaptation complexity from O(m × n) to O(m + n) across m skills and n frameworks, includes a static optimizer to enforce security constraints before deployment, and reports experimental results on SkillsBench showing pass-rate gains (21.1% to 33.3% on Claude Code; 35.1% to 48.7% on Kimi CLI), sub-10 ms compilation latency, 94.8% proactive security trigger rate, and 10-46% runtime token savings.

Significance. If the central claims hold, the work would be significant for standardizing skill development in LLM agents, offering a practical path to portability and proactive security via classical compilation techniques. The four-phase pipeline design and cross-framework empirical evaluation on SkillsBench provide concrete evidence of reduced engineering effort and measurable efficiency gains, which could influence how reusable skills are authored and deployed in agent systems.

major comments (2)
  1. [Abstract and SkIR/pipeline description] The O(m + n) complexity reduction and portability guarantee rest on the assumption that SkIR can represent the complete semantics and intent of arbitrary natural-language Markdown skills without loss or the need for framework-specific extensions. No formal grammar, type definitions, or coverage analysis for edge cases (e.g., implicit context, conditional logic, or ambiguous instructions) is supplied in the SkIR definition or pipeline description, leaving open the possibility that expressiveness gaps would reintroduce per-framework adaptations.
  2. [Experimental evaluation] The reported performance and security metrics (pass-rate increases, 94.8% trigger rate, token savings) are load-bearing for the practicality claims, yet the experimental section provides no details on SkillsBench composition, data splits, baseline implementations, number of skills/frameworks tested, or statistical tests. This prevents verification that the gains are consistent and attributable to the compiler rather than experimental artifacts.
minor comments (2)
  1. [Abstract] The abstract introduces the four-phase pipeline but does not name the phases; a one-sentence enumeration would improve immediate clarity without lengthening the abstract.
  2. [Introduction] Acronyms SkCC and SkIR should be expanded on first use in the main body even if defined in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will strengthen the manuscript while defending the core contributions on the basis of the presented design and results.

read point-by-point responses
  1. Referee: The O(m + n) complexity reduction and portability guarantee rest on the assumption that SkIR can represent the complete semantics and intent of arbitrary natural-language Markdown skills without loss or the need for framework-specific extensions. No formal grammar, type definitions, or coverage analysis for edge cases (e.g., implicit context, conditional logic, or ambiguous instructions) is supplied in the SkIR definition or pipeline description, leaving open the possibility that expressiveness gaps would reintroduce per-framework adaptations.

    Authors: We acknowledge that the current manuscript presents SkIR at a descriptive level with illustrative examples rather than a complete formal grammar or exhaustive coverage analysis. The strongly-typed nature of SkIR is intended to capture core semantics (control flow, data dependencies, and security-relevant operations) through its type system, and the four-phase pipeline is designed to preserve these semantics during lowering. The empirical results on SkillsBench demonstrate that the evaluated skills compiled and executed portably without requiring framework-specific extensions, supporting the O(m + n) claim in practice. To address the concern directly, the revised manuscript will include an explicit formal grammar for SkIR, complete type definitions, and a dedicated subsection analyzing coverage of edge cases such as conditional logic and ambiguous instructions, drawing on the skills present in the benchmark. revision: yes

  2. Referee: The reported performance and security metrics (pass-rate increases, 94.8% trigger rate, token savings) are load-bearing for the practicality claims, yet the experimental section provides no details on SkillsBench composition, data splits, baseline implementations, number of skills/frameworks tested, or statistical tests. This prevents verification that the gains are consistent and attributable to the compiler rather than experimental artifacts.

    Authors: We agree that the experimental section must be expanded for reproducibility and to allow readers to verify that observed gains are attributable to SkCC. The manuscript currently reports aggregate pass-rate, latency, security-trigger, and token-saving figures but does not detail the benchmark composition, evaluation protocol, baseline implementations, exact numbers of skills and frameworks, or statistical tests. In the revision we will add: (i) a full description of SkillsBench including skill count, framework coverage, and task categories; (ii) the evaluation protocol and any data splits used; (iii) precise baseline implementations; (iv) the number of trials per configuration; and (v) statistical significance tests (e.g., paired t-tests with p-values) confirming that the reported improvements are consistent and not artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: design claim and empirical results are independent

full rationale

The abstract and provided text introduce SkCC as a four-phase compiler pipeline centered on a new strongly-typed SkIR that decouples Markdown skill semantics from framework formatting. The O(m+n) complexity reduction is a direct consequence of successful decoupling (standard compiler benefit) rather than a fitted or self-referential quantity. Reported gains (pass-rate improvements, sub-10ms latency, 94.8% security trigger rate, token savings) are presented as experimental measurements on SkillsBench, not as predictions derived from parameters fitted to the same data or from self-citations. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear in the supplied material that would collapse any central claim back to its inputs by construction. The design's correctness therefore rests on external validation rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract introduces SkIR and the optimizer as new constructs without enumerating free parameters or background axioms; the design implicitly assumes that skill semantics are fully expressible in a typed IR and that static analysis suffices for security.

invented entities (2)
  • SkIR no independent evidence
    purpose: strongly-typed intermediate representation that decouples skill semantics from framework-specific formatting
    Newly defined in the paper to enable portability.
  • SkCC Optimizer no independent evidence
    purpose: static enforcement of security constraints before deployment
    New component introduced for proactive vulnerability blocking.

pith-pipeline@v0.9.0 · 5560 in / 1447 out tokens · 60279 ms · 2026-05-12T01:45:17.913725+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 2 internal anchors

  1. [1]

    Jennings

    Michael Wooldridge and Nicholas R. Jennings. Intelligent Agents: Theory and Practice.The Knowledge Engineering Review, 10(2):115–152, 1995. doi: 10.1017/S0269888900008122

  2. [2]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Shunyu Yao, Dian Yu, et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. doi: 10.48550/arXiv.2305.10601

  3. [3]

    A survey on large language model based autonomous agents , volume =

    Lei Wang, Chen Ma, et al. A Survey on Large Language Model Based Autonomous Agents. Frontiers of Computer Science, 18(6):186345, 2024. doi: 10.1007/s11704-024-40231-1

  4. [4]

    Claude Code Overview, 2026

    Anthropic. Claude Code Overview, 2026. URL https://code.claude.com/docs/en/ov erview

  5. [5]

    Codex Documentation, 2026

    OpenAI. Codex Documentation, 2026. URLhttps://developers.openai.com/codex

  6. [6]

    Gemini CLI Documentation, 2026

    Google. Gemini CLI Documentation, 2026. URL https://google-gemini.github.io/ gemini-cli/docs/

  7. [7]

    Kimi CLI Documentation, 2026

    Kimi. Kimi CLI Documentation, 2026. URL https://moonshotai.github.io/kimi-cli /en/guides/getting-started.html

  8. [8]

    SKILL.md Specification and Progressive Disclosure Mechanism, 2026

    Agent Skills. SKILL.md Specification and Progressive Disclosure Mechanism, 2026. URL ht tps://deepwiki.com/agentskills/agentskills/2.2-skill.md-specification

  9. [9]

    Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward, 2026

    Renjun Xu, Yang Yan, et al. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward, 2026

  10. [10]

    Anthropic Skills: Public repository for Agent Skills, 2026

    Anthropic. Anthropic Skills: Public repository for Agent Skills, 2026. URL https://github .com/anthropics/skills

  11. [11]

    Everything Claude Code: The agent harness performance optimization system, 2026

    Affan-m. Everything Claude Code: The agent harness performance optimization system, 2026. URLhttps://github.com/affaan-m/everything-claude-code

  12. [12]

    Sentry Skills: Agent Skills used by the Sentry team for development, 2026

    getSentry. Sentry Skills: Agent Skills used by the Sentry team for development, 2026. URL https://github.com/getsentry/skills

  13. [13]

    Does Prompt Formatting Have Any Impact on LLM Performance?, 2024

    Jia He, Mukund Rungta, et al. Does Prompt Formatting Have Any Impact on LLM Performance?, 2024

  14. [14]

    Claude API Docs: Prompting Best Practices — Structure Prompts with XML Tags,

    Anthropic. Claude API Docs: Prompting Best Practices — Structure Prompts with XML Tags,

  15. [15]

    URL https://platform.claude.com/docs/en/build-with-claude/prompt-e ngineering/claude-prompting-best-practices

  16. [16]

    Structured Outputs and Format Tax Elimination, 2025

    OpenAI. Structured Outputs and Format Tax Elimination, 2025. URL https://platform.o penai.com/docs/guides/structured-outputs. 10

  17. [17]

    Which Nested Data Format Do LLMs Understand Best? JSON vs

    Improving Agents. Which Nested Data Format Do LLMs Understand Best? JSON vs. Y AML vs. XML vs. Markdown, 2025. URL https://www.improvingagents.com/blog/best-n ested-data-format/

  18. [18]

    Snyk Finds Prompt Injection in 36%, 1467 Malicious Payloads in a ToxicSkills Study of Agent Skills Supply Chain Compromise, 2026

    Luca Beurer-Kellner, Alexey Kudrinskii, et al. Snyk Finds Prompt Injection in 36%, 1467 Malicious Payloads in a ToxicSkills Study of Agent Skills Supply Chain Compromise, 2026. URL https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub /

  19. [19]

    SKILL.md Explained: How to Structure Your Product for AI Agents — Add Guardrails and Common Pitfalls, 2026

    Agent Skills. SKILL.md Explained: How to Structure Your Product for AI Agents — Add Guardrails and Common Pitfalls, 2026. URL https://www.gitbook.com/blog/skill-md

  20. [20]

    Model Context Protocol (MCP) Specification, 2025

    Anthropic. Model Context Protocol (MCP) Specification, 2025. URL https://modelconte xtprotocol.io/docs/

  21. [21]

    Aho, Ravi Sethi, et al.Compilers: Principles, Techniques, and Tools

    Alfred V . Aho, Ravi Sethi, et al.Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986. ISBN 0-201-10088-6

  22. [22]

    Muchnick.Advanced Compiler Design and Implementation

    Steven S. Muchnick.Advanced Compiler Design and Implementation. Morgan Kaufmann, San Francisco, CA, 1997. ISBN 1-55860-320-4

  23. [23]

    The Problem of Programming Communication with Changing Machines: A Proposed Solution.Communications of the ACM, 1(8):12–18, 1958

    John Strong, Joseph Wegstein, et al. The Problem of Programming Communication with Changing Machines: A Proposed Solution.Communications of the ACM, 1(8):12–18, 1958. doi: 10.1145/368892.368915

  24. [24]

    LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation

    Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. InInternational Symposium on Code Generation and Optimization (CGO), 2004. doi: 10.1109/CGO.2004.1281665

  25. [25]

    MLIR: Scaling Compiler Infrastructure for Domain Specific Computation

    Chris Lattner, Mehdi Amini, et al. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. InInternational Symposium on Code Generation and Optimization (CGO), 2021. doi: 10.1109/CGO51591.2021.9370308

  26. [26]

    SoK: Eternal war in memory

    Laszlo Szekeres, Mathias Payer, et al. SoK: Eternal War in Memory. InIEEE Symposium on Security and Privacy (S&P), 2013. doi: 10.1109/SP.2013.13

  27. [27]

    Anthropic’s Official Take on XML-Structured Prompting as the Core Strategy, 2026

    Reddit r/ClaudeAI. Anthropic’s Official Take on XML-Structured Prompting as the Core Strategy, 2026. URLhttps://www.reddit.com/r/ClaudeAI/comments/1psxuv7/

  28. [28]

    Roy Philip. JSON vs. XML: A Data-Driven Analysis of LLM Parsing Efficiency, 2025. URL https://royphilip.xyz/blog/json-vs-xml-llm-showdown

  29. [29]

    Prompt Engineering Across the OpenAI, Anthropic, and Gemini APIs, 2026

    Steve Kinney. Prompt Engineering Across the OpenAI, Anthropic, and Gemini APIs, 2026. URLhttps://stevekinney.com/writing/prompt-engineering-frontier-llms

  30. [30]

    Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization, 2025

    Yuanye Liu, Jiahang Xu, et al. Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization, 2025

  31. [31]

    Towards Learning Boulder Excavation with Hydraulic Excavators

    Renxi Wang, Xudong Han, et al. ToolGen: Unified Tool Retrieval and Calling via Generation. InInternational Conference on Learning Representations (ICLR), 2025. doi: 10.48550/arXiv.2 410.03439

  32. [32]

    Skill Retrieval Augmentation for Agentic AI, 2026

    Weihang Su, Jianming Long, et al. Skill Retrieval Augmentation for Agentic AI, 2026

  33. [33]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization, 2024

    Darren Edge, Ha Trinh, et al. From Local to Global: A Graph RAG Approach to Query-Focused Summarization, 2024

  34. [34]

    Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. InConference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. doi: 10.18653/v1/D19-1410

  35. [35]

    How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings, 2026

    Yujian Liu, Jiabao Ji, et al. How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings, 2026. 11

  36. [36]

    Agentic Code Optimization via Compiler-LLM Cooperation, 2026

    Benjamin Mikek, Danylo Vashchilenko, et al. Agentic Code Optimization via Compiler-LLM Cooperation, 2026

  37. [37]

    Mahoney, Kurt Keutzer, and Amir Gholami

    Sehoon Kim, Suhong Moon, et al. An LLM Compiler for Parallel Function Calling. In International Conference on Machine Learning (ICML), 2024. doi: 10.48550/arXiv.2312.04511

  38. [38]

    SkVM: Revisiting Language VM for Skills Across Heterogeneous LLMs and Harnesses, 2026

    Le Chen, Erhu Feng, et al. SkVM: Revisiting Language VM for Skills Across Heterogeneous LLMs and Harnesses, 2026

  39. [39]

    A. B. V . Kumar. Deep Dive SKILL.md (Part 1/2): Negative Boundaries and Triggering Accuracy,

  40. [40]

    URL https://abvijaykumar.medium.com/deep-dive-skill-md-part-1-2-0 9fc9a536996

  41. [41]

    SecPI: Secure code generation with reasoning models via security reasoning internalization, 2026

    Hao Wang, Niels Mündler, et al. SecPI: Secure code generation with reasoning models via security reasoning internalization, 2026

  42. [42]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, 2026

    Xiangyi Li, Wenbo Chen, et al. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, 2026

  43. [43]

    UI/UX Pro Max Skill: An Agent Skill for UI/UX design tasks, 2026

    NextLevelBuilder. UI/UX Pro Max Skill: An Agent Skill for UI/UX design tasks, 2026. URL https://github.com/nextlevelbuilder/ui-ux-pro-max-skill

  44. [44]

    Harbor: A Framework for Evaluating and Optimizing Agents and Models in Container Environments, 2026

    Harbor Framework Team. Harbor: A Framework for Evaluating and Optimizing Agents and Models in Container Environments, 2026. URL https://github.com/harbor-framework /harbor

  45. [45]

    The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

    Xingyao Wang, Simon Rosenberg, et al. The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents. InConference on Machine Learning and Systems (MLSys), 2026. doi: 10.48550/arXiv.2511.03690. A Implementation Details SKCC is implemented in Rust and organized into four crates: nexa-skill-cli.CLI entry point using clap for ...

  46. [46]

    46## P r o c e d u r e s

    **[ CRITICAL ]** 30## Pa ra met er Schema ( YAML O pt im ize d ) 15 31‘‘‘ yaml 32type : object 33p r o p e r t i e s : 34m i g r a t i o n _ c o n f i g : 35type : object 36p r o p e r t i e s : 37s our ce _d b : 38type : object 39p r o p e r t i e s : 40host : { type : string } 41‘‘‘ 42 43Kimi : # data - mi gr at io n 44## D e s c r i p t i o n 45... 46#...

  47. [47]

    w/ Reduction

    **[ CRITICAL ]** 48## Pa ra met er Schema 49- ‘ m i g r a t i o n _ c o n f i g . so ur ce _d b . host ‘ ( string ) : ... D Complete Experimental Data D.1 Four-Model Comparison Summary Table 7: Four-Model Comparison Summary Model Paired∆Rwd.p dVerdict claude-opus-4-6 22–27+0.26–0.270.0096** 0.59–0.60SKCC≫Baseline kimi-k2.5 74+0.1420.0063** 0.33SKCC>Baseli...