
arxiv: 2605.22634 · v1 · pith:UFIATGT5new · submitted 2026-05-21 · 💻 cs.SE · cs.AI

Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

Pith reviewed 2026-05-22 03:43 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords contractual skills · GovernSpec · enterprise AI agents · skill design framework · task contracts · governance layer · tool calling · maintainability

The pith

Contractual skills turn agent instructions into inspectable task contracts for enterprise governance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes contractual skills, a design framework that organizes SKILL.md files to express not only task guidance but also goals, input boundaries, permissions, evidence requirements, output contracts, quality criteria, verification steps, human approval points, and handoff rules. The structure keeps skills lightweight for discovery and progressive loading while adding explicit, readable contract elements drawn from GovernSpec. Two offline experiments test the approach: a text-generation study that yields 960 outputs across three enterprise skills, fifteen synthetic tasks, four instruction conditions, and eight models, and a tool-calling challenge that records 192 simulated tool calls across eight models. Contractual skills beat the no-skill and minimal-skill baselines on all eight models, and skills usually reduce high-risk tool attempts, yet contractual skills show only small, mixed gains over information-rich plain expanded skills. The results position the framework as a governance layer that clarifies intent and boundaries rather than as a standalone performance or safety booster.

Core claim

Contractual skills package agent instructions as readable task contracts in SKILL.md files. They add explicit fields for goals, boundaries, permissions, evidence, outputs, quality, verification, approvals, and handoffs while preserving lightweight discovery. In text-generation tests they outperform no-skill and minimal-skill baselines across all eight models; relative to detailed plain skills the gains are small and mixed. In tool-calling tests they usually reduce risky calls but model variation persists and runtime guardrails remain necessary. The framework therefore functions as an explicit governance layer rather than a replacement for other controls.

What carries the argument

Contractual skill structure in SKILL.md files, which adds GovernSpec-inspired contract fields to standard skill packaging.
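
As a minimal sketch of what "inspectable" could mean in practice, the snippet below checks whether a SKILL.md text carries a heading for each of the nine contract elements named in the abstract. The heading spellings, the `missing_contract_fields` helper, and the example skill are all assumptions made for illustration; the paper's actual SKILL.md layout is not shown on this page.

```python
import re

# The nine contract elements listed in the abstract.
# The exact heading wording is an assumption, not the paper's schema.
CONTRACT_FIELDS = [
    "Goal",
    "Input boundaries",
    "Permissions",
    "Evidence requirements",
    "Output contract",
    "Quality criteria",
    "Verification steps",
    "Human approval points",
    "Handoff rules",
]

# Matches a markdown heading line and captures its text.
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_contract_fields(skill_md: str) -> list[str]:
    """Return the contract fields that have no matching heading."""
    headings = {m.group(1).lower() for m in HEADING_RE.finditer(skill_md)}
    return [name for name in CONTRACT_FIELDS if name.lower() not in headings]


# Hypothetical, deliberately incomplete skill file.
EXAMPLE_SKILL = """\
# Invoice triage (hypothetical skill)
## Goal
Classify incoming invoices.
## Permissions
Read-only access to the invoice inbox.
## Output contract
A JSON object with `category` and `confidence`.
"""

missing = missing_contract_fields(EXAMPLE_SKILL)
print(missing)
```

A check like this only confirms that the fields are present for a human reviewer; it says nothing about whether a model follows them, which is consistent with the paper's framing of the contract as a governance layer.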

If this is right

  • Contractual skills improve checkability and maintainability of enterprise agent tasks.
  • They reduce some high-risk tool attempts but still require separate runtime guardrails.
  • Model-specific differences in tool behavior remain visible even with contractual structure.
  • The main value lies in making task intent and acceptance criteria explicit for human review.
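
The second point above, that a contract does not replace runtime guardrails, can be sketched as a separate check that runs on every tool call regardless of what the model read in the skill. Everything here is hypothetical: the `SkillPermissions` type, the `check_tool_call` function, and the tool names are illustrative assumptions, not an API from the paper or from any agent SDK.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class SkillPermissions:
    """Permissions a contractual skill might declare (hypothetical shape)."""
    allowed_tools: frozenset[str]
    approval_required: frozenset[str] = field(default_factory=frozenset)


def check_tool_call(perms: SkillPermissions, tool_name: str) -> Decision:
    """Decide on a tool call at runtime, independent of model behaviour."""
    # Approval points take priority over plain allow-listing.
    if tool_name in perms.approval_required:
        return Decision.NEEDS_APPROVAL
    if tool_name in perms.allowed_tools:
        return Decision.ALLOW
    # Anything the contract does not mention is denied by default.
    return Decision.DENY


# Hypothetical permissions for an invoice-handling skill.
perms = SkillPermissions(
    allowed_tools=frozenset({"read_invoice", "search_tickets"}),
    approval_required=frozenset({"issue_refund"}),
)

for tool in ("read_invoice", "issue_refund", "delete_customer"):
    print(tool, check_tool_call(perms, tool).value)
```

The deny-by-default branch is a design choice in this sketch: the skill text states intent, while the guardrail enforces it even when a model attempts a call the contract never authorized.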

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The structure could be combined with live tracing systems to create auditable agent histories without extra overhead.
  • Enterprise teams might adopt it first for high-stakes workflows where handoff rules and approval points matter most.
  • Testing the same fields in dynamic multi-agent environments could reveal whether the contract format scales beyond single-skill use.

Load-bearing premise

Offline results from synthetic tasks and simulated tool calls will translate to practical gains in real enterprise deployments and governance.

What would settle it

A production enterprise deployment in which contractual skills produce no measurable improvement in audit logs, error tracing, or maintenance effort compared with information-rich plain skills.

Figures

Figures reproduced from arXiv: 2605.22634 by Ting Liu.

Figure 1: Contractual skills sit between a structured task contract and runtime enforcement. They …
Figure 2: Cross-judge text-generation scores by model and instruction condition. Contractual skills …
Figure 3: Matched score differences for the contractual condition. The gain over no-skill is positive …
Figure 4: High-risk tool attempts in the offline tool-calling challenge. Skills usually reduce risky …
Figure 5: Risk and completion trade-off under the contractual condition. Some models avoid …
Original abstract

Skills are increasingly used to package agent instructions, workflows, scripts, and reference materials. In enterprise settings, however, skills often need to express more than task guidance: they must make goals, input boundaries, permissions, evidence requirements, output contracts, quality criteria, verification steps, human approval points, and handoff rules inspectable. This paper proposes contractual skills, a GovernSpec-inspired design framework for organizing SKILL.md files as readable task contracts while preserving lightweight skill discovery and progressive loading. The framework clarifies the boundary between contractual skills, GovernSpec YAML contracts, Model Context Protocol surfaces, tool adapters, runtime guardrails, tracing, and evaluation systems. We evaluate the framework with two offline experiments. A text-generation study covers three enterprise skills, fifteen synthetic tasks, four instruction conditions, and eight generation models, yielding 960 outputs and 1680 cross-judge score records. Contractual skills outperform no-skill and minimal-skill baselines on all tested models. Relative to information-rich plain expanded skills, the gains are small and mixed, suggesting that contractual fields mainly improve checkability and maintainability rather than raw generation quality. A tool-calling challenge covers eight models and 192 simulated tool-call records. Skills usually reduce high-risk tool attempts, but model differences remain and runtime tool guardrails are still required. The results suggest that contractual skills are best understood as a governance layer that makes task intent, boundaries, and acceptance criteria explicit, not as a standalone safety mechanism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes contractual skills as a GovernSpec-inspired design framework for structuring SKILL.md files in enterprise AI agents. The framework organizes instructions to explicitly include goals, input boundaries, permissions, evidence requirements, output contracts, quality criteria, verification steps, human approval points, and handoff rules, while preserving lightweight discovery and progressive loading. It distinguishes contractual skills from GovernSpec YAML contracts, Model Context Protocol surfaces, tool adapters, runtime guardrails, tracing, and evaluation systems. Evaluation uses two offline experiments: a text-generation study with three enterprise skills, fifteen synthetic tasks, four instruction conditions, and eight models (960 outputs, 1680 cross-judge scores) showing outperformance over no-skill and minimal-skill baselines but small/mixed gains versus information-rich expanded skills; and a tool-calling study with eight models and 192 simulated tool-call records indicating reduced high-risk tool attempts. The authors conclude that contractual skills function primarily as a governance layer improving checkability and maintainability rather than raw generation quality.

Significance. If the framework holds, it offers a practical structure for making enterprise AI agent skills more auditable and governable, potentially aiding compliance and human oversight in production deployments. The experiments provide concrete evidence of benefits over minimal baselines and some reduction in risky tool use across models. The work contributes a clear conceptual separation of concerns that could guide practitioners. However, significance is moderated because the core interpretive claim about checkability and maintainability rests on inference from generation results rather than direct measurement, and the studies use synthetic tasks and simulated calls.

major comments (2)
  1. [Abstract] The central interpretive claim that 'contractual fields mainly improve checkability and maintainability rather than raw generation quality' and function as a 'governance layer' is not supported by direct evidence. The text-generation and tool-calling experiments measure only output scores and high-risk tool attempts; no quantitative or qualitative data on inspection time, ambiguity detection, audit trail usability, or compliance verification accuracy are reported. This inference from the absence of large gains versus expanded skills is load-bearing for the framework's primary value proposition.
  2. [Evaluation, text-generation study] The reported 960 outputs and 1680 cross-judge scores demonstrate outperformance over baselines, but the manuscript provides no details on statistical tests, inter-rater reliability, confidence intervals, or access to raw data. This limits assessment of whether the small/mixed gains versus expanded skills are robust or merely noise, directly affecting the strength of the governance-layer conclusion.
minor comments (2)
  1. [Framework description] The boundary clarifications among contractual skills, GovernSpec YAML, MCP surfaces, and guardrails are useful but would be strengthened by a comparative table or diagram in the framework section.
  2. [Evaluation] Specify the exact eight generation models and any prompt templates used in the experiments to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on strengthening the evidential basis for our interpretive claims and on providing fuller statistical reporting. We respond to each major comment below and indicate the revisions we will make.

  1. Referee: [Abstract] The central interpretive claim that 'contractual fields mainly improve checkability and maintainability rather than raw generation quality' and function as a 'governance layer' is not supported by direct evidence. The text-generation and tool-calling experiments measure only output scores and high-risk tool attempts; no quantitative or qualitative data on inspection time, ambiguity detection, audit trail usability, or compliance verification accuracy are reported. This inference from the absence of large gains versus expanded skills is load-bearing for the framework's primary value proposition.

    Authors: We agree that direct measurements of checkability (e.g., inspection time, ambiguity detection rates, or audit accuracy) would constitute stronger evidence. The experiments were deliberately scoped to first establish that contractual skills preserve or improve generation quality relative to baselines; the small and mixed gains versus information-rich expanded skills then provide indirect support for interpreting the primary value as governance rather than performance enhancement. We accept that this remains an inference. In revision we will (1) rephrase the abstract and conclusion to present the governance-layer interpretation as a supported hypothesis rather than a definitive conclusion, (2) add an explicit limitations paragraph noting the absence of direct usability metrics, and (3) outline planned follow-up studies that would collect inspection-time and audit-trail data. These changes address the load-bearing concern without altering the experimental results.

    Revision: partial

  2. Referee: [Evaluation, text-generation study] The reported 960 outputs and 1680 cross-judge scores demonstrate outperformance over baselines, but the manuscript provides no details on statistical tests, inter-rater reliability, confidence intervals, or access to raw data. This limits assessment of whether the small/mixed gains versus expanded skills are robust or merely noise, directly affecting the strength of the governance-layer conclusion.

    Authors: We concur that the current manuscript lacks the statistical detail needed for readers to evaluate robustness. In the revised version we will insert a dedicated statistical analysis subsection that reports: (a) results of appropriate paired tests (Wilcoxon signed-rank or t-tests) with p-values for all condition comparisons, (b) inter-rater reliability via Fleiss’ kappa across the three judges, (c) 95% confidence intervals around mean scores for each condition and model, and (d) a link to a public repository containing the raw 960 outputs, 1680 scores, and analysis scripts. These additions will allow direct assessment of whether the observed patterns are statistically reliable.

    Revision: yes
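
To make two of the promised statistics concrete, here is a rough standard-library sketch of Fleiss' kappa and of a 95% interval on paired score differences. It is not the authors' analysis: the interval uses a percentile bootstrap as a stand-in (the rebuttal names paired tests but no CI method), the function names are hypothetical, and every number below is made-up toy data rather than anything from the paper.

```python
import random
import statistics


def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = raters who put item i in category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Per-item observed agreement, averaged over items.
    p_bar = statistics.fmean(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    )
    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: 4 outputs, 3 judges, 2 categories (pass / fail).
toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
print("kappa:", fleiss_kappa(toy_counts))

# Toy paired differences (contractual minus expanded) on matched tasks.
toy_diffs = [0.2, -0.1, 0.0, 0.3, -0.2, 0.1, 0.0, 0.1]
print("95% CI:", bootstrap_ci(toy_diffs))
```

An interval that straddles zero on the contractual-versus-expanded differences would be one way to read "small and mixed" as statistically indistinguishable, which is the question the referee raises.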

Circularity Check

0 steps flagged

No circularity: framework definition independent of experimental outcomes


The paper defines contractual skills as a GovernSpec-inspired organization of SKILL.md files that makes goals, boundaries, permissions, and verification steps explicit. It then reports separate offline experiments (960 generations, 1680 cross-judge scores, 192 simulated tool calls) using external models and synthetic tasks. The interpretive claim that contractual fields 'mainly improve checkability and maintainability' is presented as a suggestion drawn from the observed small/mixed gains versus expanded skills, not as a premise built into the framework definition itself. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the provided text. The evaluation metrics (generation scores, high-risk tool attempts) are distinct from the framework's structural elements, so the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper is a design framework proposal evaluated empirically; it introduces no mathematical free parameters, standard axioms, or new physical entities. The central contribution is the contractual skills concept itself, which functions as an invented structuring approach whose value is tested through the described experiments.

invented entities (1)
  • contractual skills (no independent evidence)
    purpose: Organizing SKILL.md files as readable task contracts that make goals, input boundaries, permissions, evidence requirements, output contracts, quality criteria, verification steps, human approval points, and handoff rules inspectable.
    New design framework introduced to address enterprise needs for explicit governance in agent skills; no independent falsifiable evidence outside the paper's own experiments is provided.

pith-pipeline@v0.9.0 · 5787 in / 1357 out tokens · 37392 ms · 2026-05-22T03:43:59.676865+00:00 · methodology


