arxiv: 2606.17099 · v1 · pith:G6VL3OWCnew · submitted 2026-06-14 · 💻 cs.SE · cs.AI

Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work

Pith reviewed 2026-06-27 04:14 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords delegation contracts · AI coding agents · reviewability · evidence sufficiency · software delegation · TypeScript tasks · model-based reviewers · acceptance tests

The pith

Explicit delegation contracts improve reviewability of AI coding agent work without changing objective task outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests explicit software delegation contracts as a way to structure tasks given to AI coding agents. In a controlled pilot with 64 runs across ten small TypeScript tasks that contained seeded defects, all executions passed hidden acceptance checks with no scope violations regardless of prompt style. Contracts raised evidence sufficiency in 22 of 30 paired comparisons and lowered reviewer ambiguity, while also causing required sections such as known limitations and residual risk to appear. These gains came at the cost of 13 percent more tokens and 38 percent more wall-clock time, with larger effects on the weaker model. The results indicate that contracts mainly enhance how reviewable the returned work is rather than its correctness.

Core claim

In this pilot, explicit delegation contracts did not alter the fact that every one of the 64 agent runs passed hidden acceptance tests with zero scope violations, yet they produced measurable gains in reviewability: evidence sufficiency rose by 0.83 points on a 5-point scale, reviewer ambiguity fell, and contract-specified sections such as changed-file lists and reviewer checklists appeared mostly or exclusively when the contract format required them.

What carries the argument

The software delegation contract, defined as the unit covering the task, authority bounds, returned work package, and acceptance context, used to structure and analyze delegated coding work.
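A minimal sketch of how the four contract parts named above could be encoded and checked mechanically. The class, field names, example task, and glob patterns are hypothetical illustrations, not the paper's schema or harness.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import List, Tuple


@dataclass(frozen=True)
class DelegationContract:
    """Hypothetical encoding of the four contract parts named in the text."""
    task: str                       # what the agent is asked to do
    authority: Tuple[str, ...]      # glob patterns the agent may modify
    work_package: Tuple[str, ...]   # report sections the agent must return
    acceptance_context: str         # how the returned work will be judged


def scope_violations(contract: DelegationContract, changed_files: List[str]) -> List[str]:
    """Return changed files that match none of the authority patterns."""
    return [
        path for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in contract.authority)
    ]


def missing_sections(contract: DelegationContract, report_sections: List[str]) -> List[str]:
    """Return required work-package sections absent from the agent's report."""
    present = {section.lower() for section in report_sections}
    return [s for s in contract.work_package if s.lower() not in present]


# Illustrative values only; none of these come from the paper.
contract = DelegationContract(
    task="Fix pagination off-by-one in the list endpoint",
    authority=("src/api/*.ts", "test/api/*.ts"),
    work_package=("changed files", "known limitations", "residual risk", "reviewer checklist"),
    acceptance_context="hidden acceptance tests plus reviewer rubric",
)

print(scope_violations(contract, ["src/api/list.ts", "package.json"]))
print(missing_sections(contract, ["Changed files", "Residual risk"]))
```

Keeping the authority bounds and required sections as data means scope analysis and section-presence checks can run without reading the agent's prose, which is roughly the kind of mechanical check the study describes.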

Load-bearing premise

The assumption that results from a dependency-free TypeScript environment with ten small seeded-defect tasks and model-based reviewers serve as a valid proxy for real-world software delegation and human review.

What would settle it

Running the same outputs through human reviewers and measuring whether contracts reduce review time, clarification requests, or missed defects compared with issue-style prompts.

Original abstract

AI coding agents increasingly accept assigned software tasks, modify repositories under bounded authority, and return work packages for review. Prior work proposed the software delegation contract, covering the task, authority, returned work package, and acceptance context, as the unit of analysis for delegated coding work, but did not measure its effects. This paper reports a controlled pilot study of explicit delegation contracts for coding agents. We built a dependency-free TypeScript API task environment with seeded defects and documentation gaps, authored ten tasks across five families, and ran 64 agent executions across two model tiers under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract with a required evidence bundle. Each run was scored with hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three independent condition-blinded model-based reviewers using a fixed rubric, for 192 reviews. Explicit contracts did not improve objective task outcomes: all 64 runs passed hidden acceptance checks, with zero scope violations. They did improve reviewability. Evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66); reviewer ambiguity decreased (p = 0.035); changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared mostly or only when demanded by the contract. Contracts cost +13% agent tokens and +38% wall-clock time, with larger effects for the weaker model tier. On these small tasks, delegation contracts bought reviewability rather than correctness.
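The paired-comparison numbers in the abstract can be sanity-checked with a short calculation. The sketch below assumes an exact two-sided sign test, which is one plausible choice; the abstract does not name the test the authors used. The scores fed to `cliffs_delta` are made up to show the computation and do not reproduce the reported 0.66.

```python
from math import comb


def sign_test_two_sided(improved, worsened):
    """Exact two-sided sign test p-value; ties are dropped before testing."""
    n = improved + worsened
    k = max(improved, worsened)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Counts from the abstract: 22 improved, 0 worsened, so 8 of 30 pairs tied.
p = sign_test_two_sided(improved=22, worsened=0)
print(f"sign test p = {p:.2e}")

# Hypothetical 5-point scores, used only to demonstrate the formula.
contract_scores = [4, 5, 4, 3, 5]
issue_scores = [3, 3, 4, 2, 4]
print(f"Cliff's delta = {cliffs_delta(contract_scores, issue_scores):.2f}")
```

Under this assumed test, 22 improvements against 0 regressions gives a p-value of 2 / 2^22, which is well below the reported threshold of 0.0001.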

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports a controlled pilot study of explicit software delegation contracts for AI coding agents. In a dependency-free TypeScript environment with ten seeded-defect tasks, 64 agent executions under three conditions (issue-style prompt, explicit contract, contract with evidence bundle) were evaluated via hidden acceptance tests, mutation checks, scope analysis, and 192 condition-blinded model-based reviews. The central finding is that contracts produced no change in objective outcomes (all 64 runs passed acceptance checks with zero scope violations) but improved reviewability metrics, including evidence sufficiency (+0.83 on a 5-point scale in 22 of 30 paired comparisons, p<0.0001, Cliff's delta=0.66), reduced ambiguity (p=0.035), and greater inclusion of changed-file lists, known-limitations, residual-risk sections, and checklists, at the cost of +13% tokens and +38% wall-clock time.

Significance. If the results hold, the work supplies the first quantitative evidence that delegation contracts can measurably improve reviewability of AI coding work independently of correctness, using paired comparisons, p-values, and effect sizes. The controlled design with blinded reviewers and hidden tests is a strength that could serve as a template for subsequent studies on human-AI software delegation.

major comments (3)
  1. [Abstract] Abstract and Results: The headline claim that contracts 'bought reviewability rather than correctness' rests on an untested assumption. All 64 runs passed hidden acceptance checks with zero scope violations, producing zero variance in the correctness metric; the experiment therefore cannot distinguish whether contracts would have raised, lowered, or left unchanged the failure rate in a regime where baseline failures are observable.
  2. [Methods] Methods: The choice of three condition-blinded model-based reviewers with a fixed rubric, rather than human reviewers, is load-bearing for the reviewability claims (evidence sufficiency, ambiguity reduction). The paper does not report any validation that model-based scores correlate with human reviewer judgments on the same artifacts.
  3. [Discussion] Discussion: The interpretation that contracts improve reviewability 'rather than' correctness is undercut by the ceiling effect; a revised framing limited to the observed data (reviewability gains with no detectable correctness effect under these conditions) would be more proportionate to the evidence.
minor comments (2)
  1. [Abstract] The abstract states that raw data and code are not provided; adding a data-availability statement with a repository link would strengthen reproducibility claims.
  2. [Results] A table or figure presenting the per-task and per-model breakdown of the 30 paired comparisons would clarify which conditions drove the +0.83 evidence-sufficiency improvement.
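The ceiling effect raised in major comment 1 can be made concrete. With zero failures in 64 runs, the data only bound the underlying failure rate from above. The sketch below uses the exact one-sided binomial (Clopper-Pearson) bound for the zero-failure case; the 95% confidence level is an assumption chosen for illustration, and pooling all 64 runs ignores the split across conditions and model tiers.

```python
def failure_rate_upper_bound(n_runs, confidence=0.95):
    """One-sided exact upper bound on the failure rate when zero failures are seen.

    With zero failures, the bound p solves (1 - p) ** n_runs = 1 - confidence.
    """
    return 1 - (1 - confidence) ** (1 / n_runs)


upper = failure_rate_upper_bound(64)
print(f"95% upper bound on failure rate: {upper:.4f}")
print(f"rule-of-three approximation:     {3 / 64:.4f}")
```

Any true failure rate below this bound is compatible with the observed 64 of 64 passes, so a difference in correctness between prompt conditions that lies inside that range could not be detected by this pilot.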

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our pilot study. We address each major comment below and will make targeted revisions to improve precision and acknowledge limitations.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Results: The headline claim that contracts 'bought reviewability rather than correctness' rests on an untested assumption. All 64 runs passed hidden acceptance checks with zero scope violations, producing zero variance in the correctness metric; the experiment therefore cannot distinguish whether contracts would have raised, lowered, or left unchanged the failure rate in a regime where baseline failures are observable.

    Authors: We agree that the pilot exhibits a ceiling effect, with zero variance in correctness outcomes. The abstract phrasing implies a general trade-off that the data cannot support. We will revise the abstract to state that contracts produced reviewability gains with no detectable correctness effect under the tested conditions, removing the 'rather than' framing. revision: yes

  2. Referee: [Methods] Methods: The choice of three condition-blinded model-based reviewers with a fixed rubric, rather than human reviewers, is load-bearing for the reviewability claims (evidence sufficiency, ambiguity reduction). The paper does not report any validation that model-based scores correlate with human reviewer judgments on the same artifacts.

    Authors: The absence of reported correlation between model-based and human judgments is a genuine methodological limitation. We will add an explicit discussion of this choice in the Methods and Limitations sections, including the rationale for using blinded model reviewers (consistency, scalability, and blinding feasibility in the pilot) and noting the lack of human validation as an important avenue for future work. No new validation data will be added at this stage. revision: yes

  3. Referee: [Discussion] Discussion: The interpretation that contracts improve reviewability 'rather than' correctness is undercut by the ceiling effect; a revised framing limited to the observed data (reviewability gains with no detectable correctness effect under these conditions) would be more proportionate to the evidence.

    Authors: We accept this critique. The current interpretation overreaches given the ceiling effect. We will revise the Discussion (and cross-referenced sections) to restrict claims to the observed data, explicitly highlighting the ceiling effect and the pilot-study scope. revision: yes
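The human-validation step discussed in the second response could start with a rank correlation between model-based and human rubric scores on the same artifacts. The sketch below computes Spearman's correlation with tie-averaged ranks using only the standard library. The score lists are hypothetical placeholders, since the paper reports no human scores.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the averaged ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical evidence-sufficiency scores for six artifacts.
model_scores = [4, 5, 3, 4, 2, 5]
human_scores = [3, 5, 3, 4, 2, 4]
print(round(spearman(model_scores, human_scores), 3))
```

A rank-based measure suits 5-point rubric scores because it does not assume the scale is interval-valued; an agreement statistic for ordinal ratings would be a reasonable alternative.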

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements

full rationale

The paper's central claims rest on a controlled experiment reporting objective pass rates (all 64 runs passed hidden acceptance checks with zero scope violations) and reviewability metrics (evidence sufficiency improved in 22/30 paired comparisons, p<0.0001). These are direct observations from 192 blinded reviews under three prompt conditions; no equations, fitted parameters, or derivations are present that could reduce outputs to inputs by construction. Prior work is referenced only for background on the contract concept, not as load-bearing justification for the measured effects. The ceiling-effect observation affects interpretation strength but does not create circularity in the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study relies on standard statistical assumptions for p-values and effect sizes; no free parameters or invented entities are introduced. The task environment and reviewer rubric are constructed for the study.

axioms (2)
  • standard math Statistical tests (p-values, Cliff's delta) are valid under the sample sizes and distributions used.
    Invoked when reporting p < 0.0001 and p = 0.035.
  • domain assumption Model-based reviewers produce scores comparable to human reviewers for the rubric.
    Used for the 192 reviews; not tested against humans in the abstract.

pith-pipeline@v0.9.1-grok · 5813 in / 1276 out tokens · 38668 ms · 2026-06-27T04:14:55.435783+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 8 canonical work pages · 4 internal anchors

  1. [1]

    All 64 runs passed the hidden acceptance checks; no run violated its authority boundary

    Objective outcomes saturated. All 64 runs passed the hidden acceptance checks; no run violated its authority boundary. On small, well-specified tasks, current models do not need a contract to succeed

  2. [2]

    Evidence sufficiency rose in 22 of 30 paired comparisons and fell in none (+0.83 on a 5-point scale; p < 0.0001; Cliff’s 𝛿 = 0.66)

    Reviewability changed. Evidence sufficiency rose in 22 of 30 paired comparisons and fell in none (+0.83 on a 5-point scale; p < 0.0001; Cliff’s 𝛿 = 0.66). Reviewer ambiguity fell (p = 0.035, judge-level p = 0.0001). Structured report elements appeared almost exclusively under contracts, and some elements (residual risks, reviewer checklists) appeared only ...

  3. [3]

    Agent tokens rose 13%, wall-clock time rose 38%, and tool invocations rose 23% (all p ≤ 0.001)

    Contracts added run cost. Agent tokens rose 13%, wall-clock time rose 38%, and tool invocations rose 23% (all p ≤ 0.001). That overhead bought evidence, not correctness

  4. [4]

    refactor

    The weaker model benefited more, suggesting contracts partially substitute for the reporting discipline stronger models exhibit unprompted. Contributions. (i) A reusable experimental harness for delegation-contract studies, built from a seeded task repository, paired prompts, hidden acceptance and mutation checks, mechanical scope analysis, and a blinded...

  5. [5]

    and meaningful-human-control accounts [10] supply the framing that delegation is a control relationship; human-AI teaming results [7] argue that team value depends on verification cost, which our evidence-sufficiency results operationalize. Studies of provenance and disclosure [11, 14] show that knowing how code was produced changes reviewer behavior; we c...

  6. [6]

    The delegation-contract framework itself, including the ⟨T, A, W, C⟩ model and the testing agenda this pilot executes, is developed in [4]

    and automated program repair [16] anticipate the artifact-with-evidence framing. The delegation-contract framework itself, including the ⟨T, A, W, C⟩ model and the testing agenda this pilot executes, is developed in [4]. 8 Conclusion. We turned the software-delegation-contract framework into a measurement instrument and ran it. In 64 controlled coding-ag...

  7. [7]

    Codex web

    OpenAI Codex Documentation. “Codex web”. https://developers.openai.com/codex/cloud. Accessed June 12, 2026

  8. [8]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” arXiv:2310.06770, 2023

  9. [9]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering”. arXiv:2405.15793, 2024

  10. [10]

    Software Delegates: Delegation Contracts for AI Coding Agents

    Vincent Schmalbach. “Software Delegates: Delegation Contracts for AI Coding Agents”. Working paper, 2026. https://www.vincentschmalbach.com/

  11. [11]

    AIDev: Studying AI Coding Agents on GitHub

    Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. “AIDev: Studying AI Coding Agents on GitHub”. arXiv:2602.09185, 2026

  12. [12]

    A model for types and levels of human interaction with automation

    Raja Parasuraman, Thomas B. Sheridan, and Christopher D. Wickens. “A model for types and levels of human interaction with automation”. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 30(3):286–297, 2000

  13. [13]

    Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff

    Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, Daniel S. Weld, and Walter S. Lasecki. “Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff”. AAAI, 2019

  14. [14]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”. arXiv:2306.05685, 2023

  15. [15]

    Agentless: Demystifying LLM-based Software Engineering Agents

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. “Agentless: Demystifying LLM-based Software Engineering Agents”. arXiv:2407.01489, 2024

  16. [16]

    Meaningful Human Control over Autonomous Systems: A Philosophical Account

    Filippo Santoni de Sio and Jeroen van den Hoven. “Meaningful Human Control over Autonomous Systems: A Philosophical Account”. Frontiers in Robotics and AI, 2018

  17. [17]

    A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions

    Ningzhi Tang, Meng Chen, Zheng Ning, Aakash Bansal, Yu Huang, Collin McMillan, and Toby Jia-Jun Li. “A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions”. arXiv:2405.16081, 2024

  18. [18]

    About GitHub Copilot cloud agent

    GitHub Docs. “About GitHub Copilot cloud agent”. https://docs.github.com/en/copilot/concepts/agents/cloud-agent/about-cloud-agent. Accessed June 12, 2026.

  19. [19]

    Claude Code overview

    Claude Code Documentation. “Claude Code overview”. https://code.claude.com/docs/en/overview. Accessed June 12, 2026

  20. [20]

    On Developers’ Self-Declaration of AI-Generated Code: An Analysis of Practices

    Syed Mohammad Kashif, Peng Liang, and Amjed Tahir. “On Developers’ Self-Declaration of AI-Generated Code: An Analysis of Practices”. arXiv:2504.16485, 2025

  21. [21]

    Bots in software engineering: a systematic mapping study

    Sivasurya Santhanam, Tobias Hecking, Andreas Schreiber, and Stefan Wagner. “Bots in software engineering: a systematic mapping study”. PeerJ Computer Science, 8:e866, 2022

  22. [22]

    Automatic Software Repair: A Bibliography

    Martin Monperrus. “Automatic Software Repair: A Bibliography”. arXiv:1807.00515, 2018.