Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work
Pith reviewed 2026-06-27 04:14 UTC · model grok-4.3
The pith
Explicit delegation contracts improve reviewability of AI coding agent work without changing objective task outcomes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In this pilot, all 64 agent runs passed hidden acceptance tests with zero scope violations regardless of condition, so explicit delegation contracts had no detectable effect on objective outcomes. They did produce measurable gains in reviewability: evidence sufficiency rose by 0.83 points on a 5-point scale, reviewer ambiguity fell, and contract-specified sections such as changed-file lists and reviewer checklists appeared mostly or exclusively when the contract format required them.
What carries the argument
The software delegation contract, defined as the unit covering the task, authority bounds, returned work package, and acceptance context, used to structure and analyze delegated coding work.
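One way to picture the four parts named above is as a small record plus two mechanical checks. The sketch below is an illustration only: the class name, field names, glob-based authority bounds, and example task are all assumptions made here, not the paper's harness or schema.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class DelegationContract:
    """Hypothetical encoding of the four contract parts (not the paper's schema)."""
    task: str                       # what the agent is asked to do
    authority: tuple[str, ...]      # glob patterns the agent may modify (assumed form)
    work_package: tuple[str, ...]   # report sections the agent must return
    acceptance_context: str         # how the returned work will be judged


def scope_violations(contract, changed_files):
    """Return changed files that fall outside the authority bounds.

    Note: fnmatch's '*' also matches '/', so 'src/api/*.ts' covers subfolders.
    """
    return [f for f in changed_files
            if not any(fnmatch(f, pattern) for pattern in contract.authority)]


def missing_sections(contract, report_sections):
    """Return required work-package sections absent from the agent's report."""
    present = {s.lower() for s in report_sections}
    return [s for s in contract.work_package if s.lower() not in present]


# Hypothetical example task; none of these values come from the paper.
contract = DelegationContract(
    task="Fix pagination off-by-one in the list endpoint",
    authority=("src/api/*.ts", "test/*.ts"),
    work_package=("changed files", "known limitations",
                  "residual risks", "reviewer checklist"),
    acceptance_context="hidden acceptance tests plus rubric-based review",
)

print(scope_violations(contract, ["src/api/list.ts", "package.json"]))
print(missing_sections(contract, ["Changed files", "Reviewer checklist"]))
```

Keeping the authority bounds and required sections as data makes both checks mechanical, which is the property the scope analysis and section-presence counts rely on.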
Load-bearing premise
The assumption that results from a dependency-free TypeScript environment with ten small seeded-defect tasks and model-based reviewers serve as a valid proxy for real-world software delegation and human review.
What would settle it
Running the same outputs through human reviewers and measuring whether contracts reduce review time, clarification requests, or missed defects compared with issue-style prompts.
Original abstract
AI coding agents increasingly accept assigned software tasks, modify repositories under bounded authority, and return work packages for review. Prior work proposed the software delegation contract, covering the task, authority, returned work package, and acceptance context, as the unit of analysis for delegated coding work, but did not measure its effects. This paper reports a controlled pilot study of explicit delegation contracts for coding agents. We built a dependency-free TypeScript API task environment with seeded defects and documentation gaps, authored ten tasks across five families, and ran 64 agent executions across two model tiers under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract with a required evidence bundle. Each run was scored with hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three independent condition-blinded model-based reviewers using a fixed rubric, for 192 reviews. Explicit contracts did not improve objective task outcomes: all 64 runs passed hidden acceptance checks, with zero scope violations. They did improve reviewability. Evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66); reviewer ambiguity decreased (p = 0.035); changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared mostly or only when demanded by the contract. Contracts cost +13% agent tokens and +38% wall-clock time, with larger effects for the weaker model tier. On these small tasks, delegation contracts bought reviewability rather than correctness.
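For readers unfamiliar with the effect size quoted in the abstract, Cliff's delta is the share of cross-group pairs in which the treated score is higher minus the share in which it is lower. The sketch below computes it alongside a paired improved/worsened/tied summary. The scores are invented for illustration, and nothing here is claimed to reproduce the paper's analysis pipeline.

```python
def cliffs_delta(treated, control):
    """Cliff's delta: (#pairs treated > control - #pairs treated < control) / (n * m)."""
    greater = sum(1 for t in treated for c in control if t > c)
    less = sum(1 for t in treated for c in control if t < c)
    return (greater - less) / (len(treated) * len(control))


def paired_summary(pairs):
    """Summarize (baseline, contract) score pairs by direction of change."""
    diffs = [after - before for before, after in pairs]
    return {
        "improved": sum(d > 0 for d in diffs),
        "worsened": sum(d < 0 for d in diffs),
        "tied": sum(d == 0 for d in diffs),
        "mean_diff": sum(diffs) / len(diffs),
    }


# Hypothetical 5-point rubric scores as (issue-style prompt, contract) pairs.
pairs = [(3, 4), (3, 4), (4, 4), (2, 4), (4, 5)]

summary = paired_summary(pairs)
delta = cliffs_delta([after for _, after in pairs],
                     [before for before, _ in pairs])

print(summary)
print(round(delta, 2))
```

A delta of 0 means the two score distributions overlap completely, and values near 1 mean contract scores almost always exceed baseline scores.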
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a controlled pilot study of explicit software delegation contracts for AI coding agents. In a dependency-free TypeScript environment with ten seeded-defect tasks, 64 agent executions under three conditions (issue-style prompt, explicit contract, contract with evidence bundle) were evaluated via hidden acceptance tests, mutation checks, scope analysis, and 192 condition-blinded model-based reviews. The central finding is that contracts produced no change in objective outcomes (all 64 runs passed acceptance checks with zero scope violations) but improved reviewability metrics: evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p<0.0001, Cliff's delta=0.66), reviewer ambiguity fell (p=0.035), and changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared more often. These gains cost +13% agent tokens and +38% wall-clock time.
Significance. If the results hold, the work supplies the first quantitative evidence that delegation contracts can measurably improve reviewability of AI coding work independently of correctness, using paired comparisons, p-values, and effect sizes. The controlled design with blinded reviewers and hidden tests is a strength that could serve as a template for subsequent studies on human-AI software delegation.
major comments (3)
- [Abstract] Abstract and Results: The headline claim that contracts 'bought reviewability rather than correctness' rests on an untested assumption. All 64 runs passed hidden acceptance checks with zero scope violations, producing zero variance in the correctness metric; the experiment therefore cannot distinguish whether contracts would have raised, lowered, or left unchanged the failure rate in a regime where baseline failures are observable.
- [Methods] Methods: The choice of three condition-blinded model-based reviewers with a fixed rubric, rather than human reviewers, is load-bearing for the reviewability claims (evidence sufficiency, ambiguity reduction). The paper does not report any validation that model-based scores correlate with human reviewer judgments on the same artifacts.
- [Discussion] Discussion: The interpretation that contracts improve reviewability 'rather than' correctness is undercut by the ceiling effect; a revised framing limited to the observed data (reviewability gains with no detectable correctness effect under these conditions) would be more proportionate to the evidence.
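The ceiling-effect objection can be made concrete with a short calculation: zero failures in 64 runs still leaves room for a non-trivial underlying failure rate. The sketch below assumes the 64 runs are independent and share one failure probability, which is a simplification introduced here (runs share tasks and models), and the function name is hypothetical.

```python
def zero_failure_upper_bound(n_runs, confidence=0.95):
    """Exact one-sided upper bound on the failure rate after 0 failures in n_runs.

    Solves (1 - p) ** n_runs = 1 - confidence for p, assuming independent
    runs with a common failure probability (an assumption of this sketch).
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_runs)


bound = zero_failure_upper_bound(64)   # all 64 runs pooled across conditions
rule_of_three = 3 / 64                 # common quick approximation to the same bound

print(f"95% upper bound on failure rate: {bound:.3f}")
print(f"rule-of-three approximation:     {rule_of_three:.3f}")
```

Under these assumptions the data are compatible with a true failure rate of several percent, and the bound for any single condition would be looser because it rests on fewer runs.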
minor comments (2)
- [Abstract] The abstract states that raw data and code are not provided; adding a data-availability statement with repository link would strengthen reproducibility claims.
- [Results] A table or figure presenting the per-task and per-model breakdown of the 30 paired comparisons would clarify which conditions drove the +0.83 evidence-sufficiency improvement.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our pilot study. We address each major comment below and will make targeted revisions to improve precision and acknowledge limitations.
Point-by-point responses
- Referee: [Abstract] Abstract and Results: The headline claim that contracts 'bought reviewability rather than correctness' rests on an untested assumption. All 64 runs passed hidden acceptance checks with zero scope violations, producing zero variance in the correctness metric; the experiment therefore cannot distinguish whether contracts would have raised, lowered, or left unchanged the failure rate in a regime where baseline failures are observable.
  Authors: We agree that the pilot exhibits a ceiling effect, with zero variance in correctness outcomes. The abstract phrasing implies a general trade-off that the data cannot support. We will revise the abstract to state that contracts produced reviewability gains with no detectable correctness effect under the tested conditions, removing the 'rather than' framing. Revision: yes.
- Referee: [Methods] Methods: The choice of three condition-blinded model-based reviewers with a fixed rubric, rather than human reviewers, is load-bearing for the reviewability claims (evidence sufficiency, ambiguity reduction). The paper does not report any validation that model-based scores correlate with human reviewer judgments on the same artifacts.
  Authors: The absence of reported correlation between model-based and human judgments is a genuine methodological limitation. We will add an explicit discussion of this choice in the Methods and Limitations sections, including the rationale for using blinded model reviewers (consistency, scalability, and blinding feasibility in the pilot) and noting the lack of human validation as an important avenue for future work. No new validation data will be added at this stage. Revision: yes.
- Referee: [Discussion] Discussion: The interpretation that contracts improve reviewability 'rather than' correctness is undercut by the ceiling effect; a revised framing limited to the observed data (reviewability gains with no detectable correctness effect under these conditions) would be more proportionate to the evidence.
  Authors: We accept this critique. The current interpretation overreaches given the ceiling effect. We will revise the Discussion (and cross-referenced sections) to restrict claims to the observed data, explicitly highlighting the ceiling effect and the pilot-study scope. Revision: yes.
Circularity Check
No circularity: results are direct empirical measurements
Full rationale
The paper's central claims rest on a controlled experiment reporting objective pass rates (all 64 runs passed hidden acceptance checks with zero scope violations) and reviewability metrics (evidence sufficiency improved in 22/30 paired comparisons, p<0.0001). These are direct observations from 192 blinded reviews under three prompt conditions; no equations, fitted parameters, or derivations are present that could reduce outputs to inputs by construction. Prior work is referenced only for background on the contract concept, not as load-bearing justification for the measured effects. The ceiling-effect observation affects interpretation strength but does not create circularity in the reported chain.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] Statistical tests (p-values, Cliff's delta) are valid under the sample sizes and distributions used.
- [domain assumption] Model-based reviewers produce scores comparable to human reviewers for the rubric.
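As a plausibility check on the first axiom, the reported split of 22 improvements and 0 deteriorations can be run through an exact two-sided sign test, which drops the 8 tied comparisons. The source does not say which test the authors used, so the sign test is a stand-in chosen here, and it assumes the 30 paired comparisons are independent.

```python
from math import comb


def sign_test_p(improved, worsened):
    """Exact two-sided sign test p-value; ties are excluded before calling.

    Under the null hypothesis each non-tied pair is equally likely to
    improve or worsen, so the smaller count follows Binomial(n, 0.5).
    """
    n = improved + worsened
    k = min(improved, worsened)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Counts taken from the reported result: 22 improved, 0 worsened, 8 tied.
p_value = sign_test_p(22, 0)

print(f"{p_value:.2e}")
```

Even this conservative test lands well below 0.0001, so the direction of the evidence-sufficiency effect does not hinge on distributional assumptions. Whether model-based scores track human judgment is a separate question that no test on these scores can answer.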
Reference graph
Works this paper leans on
- [1] Objective outcomes saturated. All 64 runs passed the hidden acceptance checks; no run violated its authority boundary. On small, well-specified tasks, current models do not need a contract to succeed.
- [2] Reviewability changed. Evidence sufficiency rose in 22 of 30 paired comparisons and fell in none (+0.83 on a 5-point scale; p < 0.0001; Cliff’s 𝛿 = 0.66). Reviewer ambiguity fell (p = 0.035, judge-level p = 0.0001). Structured report elements appeared almost exclusively under contracts, and some elements (residual risks, reviewer checklists) appeared only ...
- [3] Contracts added run cost. Agent tokens rose 13%, wall-clock time rose 38%, and tool invocations rose 23% (all p ≤ 0.001). That overhead bought evidence, not correctness.
- [4] The weaker model benefited more, suggesting contracts partially substitute for the reporting discipline stronger models exhibit unprompted. Contributions. (i) A reusable experimental harness for delegation-contract studies, built from a seeded task repository, paired prompts, hidden acceptance and mutation checks, mechanical scope analysis, and a blinded...
- [5] ... and meaningful-human-control accounts [10] supply the framing that delegation is a control relationship; human-AI teaming results [7] argue that team value depends on verification cost, which our evidence-sufficiency results operationalize. Studies of provenance and disclosure [11, 14] show that knowing how code was produced changes reviewer behavior; we c...
- [6] ... and automated program repair [16] anticipate the artifact-with-evidence framing. The delegation-contract framework itself, including the ⟨𝑇, 𝐴, 𝑊, 𝐶⟩ model and the testing agenda this pilot executes, is developed in [4]. 8 Conclusion. We turned the software-delegation-contract framework into a measurement instrument and ran it. In 64 controlled coding-ag...
- [7] OpenAI Codex Documentation. “Codex web”. https://developers.openai.com/codex/cloud. Accessed June 12, 2026.
- [8] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” arXiv:2310.06770, 2023.
- [9] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering”. arXiv:2405.15793, 2024.
- [10] Vincent Schmalbach. “Software Delegates: Delegation Contracts for AI Coding Agents”. Working paper, 2026. https://www.vincentschmalbach.com/
- [11] Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. “AIDev: Studying AI Coding Agents on GitHub”. arXiv:2602.09185, 2026.
- [12] Raja Parasuraman, Thomas B. Sheridan, and Christopher D. Wickens. “A model for types and levels of human interaction with automation”. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 30(3):286–297, 2000.
- [13] Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, Daniel S. Weld, and Walter S. Lasecki. “Updates in Human-AI Teams: Understanding and Addressing the Performance/Compatibility Tradeoff”. AAAI, 2019.
- [14] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”. arXiv:2306.05685, 2023.
- [15] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. “Agentless: Demystifying LLM-based Software Engineering Agents”. arXiv:2407.01489, 2024.
- [16] Filippo Santoni de Sio and Jeroen van den Hoven. “Meaningful Human Control over Autonomous Systems: A Philosophical Account”. Frontiers in Robotics and AI, 2018.
- [17] Ningzhi Tang, Meng Chen, Zheng Ning, Aakash Bansal, Yu Huang, Collin McMillan, and Toby Jia-Jun Li. “A Study on Developer Behaviors for Validating and Repairing LLM-Generated Code Using Eye Tracking and IDE Actions”. arXiv:2405.16081, 2024.
- [18] GitHub Docs. “About GitHub Copilot cloud agent”. https://docs.github.com/en/copilot/concepts/agents/cloud-agent/about-cloud-agent. Accessed June 12, 2026.
- [19] Claude Code Documentation. “Claude Code overview”. https://code.claude.com/docs/en/overview. Accessed June 12, 2026.
- [20] Syed Mohammad Kashif, Peng Liang, and Amjed Tahir. “On Developers’ Self-Declaration of AI-Generated Code: An Analysis of Practices”. arXiv:2504.16485, 2025.
- [21] Sivasurya Santhanam, Tobias Hecking, Andreas Schreiber, and Stefan Wagner. “Bots in software engineering: a systematic mapping study”. PeerJ Computer Science, 8:e866, 2022.
- [22] Martin Monperrus. “Automatic Software Repair: A Bibliography”. arXiv:1807.00515, 2018.