
arxiv: 2606.22484 · v1 · pith:QI5HLTKSnew · submitted 2026-06-21 · 💻 cs.HC

Governed AI-Assisted Engineering: Graduated Human Oversight for Agentic Code Generation in Regulated Domains

Pith reviewed 2026-06-26 10:10 UTC · model grok-4.3

classification 💻 cs.HC
keywords GAIE framework · agentic code generation · human oversight · regulatory compliance · Oversight Classification Model · AI governance · regulated domains

The pith

The GAIE framework routes agentic code tasks into three oversight tiers to preserve most productivity while supplying compliance evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Governed AI-Assisted Engineering framework as a way to govern autonomous AI agents that generate code in regulated industries. It defines the Oversight Classification Model as a deterministic function that assigns tasks to human-in-the-loop, human-over-the-loop, or automated-with-monitoring tiers according to regulatory impact, customer proximity, reversibility, and data sensitivity. Each tier specifies the evidence artifacts needed for audits. Mapping to standards such as the Bank of Thailand policy, MAS, NIST AI RMF, ISO/IEC 42001, and the EU AI Act shows the model applies across jurisdictions. Analytical modeling indicates the approach retains 84 to 97 percent of agentic coding velocity with a central estimate of 91 percent.

Core claim

The GAIE framework contributes a three-tier graduated human oversight model for agentic code generation that bridges AI-assisted development maturity with regulatory governance through proportionate human oversight. The Oversight Classification Model classifies code generation tasks by regulatory impact, customer proximity, reversibility, and data sensitivity to route them through human-in-the-loop for strategic functions, human-over-the-loop for customer-impacting functions, or automated-with-monitoring for internal functions, each with required evidence artifacts for compliance auditability. Evaluation through regulatory coverage analysis, comparative framework analysis, and analytical pro

What carries the argument

The Oversight Classification Model (OCM), a deterministic decision function that classifies code generation tasks by regulatory impact, customer proximity, reversibility, and data sensitivity to assign one of three oversight tiers.
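To make "deterministic decision function" concrete, here is a minimal sketch of what such a classifier could look like. It is not the paper's OCM: the boolean encoding of the four dimensions, the rule order, the choice to send unknown attributes to the strictest tier, and the names `Task`, `Tier`, and `classify` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    HUMAN_IN_THE_LOOP = 1          # strategic functions
    HUMAN_OVER_THE_LOOP = 2        # customer-impacting functions
    AUTOMATED_WITH_MONITORING = 3  # internal functions


@dataclass(frozen=True)
class Task:
    # Hypothetical boolean encoding of the four dimensions; None means "unknown".
    regulatory_impact: Optional[bool]
    customer_facing: Optional[bool]
    reversible: Optional[bool]
    sensitive_data: Optional[bool]


def classify(task: Task) -> Tier:
    """Return an oversight tier; the same input always yields the same tier."""
    fields = (task.regulatory_impact, task.customer_facing,
              task.reversible, task.sensitive_data)
    # Assumed conservative default: any unknown attribute gets the strictest tier.
    if any(value is None for value in fields):
        return Tier.HUMAN_IN_THE_LOOP
    # Assumed rule: regulatory impact, or irreversible changes to sensitive data.
    if task.regulatory_impact or (task.sensitive_data and not task.reversible):
        return Tier.HUMAN_IN_THE_LOOP
    # Assumed rule: any single remaining risk signal needs a human over the loop.
    if task.customer_facing or task.sensitive_data or not task.reversible:
        return Tier.HUMAN_OVER_THE_LOOP
    return Tier.AUTOMATED_WITH_MONITORING


if __name__ == "__main__":
    internal_refactor = Task(False, False, True, False)
    print(classify(internal_refactor).name)
```

Because the rules are plain conditionals with no randomness or learned weights, the function can be unit tested exhaustively over all input combinations, which is the kind of reproducibility check the referee report asks for.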

Load-bearing premise

The analytical productivity modeling used to derive the 84-97% velocity preservation range accurately reflects real-world outcomes of the proposed tiers without reliance on unstated assumptions about task distributions or oversight effectiveness.
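The unstated assumptions matter because any model of this kind is driven almost entirely by two inputs: how tasks are distributed across tiers and how much time each tier's oversight adds. The sketch below shows one possible structure, assuming retained velocity is the reciprocal of one plus the share-weighted oversight overhead. The formula, the function name `retained_velocity`, and every number are illustrative placeholders, not the paper's model or data, and the output is not meant to reproduce the paper's range.

```python
from typing import Dict


def retained_velocity(task_share: Dict[str, float],
                      overhead: Dict[str, float]) -> float:
    """Fraction of ungoverned agentic velocity kept under tiered oversight.

    task_share: fraction of tasks routed to each tier (must sum to 1).
    overhead:   extra time per task in that tier, as a fraction of the
                ungoverned task time (0.25 means 25% slower).
    """
    if set(task_share) != set(overhead):
        raise ValueError("task_share and overhead must cover the same tiers")
    if abs(sum(task_share.values()) - 1.0) > 1e-9:
        raise ValueError("task shares must sum to 1")
    mean_overhead = sum(task_share[t] * overhead[t] for t in task_share)
    return 1.0 / (1.0 + mean_overhead)


if __name__ == "__main__":
    # Placeholder inputs chosen only to exercise the function.
    shares = {"tier1": 0.2, "tier2": 0.3, "tier3": 0.5}
    costs = {"tier1": 1.0, "tier2": 0.25, "tier3": 0.0}
    graduated = retained_velocity(shares, costs)
    # Uniform oversight: every task pays the tier-1 cost.
    uniform = retained_velocity({"all": 1.0}, {"all": costs["tier1"]})
    print(f"graduated={graduated:.3f} uniform={uniform:.3f}")
```

Shifting even a modest share of tasks from the cheapest tier to the most expensive one moves the result noticeably, which is why the task distribution and per-tier costs need to be reported before the headline percentage can be assessed.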

What would settle it

A controlled deployment in a regulated organization that measures actual code-generation velocity and audit pass rates under each GAIE tier against baselines of full automation and full human oversight.

Figures

Figures reproduced from arXiv: 2606.22484 by Richard Kang.

Figure 1: Evolution of AI-Assisted Software Development
Figure 2: GAIE Framework Architecture Overview
Figure 3: OCM Decision Tree
Figure 4: Three-Tier Oversight Model: Sequence Diagrams
Figure 5: Evidence Chain Integrity Model
  Diagram labels extracted with this caption: Initial conservative classification; Downward: N clean deploys + approval (shown twice); Anomaly or scope expansion; Incident or regulatory change; Regulatory override, immediate (shown twice); Tier 1, Tier 2, Tier 3.
  Downward conditions: N at least 20 clean deploys; zero anomalies; rejection rate below 5 percent; second-line compliance approval.
  Upwar…
Figure 6: Tier Reclassification Lifecycle
  IV. REGULATORY MAPPING
  A. Bank of Thailand AI Risk Management Policy (2025)
  We provide a traceability mapping from GAIE components to applicable BOT requirements (Table V). Traceability assessment: Based on the authors’ reading of the publicly available BOT circular, GAIE’s design addresses 9 of 10 applicable control domains. This represents academic analysis of published re…
Figure 7: GAIE Threat Model
  TABLE V. BOT POLICY → GAIE TRACEABILITY MAPPING
  BOT Requirement | GAIE Component | Evidence
  Human participation (§4.3 P4) | OCM → Tier 1 + HITL | Classification + approval
  Lifecycle governance (§4.4) | Three-tier model | Per-phase artifacts
  FEAT principles (§4.3 P1–4) | OCM + evidence model | Audit trail
  Data boundaries (§4.4 Pt.2(1)) | Data boundary enforcement | Filtering logs
  Testing (§4.4 Pt.2(2)) | Gen-val…
Figure 8: Cross-Jurisdiction Regulatory Mapping
  TABLE VI. CROSS-JURISDICTION TRACEABILITY
  Framework | Key Requirement | GAIE Mapping | Gap
  MAS FEAT | FEAT principles | OCM transparency + human gates | Dev lifecycle
  NIST AI RMF | Govern, Map, Measure, Manage | Four functions mapped | Full traceability
  ISO/IEC 42001 | Risk-based AI mgmt system | OCM = risk-based approach | Clause 4 separate
  EU AI Act | Human oversight (Art. 14) | Tier 1 for high-…
Figure 9: Reference Implementation Architecture
  Diagram labels extracted with this caption: Phase 1: Instrument; Phase 2: Classify (shadow mode); Phase 3: Enforce Tier 3; Phase 4: Enforce Tier 2; Phase 5: Enforce Tier 1; Phase 6: Optimize.
Figure 10: Phased Adoption Sequence
Figure 11: Productivity Impact: GAIE vs. Uniform Oversight
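The downward-reclassification conditions that appear among the extracted figure labels (at least 20 clean deploys, zero anomalies, rejection rate below 5 percent, second-line compliance approval) translate naturally into a single eligibility predicate. The following is a sketch under assumptions: the record type `DeployHistory`, its field names, the definition of rejection rate as rejections divided by reviews, and the treatment of zero reviews as ineligible are all hypothetical choices, not details taken from the paper.

```python
from dataclasses import dataclass

MIN_CLEAN_DEPLOYS = 20      # "N at least 20 clean deploys"
MAX_REJECTION_RATE = 0.05   # "Rejection rate below 5 percent" (strictly below)


@dataclass(frozen=True)
class DeployHistory:
    # Hypothetical per-task-class record since the last classification.
    clean_deploys: int
    anomalies: int
    reviews: int
    rejections: int
    second_line_approved: bool


def eligible_for_downward_move(history: DeployHistory) -> bool:
    """True only when every listed downward condition holds at once."""
    if history.reviews <= 0:
        # Assumption: with no reviews there is no rejection rate to rely on.
        return False
    rejection_rate = history.rejections / history.reviews
    return (
        history.clean_deploys >= MIN_CLEAN_DEPLOYS
        and history.anomalies == 0
        and rejection_rate < MAX_REJECTION_RATE
        and history.second_line_approved
    )


if __name__ == "__main__":
    record = DeployHistory(clean_deploys=24, anomalies=0, reviews=40,
                           rejections=1, second_line_approved=True)
    print(eligible_for_downward_move(record))
```

The conditions are combined with a logical AND so that compliance approval alone cannot relax oversight without the supporting deployment record, and a clean record alone cannot do so without approval.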
Original abstract

The adoption of agentic AI coding systems, where autonomous agents generate, review, test, and deploy code with minimal human intervention, creates a governance challenge in regulated industries. Existing frameworks address AI-assisted development maturity or the productivity-reliability tension but offer no mechanism for calibrating human oversight intensity to regulatory impact. We present the Governed AI-Assisted Engineering (GAIE) framework, a three-tier graduated human oversight model for agentic code generation in regulated domains. GAIE introduces the Oversight Classification Model (OCM), a deterministic decision function that classifies code generation tasks by regulatory impact, customer proximity, reversibility, and data sensitivity to route them through one of three oversight tiers: human-in-the-loop (strategic functions), human-over-the-loop (customer-impacting), or automated-with-monitoring (internal). Each tier defines required evidence artifacts for compliance auditability. We map GAIE against the Bank of Thailand's 2025 AI risk-management policy and demonstrate cross-jurisdiction applicability to MAS (Singapore), NIST AI RMF, ISO/IEC 42001, and the EU AI Act. Evaluation through regulatory coverage analysis, comparative framework analysis, and analytical productivity modeling suggests that graduated oversight preserves 84-97% of agentic coding velocity (central estimate: 91%) while maintaining compliance evidence coverage for regulated functions. GAIE contributes a framework that explicitly bridges AI-assisted development maturity with regulatory governance through proportionate human oversight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the Governed AI-Assisted Engineering (GAIE) framework, a three-tier graduated human oversight model for agentic code generation in regulated domains. It introduces the Oversight Classification Model (OCM) as a deterministic decision function classifying tasks by regulatory impact, customer proximity, reversibility, and data sensitivity to route them to human-in-the-loop (strategic), human-over-the-loop (customer-impacting), or automated-with-monitoring (internal) tiers, each with defined compliance evidence artifacts. The framework is mapped to the Bank of Thailand 2025 AI policy and shown applicable to MAS, NIST AI RMF, ISO/IEC 42001, and the EU AI Act. Evaluation via regulatory coverage analysis, comparative framework analysis, and analytical productivity modeling claims that the approach preserves 84-97% of agentic coding velocity (central estimate 91%) while maintaining compliance evidence coverage.

Significance. If the velocity preservation result holds under transparent validation, GAIE would supply a missing bridge between AI-assisted development maturity models and regulatory governance requirements, offering a proportionate oversight mechanism that could guide adoption in finance and other regulated sectors. The explicit tier definitions and cross-jurisdiction mapping constitute a useful conceptual contribution even if the numeric estimate requires further substantiation.

major comments (2)
  1. [Evaluation section (analytical productivity modeling)] The central quantitative claim (84-97% velocity preservation, central estimate 91%) rests on 'analytical productivity modeling' whose methods, inputs, task-type distributions, per-tier time costs, oversight-effectiveness parameters, assumptions, and validation are not described anywhere in the manuscript. This modeling is load-bearing for the practicality argument and cannot be assessed for circularity or external validity.
  2. [Framework description (OCM definition)] The Oversight Classification Model (OCM) is presented as a deterministic decision function, yet no explicit rules, thresholds, decision tree, pseudocode, or worked examples are supplied, preventing evaluation of its reproducibility or edge-case behavior.
minor comments (2)
  1. The abstract and evaluation section reference 'comparative framework analysis' without naming the comparator frameworks or the evaluation criteria employed.
  2. A summary table listing the three tiers, their triggering conditions, required evidence artifacts, and example use cases would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments, which highlight areas where the manuscript requires greater transparency to support its claims. We address each major comment below and commit to revisions that directly resolve the identified gaps.

Point-by-point responses
  1. Referee: [Evaluation section (analytical productivity modeling)] The central quantitative claim (84-97% velocity preservation, central estimate 91%) rests on 'analytical productivity modeling' whose methods, inputs, task-type distributions, per-tier time costs, oversight-effectiveness parameters, assumptions, and validation are not described anywhere in the manuscript. This modeling is load-bearing for the practicality argument and cannot be assessed for circularity or external validity.

    Authors: We agree that the analytical productivity modeling section lacks the necessary methodological detail. The current manuscript states the velocity preservation range and central estimate but does not specify the underlying model structure, input parameters, task distributions, time-cost assumptions, or validation approach. This omission prevents independent assessment. In the revised manuscript we will expand the Evaluation section with a complete description of the modeling method, including explicit equations or pseudocode for the productivity calculation, the assumed task-type distribution, per-tier overhead factors, oversight-effectiveness parameters, all modeling assumptions, and a discussion of limitations and sensitivity analysis. revision: yes

  2. Referee: [Framework description (OCM definition)] The Oversight Classification Model (OCM) is presented as a deterministic decision function, yet no explicit rules, thresholds, decision tree, pseudocode, or worked examples are supplied, preventing evaluation of its reproducibility or edge-case behavior.

    Authors: We concur that the OCM description is insufficiently operationalized. While the manuscript identifies the four classification dimensions and maps them to the three oversight tiers, it does not provide the decision rules, thresholds, or logic that implement the deterministic function. This limits reproducibility and edge-case analysis. In revision we will add an explicit formalization of the OCM, including a decision tree or pseudocode representation, threshold values where applicable, and at least three worked examples covering typical, boundary, and edge-case tasks to demonstrate classification behavior. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper defines the GAIE framework and OCM as a deterministic classification based on explicit criteria (regulatory impact, customer proximity, reversibility, data sensitivity) and maps it to external policies (Bank of Thailand, MAS, NIST, ISO, EU AI Act). The 84-97% velocity preservation is attributed to 'analytical productivity modeling' in the abstract, but the provided text contains no equations, parameter definitions, task distributions, or derivation steps for that modeling. Without a quotable reduction showing the numeric result is forced by the framework's own tier definitions or self-citation, no circular step matching the enumerated patterns can be exhibited. The central contribution remains a conceptual mapping whose validity does not depend on the unspecified model.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

Only the abstract is available, so the ledger reflects explicitly introduced elements. The OCM and velocity modeling are new constructs without independent evidence or external grounding mentioned.

free parameters (1)
  • central velocity estimate = 91%
    The 91% central estimate for agentic coding velocity preservation is presented as output from analytical modeling but with no derivation details or data sources.
invented entities (1)
  • Oversight Classification Model (OCM) · no independent evidence
    purpose: Deterministic decision function to classify tasks into oversight tiers
    New model introduced by the paper with no external validation or falsifiable handle provided.

pith-pipeline@v0.9.1-grok · 5784 in / 1394 out tokens · 21844 ms · 2026-06-26T10:10:10.608675+00:00 · methodology

