Architectural Design Decisions in AI Agent Harnesses
Pith reviewed 2026-05-10 04:46 UTC · model grok-4.3
The pith
Analysis of 70 AI agent projects identifies five recurring design dimensions and five common architectural patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that agent harnesses are structured along five design dimensions (subagent architecture, context management, tool systems, safety mechanisms, and orchestration) and that cross-project analysis of 70 systems reveals consistent co-occurrences and five recurring architectural patterns: lightweight tools, balanced CLI frameworks, multi-agent orchestrators, enterprise systems, and scenario-verticalized projects. It also shows that the corpus favors file-persistent, hybrid, and hierarchical context strategies; that registry-oriented tool systems dominate while MCP- and plugin-oriented extensions are emerging; and that intermediate isolation is common while high-assurance audit features remain rare.
What carries the argument
The five design dimensions of subagent architecture, context management, tool systems, safety mechanisms, and orchestration, together with their observed co-occurrence patterns across projects, which structure the derivation of the five architectural patterns.
Load-bearing premise
Reading source code and technical materials from 70 publicly available projects is sufficient to capture the full range of architectural design decisions without major selection bias or missing runtime behaviors.
What would settle it
Identification of a widely adopted AI agent system whose architecture uses design choices or patterns that fall outside the five dimensions and five patterns would challenge the claim of recurring regularities.
Original abstract
AI agent systems increasingly rely on reusable non-LLM engineering infrastructure that packages tool mediation, context handling, delegation, safety control, and orchestration. Yet the architectural design decisions in this surrounding infrastructure remain understudied. This paper presents a protocol-guided, source-grounded empirical study of 70 publicly available agent-system projects, addressing three questions: which design-decision dimensions recur across projects, which co-occurrences structure those decisions, and which typical architectural patterns emerge. Methodologically, we contribute a transparent investigation procedure for analyzing heterogeneous agent-system corpora through source-code and technical-material reading. Empirically, we identify five recurring design dimensions (subagent architecture, context management, tool systems, safety mechanisms, and orchestration) and find that the corpus favors file-persistent, hybrid, and hierarchical context strategies; registry-oriented tool systems remain dominant while MCP- and plugin-oriented extensions are emerging; and intermediate isolation is common but high-assurance audit is rare. Cross-project co-occurrence analysis reveals that deeper coordination pairs with more explicit context services, stronger execution environments with more structured governance, and formalized tool-registration boundaries with broader ecosystem ambitions. We synthesize five recurring architectural patterns spanning lightweight tools, balanced CLI frameworks, multi-agent orchestrators, enterprise systems, and scenario-verticalized projects. The result provides an evidence-based account of architectural regularities in agent-system engineering, with grounded guidance for framework designers, selectors, and researchers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a protocol-guided empirical study of architectural design decisions across 70 publicly available AI agent harness projects. It identifies five recurring design dimensions (subagent architecture, context management, tool systems, safety mechanisms, and orchestration), analyzes their co-occurrences, and synthesizes five typical architectural patterns ranging from lightweight tools to enterprise systems, while contributing a transparent source-code and technical-material reading procedure.
Significance. If the methodological details and empirical claims hold, the work provides a valuable evidence-based account of architectural regularities in agent-system engineering. The transparent investigation procedure, cross-project co-occurrence analysis, and synthesis of patterns across project types represent clear strengths that can offer grounded guidance to framework designers, selectors, and researchers in the rapidly evolving AI agent field.
major comments (3)
- [Methods] Methods section: The manuscript does not document explicit project selection criteria, inclusion/exclusion rules, sampling frame, or inter-rater reliability for the source-code and technical-material analysis of the 70 projects. This is load-bearing for the central claims, as it prevents verification that the identified dimensions and patterns are not artifacts of selection bias toward publicly visible GitHub projects.
- [Results] Results (co-occurrence analysis): The reported co-occurrences (e.g., deeper coordination pairing with explicit context services, stronger execution environments with structured governance) lack exact quantitative metrics, statistical measures, or validation details, undermining assessment of their robustness and generalizability beyond the inspected corpus.
- [Patterns synthesis] Patterns synthesis section: The five architectural patterns are synthesized from qualitative interpretation of the corpus; without explicit mapping tables or traceable links from the 70 projects' data to each pattern, the claim that these represent recurring regularities remains under-supported.
minor comments (2)
- [Abstract/Introduction] The abstract and introduction could more explicitly state the total project count and high-level inclusion criteria upfront to improve readability and impact.
- [Introduction] Terminology such as 'MCP- and plugin-oriented extensions' and 'file-persistent, hybrid, and hierarchical context strategies' would benefit from a brief parenthetical definition or reference on first use for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving methodological transparency and the evidential support for our empirical claims. We address each major comment below and commit to revisions that strengthen the paper without altering its core contributions.
Point-by-point responses
-
Referee: [Methods] Methods section: The manuscript does not document explicit project selection criteria, inclusion/exclusion rules, sampling frame, or inter-rater reliability for the source-code and technical-material analysis of the 70 projects. This is load-bearing for the central claims, as it prevents verification that the identified dimensions and patterns are not artifacts of selection bias toward publicly visible GitHub projects.
Authors: We agree that the Methods section requires greater explicitness to support verification. In the revised manuscript we will expand it with: (1) the precise search strategy, keywords, and repositories used to identify candidates; (2) inclusion criteria (publicly available agent harness projects with documented tool/context/orchestration components, active as of our 2024 data cutoff) and exclusion criteria (pure model wrappers, non-functional demos, or projects lacking sufficient technical material); (3) the sampling frame (top projects by a composite of GitHub stars, forks, and recent commits, yielding the final 70); and (4) the protocol for source-code and technical-material reading, including iterative coding rounds and consensus discussions among the authors to resolve ambiguities. We did not compute formal inter-rater reliability statistics because the analysis was performed by a small expert team with overlapping domain knowledge; we will instead describe the bias-mitigation steps taken. These additions will allow readers to evaluate selection bias directly. revision: yes
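The sampling frame described above (ranking candidates by a composite of GitHub stars, forks, and recent commits, then keeping the top 70) can be sketched as follows. The rebuttal names the signals but not the formula, so the log scaling, the 0.5/0.3/0.2 weights, and the 90-day commit window are illustrative assumptions, not the paper's actual method.

```python
import math
from dataclasses import dataclass

@dataclass
class Repo:
    name: str
    stars: int
    forks: int
    recent_commits: int  # e.g. commits in a trailing 90-day window (assumed)

def composite_score(repo: Repo, weights=(0.5, 0.3, 0.2)) -> float:
    # Log-scale each signal so one very large star count does not
    # dominate the ranking; the weights here are hypothetical.
    signals = (repo.stars, repo.forks, repo.recent_commits)
    return sum(w * math.log1p(s) for w, s in zip(weights, signals))

def select_corpus(candidates: list[Repo], n: int = 70) -> list[Repo]:
    # Keep the top-n candidates by composite score (the paper keeps 70).
    return sorted(candidates, key=composite_score, reverse=True)[:n]
```

Publishing the scoring function alongside the candidate list would let readers re-derive the 70-project corpus and probe how sensitive it is to the chosen weights.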
-
Referee: [Results] Results (co-occurrence analysis): The reported co-occurrences (e.g., deeper coordination pairing with explicit context services, stronger execution environments with structured governance) lack exact quantitative metrics, statistical measures, or validation details, undermining assessment of their robustness and generalizability beyond the inspected corpus.
Authors: We accept that the co-occurrence claims would be more robust with quantitative backing. The revision will add exact counts and percentages for each highlighted co-occurrence (e.g., “Of the 45 projects exhibiting deeper coordination, 32 (71 %) also used explicit context services”). Where cell counts permit, we will include simple association measures such as support/confidence from association-rule mining or chi-square tests, accompanied by a clear statement of the exploratory character of the analysis and its limited generalizability beyond the 70-project corpus. This will enable readers to assess the strength of the reported regularities. revision: yes
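The support/confidence metrics promised in this response can be computed directly from boolean design-decision flags. A minimal sketch, assuming the corpus is encoded as one dict of flags per project; the flag names and the toy corpus (which mirrors the 32-of-45 example quoted above) are illustrative, not the paper's data:

```python
def support_confidence(projects, antecedent, consequent):
    """Association-rule style metrics over boolean design-decision flags.

    support    = P(antecedent AND consequent)
    confidence = P(consequent | antecedent)
    """
    n = len(projects)
    with_antecedent = [p for p in projects if p[antecedent]]
    with_both = [p for p in with_antecedent if p[consequent]]
    support = len(with_both) / n
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical 70-project corpus matching the rebuttal's example:
# 45 projects with deep coordination, 32 of which also use explicit
# context services.
corpus = (
    [{"deep_coordination": True,  "explicit_context": True}]  * 32
    + [{"deep_coordination": True,  "explicit_context": False}] * 13
    + [{"deep_coordination": False, "explicit_context": False}] * 25
)
sup, conf = support_confidence(corpus, "deep_coordination", "explicit_context")
# support = 32/70 ≈ 0.457, confidence = 32/45 ≈ 0.711
```

Reporting both numbers matters: confidence alone (71%) can look strong even when the antecedent is rare, which support exposes.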
-
Referee: [Patterns synthesis] Patterns synthesis section: The five architectural patterns are synthesized from qualitative interpretation of the corpus; without explicit mapping tables or traceable links from the 70 projects' data to each pattern, the claim that these represent recurring regularities remains under-supported.
Authors: The five patterns emerged from qualitative clustering of the observed design-dimension combinations. To make the synthesis traceable, the revised paper will include (as an appendix) an explicit mapping table. For each pattern we will list 3–5 representative projects drawn from the 70, together with the concrete data points (specific choices on subagent architecture, context strategy, tool registration, safety level, and orchestration style) that align with the pattern description. This will provide verifiable links between the raw corpus observations and the synthesized patterns. revision: yes
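The appendix mapping table the authors promise could be represented as typed rows like the sketch below. The row schema follows the five dimensions named in the response, but the project names and dimension values shown are placeholders, not the paper's actual appendix data.

```python
from typing import TypedDict

class PatternRow(TypedDict):
    pattern: str
    representative_projects: list[str]  # 3-5 per pattern, per the rebuttal
    subagent_architecture: str
    context_strategy: str
    tool_registration: str
    safety_level: str
    orchestration_style: str

# Illustrative row only: every value below is an assumption made for
# the sketch, not a claim about how the paper classifies these projects.
example: PatternRow = {
    "pattern": "multi-agent orchestrators",
    "representative_projects": ["MetaGPT", "CrewAI", "AgentScope"],
    "subagent_architecture": "hierarchical delegation",
    "context_strategy": "hybrid",
    "tool_registration": "registry-oriented",
    "safety_level": "intermediate isolation",
    "orchestration_style": "role-based pipeline",
}
```

A table of such rows, one per pattern, would give exactly the traceable project-to-pattern links the referee asks for.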
Circularity Check
No circularity: purely observational empirical survey
full rationale
The paper performs a protocol-guided reading of source code and technical materials from 70 publicly available projects to identify recurring design dimensions and architectural patterns. No mathematical derivations, fitted parameters, predictions, or self-citations are present in the described method or claims. The five dimensions and five patterns are extracted directly from the inspected corpus without any reduction to prior self-referential results or definitional loops. This is a standard descriptive study whose validity rests on sampling and observation rather than any internal derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 70 selected publicly available agent-system projects form a representative sample of architectural decisions in the field.
Reference graph
Works this paper leans on
-
[1]
A Survey on Large Language Model based Autonomous Agents
Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Ji-Rong. A Survey on Large Language Model based Autonomous Agents. 2024
2024
-
[2]
AgentBench: Evaluating LLMs as Agents
Liu, Xiao and Xu, Hanchi and Wu, Lei and others. AgentBench: Evaluating LLMs as Agents. 2023
2023
-
[3]
AgentScope
ModelScope / Tongyi Lab. AgentScope. 2024. https://github.com/modelscope/agentscope
2024
-
[4]
AutoGPT
Significant-Gravitas. AutoGPT. 2023. https://github.com/Significant-Gravitas/AutoGPT
2023
-
[5]
ChatDev: Communicative Agents for Software Development
Qian, Chen and Cong, Xin and Liu, Wei and Yang, Cheng and Chen, Weize and Su, Yusheng and Dang, Yufan and Li, Jiahao and Xu, Juyuan and Li, Dahai and Liu, Zhiyuan and Sun, Maosong. ChatDev: Communicative Agents for Software Development. 2023
2023
-
[6]
Claude Code
Anthropic. Claude Code. 2024. Closed-source commercial coding agent assistant by Anthropic, not an academic publication.
2024
-
[7]
Documenting Software Architectures: Views and Beyond
Clements, Paul C. and Bachmann, Felix and Bass, Len and Garlan, David and Ivers, James and Little, Reed and Merson, Paulo and Nord, Robert and Stafford, Judith A.. Documenting Software Architectures: Views and Beyond. 2010
2010
-
[8]
CrewAI
Joao Moura. CrewAI. 2023. https://github.com/joaomdmoura/crewai
2023
-
[9]
docker-agent
Docker. docker-agent. 2024. https://github.com/docker/docker-agent
2024
-
[10]
Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents
Gan, Yuyou and Yang, Yong and Ma, Zhe and He, Ping and Zeng, Rui and Wang, Yiming and Li, Qingming and Zhou, Chunyi and Li, Songze and Wang, Ting and Gao, Yunjun and Wu, Yingcai and Ji, Shouling. Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents. 2024
2024
-
[11]
Generative Agents: Interactive Simulacra of Human Behavior
Park, Joon Sung and O’Brien, Jacob C. and Cai, Cameron J. and Morris, Meredith Ring and Liang, Percy and Bernstein, Michael S.. Generative Agents: Interactive Simulacra of Human Behavior. 2023
2023
-
[12]
Software Architecture as a Set of Architectural Design Decisions
Jansen, Anton and Bosch, Jan. Software Architecture as a Set of Architectural Design Decisions. 2006
2006
-
[13]
LangChain
LangChain Team. LangChain. 2023. https://github.com/langchain-ai/langchain
2023
-
[14]
LangGraph
LangChain Team. LangGraph. 2024. https://github.com/langchain-ai/langgraph
2024
-
[15]
LlamaIndex
LlamaIndex. LlamaIndex. 2024. https://www.llamaindex.ai/llamaindex
2024
-
[16]
Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers
Hou, Xinyi and Zhao, Yanjie and Wang, Shenao and Wang, Haoyu. Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers. 2025
2025
-
[17]
MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits
Radosevich, Brandon and Halloran, John. MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits. 2025
2025
-
[18]
MetaGPT: Meta Programming for Multi-Agent Collaborative Framework
Hong, Sirui and Zhuge, Mingchen and Chen, Jiaqi and Zheng, Xiawu and Cheng, Yuheng and Zhang, Ceyao and Wang, Jinlin and Wang, Zili and Yau, Steven Ka Shing and Lin, Zijuan and Zhou, Liyang and Ran, Chenyu and Xiao, Lingfeng and Wu, Chenglin and Schmidhuber, Jurgen. MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. 2023
2023
-
[19]
OpenClaw: Your Own Personal AI Assistant
OpenClaw Team. OpenClaw: Your Own Personal AI Assistant. 2024. https://github.com/openclaw/openclaw. A self-hosted personal AI assistant platform supporting multi-channel messaging (WhatsApp, Telegram, Slack, Discord, etc.), browser automation, and skill-based extensibility.
2024
-
[20]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Wang, Xingyao and others. OpenHands: An Open Platform for AI Software Developers as Generalist Agents. 2024. arXiv preprint arXiv:2407.16741
2024
-
[21]
Plan, Props, and Models of Large Language Models: A Survey
Hao, Shibo and Gu, Yao and Cao, Haodi and others. Plan, Props, and Models of Large Language Models: A Survey. 2023
2023
-
[22]
ReAct: Synergizing Reasoning and Acting in Language Models
Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan. ReAct: Synergizing Reasoning and Acting in Language Models. 2022
2022
-
[23]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Yang, John and Jimenez, Carlos E. and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik R. and Press, Ofir. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. 2024
2024
-
[24]
SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez, Carlos and Yang, John and Wettig, Alexander and others. SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?. 2023
2023
-
[25]
tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Yao, Shunyu and Shinn, Noah and Razavi, Pedram and Narasimhan, Karthik. tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. 2024
2024
-
[26]
Past and Future of Software Architectural Decisions: A Systematic Mapping Study
Tofan, Dan and Galster, Matthias and Avgeriou, Paris and Schuitema, Wes. Past and Future of Software Architectural Decisions: A Systematic Mapping Study. 2014
2014
-
[27]
Toolformer: Language Models Can Teach Themselves to Use Tools
Schick, Timo and Dwivedi-Yu, Jane and Dessì, Roberto and Raileanu, Roberta and Lomeli, Maria and Zettlemoyer, Luke and Cancedda, Nicola and Scialom, Thomas. Toolformer: Language Models Can Teach Themselves to Use Tools. 2023
2023
-
[28]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Zhou, Shuyan and Xu, Frank F. and others. WebArena: A Realistic Web Environment for Building Autonomous Agents. 2023
2023