Architectural Design Decisions in AI Agent Harnesses
Pith reviewed 2026-05-10 04:46 UTC · model grok-4.3
The pith
Analysis of 70 AI agent projects identifies five recurring design dimensions and five common architectural patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that agent harnesses are structured along five design dimensions (subagent architecture, context management, tool systems, safety mechanisms, and orchestration) and that cross-project analysis of 70 systems reveals consistent co-occurrences and five recurring architectural patterns: lightweight tools, balanced CLI frameworks, multi-agent orchestrators, enterprise systems, and scenario-verticalized projects. It also shows that the corpus favors file-persistent, hybrid, and hierarchical context strategies; that registry-oriented tool systems dominate while MCP- and plugin-oriented extensions are emerging; and that intermediate isolation is common while high-assurance audit features remain rare.
What carries the argument
The five design dimensions of subagent architecture, context management, tool systems, safety mechanisms, and orchestration, together with their observed co-occurrence patterns across projects, which structure the derivation of the five architectural patterns.
Load-bearing premise
Reading source code and technical materials from 70 publicly available projects is sufficient to capture the full range of architectural design decisions without major selection bias or missing runtime behaviors.
What would settle it
Identification of a widely adopted AI agent system whose architecture uses design choices or patterns that fall outside the five dimensions and five patterns would challenge the claim of recurring regularities.
Original abstract
AI agent systems increasingly rely on reusable non-LLM engineering infrastructure that packages tool mediation, context handling, delegation, safety control, and orchestration. Yet the architectural design decisions in this surrounding infrastructure remain understudied. This paper presents a protocol-guided, source-grounded empirical study of 70 publicly available agent-system projects, addressing three questions: which design-decision dimensions recur across projects, which co-occurrences structure those decisions, and which typical architectural patterns emerge. Methodologically, we contribute a transparent investigation procedure for analyzing heterogeneous agent-system corpora through source-code and technical-material reading. Empirically, we identify five recurring design dimensions (subagent architecture, context management, tool systems, safety mechanisms, and orchestration) and find that the corpus favors file-persistent, hybrid, and hierarchical context strategies; registry-oriented tool systems remain dominant while MCP- and plugin-oriented extensions are emerging; and intermediate isolation is common but high-assurance audit is rare. Cross-project co-occurrence analysis reveals that deeper coordination pairs with more explicit context services, stronger execution environments with more structured governance, and formalized tool-registration boundaries with broader ecosystem ambitions. We synthesize five recurring architectural patterns spanning lightweight tools, balanced CLI frameworks, multi-agent orchestrators, enterprise systems, and scenario-verticalized projects. The result provides an evidence-based account of architectural regularities in agent-system engineering, with grounded guidance for framework designers, selectors, and researchers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a protocol-guided empirical study of architectural design decisions across 70 publicly available AI agent harness projects. It identifies five recurring design dimensions (subagent architecture, context management, tool systems, safety mechanisms, and orchestration), analyzes their co-occurrences, and synthesizes five typical architectural patterns ranging from lightweight tools to enterprise systems, while contributing a transparent source-code and technical-material reading procedure.
Significance. If the methodological details and empirical claims hold, the work provides a valuable evidence-based account of architectural regularities in agent-system engineering. The transparent investigation procedure, cross-project co-occurrence analysis, and synthesis of patterns across project types represent clear strengths that can offer grounded guidance to framework designers, selectors, and researchers in the rapidly evolving AI agent field.
major comments (3)
- [Methods] Methods section: The manuscript does not document explicit project selection criteria, inclusion/exclusion rules, sampling frame, or inter-rater reliability for the source-code and technical-material analysis of the 70 projects. This is load-bearing for the central claims, as it prevents verification that the identified dimensions and patterns are not artifacts of selection bias toward publicly visible GitHub projects.
- [Results] Results (co-occurrence analysis): The reported co-occurrences (e.g., deeper coordination pairing with explicit context services, stronger execution environments with structured governance) lack exact quantitative metrics, statistical measures, or validation details, undermining assessment of their robustness and generalizability beyond the inspected corpus.
- [Patterns synthesis] Patterns synthesis section: The five architectural patterns are synthesized from qualitative interpretation of the corpus; without explicit mapping tables or traceable links from the 70 projects' data to each pattern, the claim that these represent recurring regularities remains under-supported.
minor comments (2)
- [Abstract/Introduction] The abstract and introduction could more explicitly state the total project count and high-level inclusion criteria upfront to improve readability and impact.
- [Introduction] Terminology such as 'MCP- and plugin-oriented extensions' and 'file-persistent, hybrid, and hierarchical context strategies' would benefit from a brief parenthetical definition or reference on first use for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving methodological transparency and the evidential support for our empirical claims. We address each major comment below and commit to revisions that strengthen the paper without altering its core contributions.
Point-by-point responses
-
Referee: [Methods] Methods section: The manuscript does not document explicit project selection criteria, inclusion/exclusion rules, sampling frame, or inter-rater reliability for the source-code and technical-material analysis of the 70 projects. This is load-bearing for the central claims, as it prevents verification that the identified dimensions and patterns are not artifacts of selection bias toward publicly visible GitHub projects.
Authors: We agree that the Methods section requires greater explicitness to support verification. In the revised manuscript we will expand it with: (1) the precise search strategy, keywords, and repositories used to identify candidates; (2) inclusion criteria (publicly available agent harness projects with documented tool/context/orchestration components, active as of our 2024 data cutoff) and exclusion criteria (pure model wrappers, non-functional demos, or projects lacking sufficient technical material); (3) the sampling frame (top projects by a composite of GitHub stars, forks, and recent commits, yielding the final 70); and (4) the protocol for source-code and technical-material reading, including iterative coding rounds and consensus discussions among the authors to resolve ambiguities. We did not compute formal inter-rater reliability statistics because the analysis was performed by a small expert team with overlapping domain knowledge; we will instead describe the bias-mitigation steps taken. These additions will allow readers to evaluate selection bias directly. revision: yes
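The sampling frame described above (ranking candidates by a composite of GitHub stars, forks, and recent commits, then keeping the top 70) can be sketched as follows. The rebuttal names the signals but not the formula, so the log scaling, the 0.5/0.3/0.2 weights, and the 90-day commit window are illustrative assumptions, not the paper's actual method.

```python
import math
from dataclasses import dataclass

@dataclass
class Repo:
    name: str
    stars: int
    forks: int
    recent_commits: int  # e.g. commits in a trailing 90-day window (assumed)

def composite_score(repo: Repo, weights=(0.5, 0.3, 0.2)) -> float:
    # Log-scale each signal so one very large star count does not
    # dominate the ranking; the weights here are hypothetical.
    signals = (repo.stars, repo.forks, repo.recent_commits)
    return sum(w * math.log1p(s) for w, s in zip(weights, signals))

def select_corpus(candidates: list[Repo], n: int = 70) -> list[Repo]:
    # Keep the top-n candidates by composite score (the paper keeps 70).
    return sorted(candidates, key=composite_score, reverse=True)[:n]
```

Publishing the scoring function alongside the candidate list would let readers re-derive the 70-project corpus and probe how sensitive it is to the chosen weights.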
-
Referee: [Results] Results (co-occurrence analysis): The reported co-occurrences (e.g., deeper coordination pairing with explicit context services, stronger execution environments with structured governance) lack exact quantitative metrics, statistical measures, or validation details, undermining assessment of their robustness and generalizability beyond the inspected corpus.
Authors: We accept that the co-occurrence claims would be more robust with quantitative backing. The revision will add exact counts and percentages for each highlighted co-occurrence (e.g., “Of the 45 projects exhibiting deeper coordination, 32 (71 %) also used explicit context services”). Where cell counts permit, we will include simple association measures such as support/confidence from association-rule mining or chi-square tests, accompanied by a clear statement of the exploratory character of the analysis and its limited generalizability beyond the 70-project corpus. This will enable readers to assess the strength of the reported regularities. revision: yes
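The support/confidence metrics promised in this response can be computed directly from boolean design-decision flags. A minimal sketch, assuming the corpus is encoded as one dict of flags per project; the flag names and the toy corpus (which mirrors the 32-of-45 example quoted above) are illustrative, not the paper's data:

```python
def support_confidence(projects, antecedent, consequent):
    """Association-rule style metrics over boolean design-decision flags.

    support    = P(antecedent AND consequent)
    confidence = P(consequent | antecedent)
    """
    n = len(projects)
    with_antecedent = [p for p in projects if p[antecedent]]
    with_both = [p for p in with_antecedent if p[consequent]]
    support = len(with_both) / n
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical 70-project corpus matching the rebuttal's example:
# 45 projects with deep coordination, 32 of which also use explicit
# context services.
corpus = (
    [{"deep_coordination": True,  "explicit_context": True}]  * 32
    + [{"deep_coordination": True,  "explicit_context": False}] * 13
    + [{"deep_coordination": False, "explicit_context": False}] * 25
)
sup, conf = support_confidence(corpus, "deep_coordination", "explicit_context")
# support = 32/70 ≈ 0.457, confidence = 32/45 ≈ 0.711
```

Reporting both numbers matters: confidence alone (71%) can look strong even when the antecedent is rare, which support exposes.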
-
Referee: [Patterns synthesis] Patterns synthesis section: The five architectural patterns are synthesized from qualitative interpretation of the corpus; without explicit mapping tables or traceable links from the 70 projects' data to each pattern, the claim that these represent recurring regularities remains under-supported.
Authors: The five patterns emerged from qualitative clustering of the observed design-dimension combinations. To make the synthesis traceable, the revised paper will include (as an appendix) an explicit mapping table. For each pattern we will list 3–5 representative projects drawn from the 70, together with the concrete data points (specific choices on subagent architecture, context strategy, tool registration, safety level, and orchestration style) that align with the pattern description. This will provide verifiable links between the raw corpus observations and the synthesized patterns. revision: yes
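The appendix mapping table the authors promise could be represented as typed rows like the sketch below. The row schema follows the five dimensions named in the response, but the project names and dimension values shown are placeholders, not the paper's actual appendix data.

```python
from typing import TypedDict

class PatternRow(TypedDict):
    pattern: str
    representative_projects: list[str]  # 3-5 per pattern, per the rebuttal
    subagent_architecture: str
    context_strategy: str
    tool_registration: str
    safety_level: str
    orchestration_style: str

# Illustrative row only: every value below is an assumption made for
# the sketch, not a claim about how the paper classifies these projects.
example: PatternRow = {
    "pattern": "multi-agent orchestrators",
    "representative_projects": ["MetaGPT", "CrewAI", "AgentScope"],
    "subagent_architecture": "hierarchical delegation",
    "context_strategy": "hybrid",
    "tool_registration": "registry-oriented",
    "safety_level": "intermediate isolation",
    "orchestration_style": "role-based pipeline",
}
```

A table of such rows, one per pattern, would give exactly the traceable project-to-pattern links the referee asks for.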
Circularity Check
No circularity: purely observational empirical survey
full rationale
The paper performs a protocol-guided reading of source code and technical materials from 70 publicly available projects to identify recurring design dimensions and architectural patterns. No mathematical derivations, fitted parameters, predictions, or self-citations are present in the described method or claims. The five dimensions and five patterns are extracted directly from the inspected corpus without any reduction to prior self-referential results or definitional loops. This is a standard descriptive study whose validity rests on sampling and observation rather than any internal derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 70 selected publicly available agent-system projects form a representative sample of architectural decisions in the field.
Reference graph
Works this paper leans on
-
[1]
A Survey on Large Language Model based Autonomous Agents
Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Ji-Rong. A Survey on Large Language Model based Autonomous Agents. 2024
2024
-
[2]
AgentBench: Evaluating LLMs as Agents
Liu, Xiao and Xu, Hanchi and Wu, Lei and others. AgentBench: Evaluating LLMs as Agents. 2023
2023
-
[3]
AgentScope
ModelScope / Tongyi Lab. AgentScope. 2024. https://github.com/modelscope/agentscope
2024
-
[4]
AutoGPT
Significant-Gravitas. AutoGPT. 2023. https://github.com/Significant-Gravitas/AutoGPT
2023
-
[5]
ChatDev: Communicative Agents for Software Development
Qian, Chen and Cong, Xin and Liu, Wei and Yang, Cheng and Chen, Weize and Su, Yusheng and Dang, Yufan and Li, Jiahao and Xu, Juyuan and Li, Dahai and Liu, Zhiyuan and Sun, Maosong. ChatDev: Communicative Agents for Software Development. 2023
2023
-
[6]
Claude Code
Anthropic. Claude Code. 2024. Closed-source commercial coding agent assistant by Anthropic, not an academic publication.
2024
-
[7]
Documenting Software Architectures: Views and Beyond
Clements, Paul C. and Bachmann, Felix and Bass, Len and Garlan, David and Ivers, James and Little, Reed and Merson, Paulo and Nord, Robert and Stafford, Judith A.. Documenting Software Architectures: Views and Beyond. 2010
2010
-
[8]
CrewAI
Joao Moura. CrewAI. 2023. https://github.com/joaomdmoura/crewai
2023
-
[9]
docker-agent
Docker. docker-agent. 2024. https://github.com/docker/docker-agent
2024
-
[10]
Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents
Gan, Yuyou and Yang, Yong and Ma, Zhe and He, Ping and Zeng, Rui and Wang, Yiming and Li, Qingming and Zhou, Chunyi and Li, Songze and Wang, Ting and Gao, Yunjun and Wu, Yingcai and Ji, Shouling. Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents. 2024
2024
-
[11]
Generative Agents: Interactive Simulacra of Human Behavior
Park, Joon Sung and O’Brien, Jacob C. and Cai, Cameron J. and Morris, Meredith Ring and Liang, Percy and Bernstein, Michael S.. Generative Agents: Interactive Simulacra of Human Behavior. 2023
2023
-
[12]
Software Architecture as a Set of Architectural Design Decisions
Jansen, Anton and Bosch, Jan. Software Architecture as a Set of Architectural Design Decisions. 2006
2006
-
[13]
LangChain
LangChain Team. LangChain. 2023. https://github.com/langchain-ai/langchain
2023
-
[14]
LangGraph
LangChain Team. LangGraph. 2024. https://github.com/langchain-ai/langgraph
2024
-
[15]
LlamaIndex
LlamaIndex. LlamaIndex. 2024. https://www.llamaindex.ai/llamaindex
2024
-
[16]
Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers
Hou, Xinyi and Zhao, Yanjie and Wang, Shenao and Wang, Haoyu. Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers. 2025
2025
-
[17]
MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits
Radosevich, Brandon and Halloran, John. MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits. 2025
2025
-
[18]
MetaGPT: Meta Programming for Multi-Agent Collaborative Framework
Hong, Sirui and Zhuge, Mingchen and Chen, Jiaqi and Zheng, Xiawu and Cheng, Yuheng and Zhang, Ceyao and Wang, Jinlin and Wang, Zili and Yau, Steven Ka Shing and Lin, Zijuan and Zhou, Liyang and Ran, Chenyu and Xiao, Lingfeng and Wu, Chenglin and Schmidhuber, Jurgen. MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. 2023
2023
-
[19]
OpenClaw: Your Own Personal AI Assistant
OpenClaw Team. OpenClaw: Your Own Personal AI Assistant. 2024. https://github.com/openclaw/openclaw. A self-hosted personal AI assistant platform supporting multi-channel messaging (WhatsApp, Telegram, Slack, Discord, etc.), browser automation, and skill-based extensibility.
2024
-
[20]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Wang, Xingyao and others. OpenHands: An Open Platform for AI Software Developers as Generalist Agents. 2024. arXiv preprint arXiv:2407.16741
2024
-
[21]
Plan, Props, and Models of Large Language Models: A Survey
Hao, Shibo and Gu, Yao and Cao, Haodi and others. Plan, Props, and Models of Large Language Models: A Survey. 2023
2023
-
[22]
ReAct: Synergizing Reasoning and Acting in Language Models
Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan. ReAct: Synergizing Reasoning and Acting in Language Models. 2022
2022
-
[23]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Yang, John and Jimenez, Carlos E. and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik R. and Press, Ofir. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. 2024
2024
-
[24]
SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez, Carlos and Yang, John and Wettig, Alexander and others. SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?. 2023
2023
-
[25]
tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Yao, Shunyu and Shinn, Noah and Razavi, Pedram and Narasimhan, Karthik. tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. 2024
2024
-
[26]
Past and Future of Software Architectural Decisions: A Systematic Mapping Study
Tofan, Dan and Galster, Matthias and Avgeriou, Paris and Schuitema, Wes. Past and Future of Software Architectural Decisions: A Systematic Mapping Study. 2014
2014
-
[27]
Toolformer: Language Models Can Teach Themselves to Use Tools
Schick, Timo and Dwivedi-Yu, Jane and Dessì, Roberto and Raileanu, Roberta and Lomeli, Maria and Zettlemoyer, Luke and Cancedda, Nicola and Scialom, Thomas. Toolformer: Language Models Can Teach Themselves to Use Tools. 2023
2023
-
[28]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Zhou, Shuyan and Xu, Frank F. and others. WebArena: A Realistic Web Environment for Building Autonomous Agents. 2023
2023