pith. machine review for the scientific record.

arxiv: 2604.14228 · v1 · submitted 2026-04-14 · 💻 cs.SE · cs.AI · cs.CL · cs.LG

Recognition: unknown

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:27 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.LG
keywords Claude Code · AI agent systems · agent architecture · design principles · permission systems · context management · extensibility mechanisms · agentic coding tools

The pith

Claude Code's architecture is shaped by five human values that lead to concrete choices in permissions, context management, and extensibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the publicly available TypeScript source code of Claude Code to map its architecture back to motivating human values and philosophies. It argues that five key values—human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability—guide the system and are realized through thirteen design principles. These values connect to specific features such as a seven-mode permission system with an ML classifier, a five-layer compaction pipeline, four extensibility mechanisms, subagent worktree isolation, and append-oriented session storage. A side-by-side comparison with OpenClaw shows how the same design questions produce different architectural answers in different deployment contexts. The analysis concludes by listing six open directions for future agent systems.
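The per-action permission gating described above can be pictured as a small decision function: a mode determines whether a tool call is auto-approved, denied, or referred to the user, with a classifier-derived risk score as the deciding signal in borderline cases. A minimal sketch, assuming invented mode names and thresholds — nothing here reflects Claude Code's actual modes or classifier:

```python
# Hypothetical sketch of per-action permission gating of the kind the paper
# attributes to Claude Code. Mode names, thresholds, and the three verdicts
# ("allow" / "ask" / "deny") are invented for illustration; risk_score is
# assumed to come from an ML classifier run over the proposed action.

def gate(action: str, mode: str, risk_score: float) -> str:
    allow_all = {"bypass"}                 # invented: mode that skips checks
    deny_risky = {"default", "plan"}       # invented: modes that block danger
    if mode in allow_all:
        return "allow"
    if mode in deny_risky and risk_score >= 0.8:
        return "deny"                      # high-risk actions blocked outright
    if risk_score >= 0.3:
        return "ask"                       # borderline: defer to the user
    return "allow"
```

The point of the sketch is the division of labor: the mode sets the policy envelope, while the per-action classifier decides where a specific call falls inside it.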

Core claim

Claude Code centers on a simple while-loop that calls the model, runs tools, and repeats, yet most of its code resides in surrounding systems: a permission framework with seven modes and an ML-based classifier, a five-layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append-oriented session storage. The authors trace these elements to five human values and thirteen design principles, then contrast the resulting architecture with OpenClaw to illustrate how deployment context alters the concrete answers to recurring design questions.
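The "simple while-loop" at the core can be sketched in a few lines: call the model, execute any tools it requests, append the results, and repeat until the model stops asking for tools. This is a hedged illustration of the pattern the paper describes, not Claude Code's actual code — the message schema and the names `call_model` and `run_tool` are assumptions:

```python
# Minimal sketch of the model-tool iteration loop, under an assumed message
# schema. call_model returns {"content": str, "tool_calls": [...]}; run_tool
# executes one tool request and returns its textual result.

def agent_loop(messages, call_model, run_tool, max_turns=25):
    for _ in range(max_turns):
        reply = call_model(messages)                  # one model call per turn
        messages.append({"role": "assistant", "content": reply["content"]})
        if not reply.get("tool_calls"):               # no tools requested: done
            return messages
        for call in reply["tool_calls"]:              # run tools, feed results back
            result = run_tool(call["name"], call["args"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    return messages
```

Everything the paper catalogs — permissions, compaction, extensibility, delegation, storage — sits around this loop rather than inside it, which is the architectural observation being made.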

What carries the argument

The core while-loop for model-tool iteration, surrounded by a permission system, compaction pipeline, extensibility mechanisms, subagent isolation, and append-only session storage that together realize the design principles.
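The append-only session storage mentioned above can be illustrated as an event log: each event is appended as one JSON line, and a session is reconstructed by replaying the log. The schema here is invented for the sketch and is not Claude Code's actual on-disk format:

```python
# Illustrative sketch of append-oriented session storage: events are appended
# as JSON lines and the session state is rebuilt by replaying them. The event
# schema is an assumption made for this example.
import io
import json

def append_event(log, event: dict) -> None:
    log.write(json.dumps(event) + "\n")   # append-only: history is never rewritten

def replay(log_text: str) -> list:
    return [json.loads(line) for line in log_text.splitlines() if line]

buf = io.StringIO()
append_event(buf, {"type": "user", "text": "fix the bug"})
append_event(buf, {"type": "tool", "name": "bash", "exit": 0})
events = replay(buf.getvalue())
```

The appeal of this shape for an agent system is that a crash mid-session loses at most the last partial line, and any prior state can be recovered by replaying a prefix of the log.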

If this is right

  • The same design questions yield different architectural answers when deployment context changes from CLI to gateway.
  • Per-action safety classification versus perimeter-level access control represents a key divergence driven by context.
  • Context-window extensions versus gateway-wide capability registration address similar needs with different mechanisms.
  • Future agent systems should explicitly address the six open design directions identified from empirical, architectural, and policy literature.
  • Tracing values through principles to implementations provides a reusable lens for evaluating other agentic coding tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This value-to-implementation mapping could serve as a checklist for open-source agent developers to audit alignment with user priorities.
  • Policy discussions around AI agents could reference these principles when balancing automation with human oversight requirements.
  • Empirical user studies might test whether systems explicitly built on these values produce higher trust or fewer errors in long coding sessions.
  • The approach could extend to partial analyses of other commercial agents if documentation or API behavior is made available.

Load-bearing premise

That the publicly available TypeScript source code, together with the authors' interpretive mapping, sufficiently reveals the true motivating human values, and that the thirteen design principles capture the architecture comprehensively and without selection bias.

What would settle it

A line-by-line review of the Claude Code TypeScript codebase that finds the permission modes, compaction layers, or other listed components bear no traceable link to the five stated values, or that identifies major architectural pieces absent from the thirteen principles.

read the original abstract

Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its comprehensive architecture by analyzing the publicly available TypeScript source code and further comparing it with OpenClaw, an independent open-source AI agent system that answers many of the same design questions from a different deployment context. Our analysis identifies five human values, philosophies, and needs that motivate the architecture (human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core of the system is a simple while-loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML-based classifier, a five-layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append-oriented session storage. A comparison with OpenClaw, a multi-channel personal assistant gateway, shows that the same recurring design questions produce different architectural answers when the deployment context changes: from per-action safety classification to perimeter-level access control, from a single CLI loop to an embedded runtime within a gateway control plane, and from context-window extensions to gateway-wide capability registration. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes the architecture of Claude Code, an agentic coding tool, by inspecting its publicly available TypeScript source code. It identifies five motivating human values (human decision authority, safety and security, reliable execution, capability amplification, contextual adaptability) and traces them through thirteen design principles to concrete mechanisms including a seven-mode permission system with ML classifier, five-layer context compaction pipeline, four extensibility mechanisms (MCP, plugins, skills, hooks), subagent worktree isolation, and append-oriented session storage. A comparison with OpenClaw illustrates how the same design questions yield different solutions under different deployment contexts, and the work concludes with six open design directions for future AI agent systems grounded in empirical and policy literature.

Significance. If the interpretive mapping holds, the paper offers a useful case study for software engineering researchers and practitioners working on AI agents, by concretely linking high-level values to low-level implementation choices and showing context-dependent trade-offs via the OpenClaw contrast. The enumeration of open directions provides a starting point for future work. The analysis is grounded in real artifacts rather than abstract models, which strengthens its potential utility for system designers.

major comments (2)
  1. [Sections describing the value-to-principle tracing and source-code analysis] The central claim—that five specific human values motivate the architecture and are systematically traced through thirteen design principles—rests on manual code inspection without any described formal method (e.g., coding protocol, decision criteria, or inter-rater process) for principle extraction. This interpretive step is load-bearing for the narrative and the subsequent OpenClaw comparison.
  2. [Comparison with OpenClaw] The comparison section asserts that deployment context produces different architectural answers (e.g., per-action safety classification vs. perimeter control), but provides no systematic evaluation framework or metrics to substantiate that the differences are attributable to context rather than author selection.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction would benefit from an explicit statement of the analysis scope (e.g., which version of the TypeScript codebase was examined and the date of inspection) to aid reproducibility.
  2. [Implementation details of the permission system] Figure captions and the description of the seven-mode permission system could clarify how the ML-based classifier integrates with the modes, as the current text leaves the interaction somewhat implicit.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the transparency of our interpretive analysis and the framing of the OpenClaw comparison. We address each major comment below and outline targeted revisions.

read point-by-point responses
  1. Referee: [Sections describing the value-to-principle tracing and source-code analysis] The central claim—that five specific human values motivate the architecture and are systematically traced through thirteen design principles—rests on manual code inspection without any described formal method (e.g., coding protocol, decision criteria, or inter-rater process) for principle extraction. This interpretive step is load-bearing for the narrative and the subsequent OpenClaw comparison.

    Authors: We agree that greater methodological transparency would strengthen the paper. The analysis was performed via iterative manual inspection of the publicly available TypeScript source, beginning with identification of the core model-tool loop and then examining surrounding subsystems (permissions, context compaction, extensibility, delegation, and storage) for recurring patterns. Values were mapped based on alignment between implementation choices and documented design goals in code comments, error handling, and user-facing safeguards. To address the concern, we will add a dedicated 'Analysis Methodology' subsection that explicitly describes this process, the decision criteria for extracting the thirteen principles, and the rationale for linking them to the five values. This will make the interpretive steps reproducible in principle while preserving the exploratory nature of the study. revision: yes

  2. Referee: [Comparison with OpenClaw] The comparison section asserts that deployment context produces different architectural answers (e.g., per-action safety classification vs. perimeter control), but provides no systematic evaluation framework or metrics to substantiate that the differences are attributable to context rather than author selection.

    Authors: The OpenClaw section is presented as an illustrative contrast to demonstrate how identical design questions receive different answers under different deployment constraints, rather than as a controlled empirical comparison. We do not claim statistical attribution or provide quantitative metrics because the intent is to surface concrete trade-offs for system designers. We will revise the section to (a) explicitly label it as an illustrative case study, (b) add a side-by-side table summarizing the six recurring design questions and their resolutions in each system, and (c) include a short limitations paragraph acknowledging that observed differences may also reflect project scope, developer priorities, and implementation timelines in addition to deployment context. revision: partial

Circularity Check

0 steps flagged

No circularity: purely descriptive mapping of external code artifacts

full rationale

The paper's central activity is manual inspection of publicly available TypeScript source code for Claude Code, followed by interpretive labeling of observed mechanisms with five human values and thirteen design principles. No equations, fitted parameters, predictions, or first-principles derivations exist. The claimed tracing from values to principles to implementations is performed by author inspection rather than by any self-referential reduction or self-citation chain that would make the output equivalent to the input by construction. External comparison with OpenClaw is likewise descriptive. Per the evaluation rules, this is self-contained interpretive analysis with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the premise that source-code inspection can reliably surface human values and design principles; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption — Publicly available TypeScript source code accurately and completely represents the deployed system's design decisions and motivations.
    The entire tracing of values through principles to implementations depends on this premise.

pith-pipeline@v0.9.0 · 5585 in / 1333 out tokens · 45899 ms · 2026-05-10T14:27:55.357804+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

    cs.AI 2026-05 unverdicted novelty 6.0

    An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...

  2. HARBOR: Automated Harness Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.

  3. Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification

    cs.CY 2026-04 unverdicted novelty 4.0

    DEMM defines four executable evidence-sufficiency categories plus a conflicting category for agentic AI decisions and rolls per-property verdicts into a five-level maturity rubric.

Reference graph

Works this paper leans on

53 extracted references · 38 canonical work pages · cited by 3 Pith papers · 24 internal anchors
