arxiv: 2604.14228 · v1 · submitted 2026-04-14 · 💻 cs.SE · cs.AI· cs.CL· cs.LG

Recognition: unknown

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Jiacheng Liu, Xiaohan Zhao, Xinyi Shang, Zhiqiang Shen

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:27 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.CLcs.LG

keywords Claude CodeAI agent systemsagent architecturedesign principlespermission systemscontext managementextensibility mechanismsagentic coding tools

0 comments

The pith

Claude Code's architecture is shaped by five human values that lead to concrete choices in permissions, context management, and extensibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the publicly available TypeScript source code of Claude Code to map its architecture back to motivating human values and philosophies. It argues that five key values—human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability—guide the system and are realized through thirteen design principles. These values connect to specific features such as a seven-mode permission system with an ML classifier, a five-layer compaction pipeline, four extensibility mechanisms, subagent worktree isolation, and append-oriented session storage. A side-by-side comparison with OpenClaw shows how the same design questions produce different architectural answers in different deployment contexts. The analysis concludes by listing six open directions for future agent systems.

Core claim

Claude Code centers on a simple while-loop that calls the model, runs tools, and repeats, yet most of its code resides in surrounding systems: a permission framework with seven modes and an ML-based classifier, a five-layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append-oriented session storage. The authors trace these elements to five human values and thirteen design principles, then contrast the resulting architecture with OpenClaw to illustrate how deployment context alters the concrete answers to recurring design questions.

What carries the argument

The core while-loop for model-tool iteration, surrounded by a permission system, compaction pipeline, extensibility mechanisms, subagent isolation, and append-only session storage that together realize the design principles.

If this is right

The same design questions yield different architectural answers when deployment context changes from CLI to gateway.
Per-action safety classification versus perimeter-level access control represents a key divergence driven by context.
Context-window extensions versus gateway-wide capability registration address similar needs with different mechanisms.
Future agent systems should explicitly address the six open design directions identified from empirical, architectural, and policy literature.
Tracing values through principles to implementations provides a reusable lens for evaluating other agentic coding tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This value-to-implementation mapping could serve as a checklist for open-source agent developers to audit alignment with user priorities.
Policy discussions around AI agents could reference these principles when balancing automation with human oversight requirements.
Empirical user studies might test whether systems explicitly built on these values produce higher trust or fewer errors in long coding sessions.
The approach could extend to partial analyses of other commercial agents if documentation or API behavior is made available.

Load-bearing premise

That the publicly available TypeScript source code and the authors' interpretive mapping sufficiently reveal the true motivating human values and that the thirteen design principles comprehensively capture the architecture without selection bias.

What would settle it

A line-by-line review of the Claude Code TypeScript codebase that finds the permission modes, compaction layers, or other listed components bear no traceable link to the five stated values, or that identifies major architectural pieces absent from the thirteen principles.

read the original abstract

Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its comprehensive architecture by analyzing the publicly available TypeScript source code and further comparing it with OpenClaw, an independent open-source AI agent system that answers many of the same design questions from a different deployment context. Our analysis identifies five human values, philosophies, and needs that motivate the architecture (human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core of the system is a simple while-loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML-based classifier, a five-layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append-oriented session storage. A comparison with OpenClaw, a multi-channel personal assistant gateway, shows that the same recurring design questions produce different architectural answers when the deployment context changes: from per-action safety classification to perimeter-level access control, from a single CLI loop to an embedded runtime within a gateway control plane, and from context-window extensions to gateway-wide capability registration. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Claude Code's internals are dissected usefully here with a solid OpenClaw comparison, but the human-values-to-principles link lacks a reproducible extraction method.

read the letter

Claude Code gets a detailed architectural breakdown from its open TypeScript code, paired with a comparison to OpenClaw that shows how context changes the design answers. That is the useful part. The paper does a good job laying out the core while-loop and then the surrounding pieces: the permission system with seven modes and an ML classifier, the five-layer compaction for context, the four extensibility mechanisms, subagent isolation, and append-only storage. The OpenClaw contrast highlights differences like perimeter access control versus per-action checks and embedded runtime versus CLI loop. The five values and thirteen principles give a way to tie these choices back to human needs like safety and adaptability, and the six open directions are reasonable. The soft spot is the extraction process. The authors state the values and principles after examining the code, but no formal scheme, inter-rater check, or decision tree is described. This leaves room for selection bias, and the same mapping applies to the comparison. It is not fatal, but it means the tracing is more author synthesis than independently verifiable. This paper is for people in software engineering and AI who want to see how real agent systems handle the practical problems of permissions, context, and delegation. A reader looking for implementation details and trade-offs will get value from it. It deserves a serious referee because the case study is grounded in actual code and adds a comparative angle that is not common. I recommend sending it to peer review. The referees can ask for more transparency on how the principles were identified, which would strengthen the work without changing its core contribution.

Referee Report

2 major / 2 minor

Summary. The paper analyzes the architecture of Claude Code, an agentic coding tool, by inspecting its publicly available TypeScript source code. It identifies five motivating human values (human decision authority, safety and security, reliable execution, capability amplification, contextual adaptability) and traces them through thirteen design principles to concrete mechanisms including a seven-mode permission system with ML classifier, five-layer context compaction pipeline, four extensibility mechanisms (MCP, plugins, skills, hooks), subagent worktree isolation, and append-oriented session storage. A comparison with OpenClaw illustrates how the same design questions yield different solutions under different deployment contexts, and the work concludes with six open design directions for future AI agent systems grounded in empirical and policy literature.

Significance. If the interpretive mapping holds, the paper offers a useful case study for software engineering researchers and practitioners working on AI agents, by concretely linking high-level values to low-level implementation choices and showing context-dependent trade-offs via the OpenClaw contrast. The enumeration of open directions provides a starting point for future work. The analysis is grounded in real artifacts rather than abstract models, which strengthens its potential utility for system designers.

major comments (2)

[Sections describing the value-to-principle tracing and source-code analysis] The central claim—that five specific human values motivate the architecture and are systematically traced through thirteen design principles—rests on manual code inspection without any described formal method (e.g., coding protocol, decision criteria, or inter-rater process) for principle extraction. This interpretive step is load-bearing for the narrative and the subsequent OpenClaw comparison.
[Comparison with OpenClaw] The comparison section asserts that deployment context produces different architectural answers (e.g., per-action safety classification vs. perimeter control), but provides no systematic evaluation framework or metrics to substantiate that the differences are attributable to context rather than author selection.

minor comments (2)

[Abstract and Introduction] The abstract and introduction would benefit from an explicit statement of the analysis scope (e.g., which version of the TypeScript codebase was examined and the date of inspection) to aid reproducibility.
[Implementation details of the permission system] Figure captions and the description of the seven-mode permission system could clarify how the ML-based classifier integrates with the modes, as the current text leaves the interaction somewhat implicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the transparency of our interpretive analysis and the framing of the OpenClaw comparison. We address each major comment below and outline targeted revisions.

read point-by-point responses

Referee: [Sections describing the value-to-principle tracing and source-code analysis] The central claim—that five specific human values motivate the architecture and are systematically traced through thirteen design principles—rests on manual code inspection without any described formal method (e.g., coding protocol, decision criteria, or inter-rater process) for principle extraction. This interpretive step is load-bearing for the narrative and the subsequent OpenClaw comparison.

Authors: We agree that greater methodological transparency would strengthen the paper. The analysis was performed via iterative manual inspection of the publicly available TypeScript source, beginning with identification of the core model-tool loop and then examining surrounding subsystems (permissions, context compaction, extensibility, delegation, and storage) for recurring patterns. Values were mapped based on alignment between implementation choices and documented design goals in code comments, error handling, and user-facing safeguards. To address the concern, we will add a dedicated 'Analysis Methodology' subsection that explicitly describes this process, the decision criteria for extracting the thirteen principles, and the rationale for linking them to the five values. This will make the interpretive steps reproducible in principle while preserving the exploratory nature of the study. revision: yes
Referee: [Comparison with OpenClaw] The comparison section asserts that deployment context produces different architectural answers (e.g., per-action safety classification vs. perimeter control), but provides no systematic evaluation framework or metrics to substantiate that the differences are attributable to context rather than author selection.

Authors: The OpenClaw section is presented as an illustrative contrast to demonstrate how identical design questions receive different answers under different deployment constraints, rather than as a controlled empirical comparison. We do not claim statistical attribution or provide quantitative metrics because the intent is to surface concrete trade-offs for system designers. We will revise the section to (a) explicitly label it as an illustrative case study, (b) add a side-by-side table summarizing the six recurring design questions and their resolutions in each system, and (c) include a short limitations paragraph acknowledging that observed differences may also reflect project scope, developer priorities, and implementation timelines in addition to deployment context. revision: partial

Circularity Check

0 steps flagged

No circularity: purely descriptive mapping of external code artifacts

full rationale

The paper's central activity is manual inspection of publicly available TypeScript source code for Claude Code, followed by interpretive labeling of observed mechanisms with five human values and thirteen design principles. No equations, fitted parameters, predictions, or first-principles derivations exist. The claimed tracing from values to principles to implementations is performed by author inspection rather than by any self-referential reduction or self-citation chain that would make the output equivalent to the input by construction. External comparison with OpenClaw is likewise descriptive. Per the evaluation rules, this is self-contained interpretive analysis with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the premise that source-code inspection can reliably surface human values and design principles; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Publicly available TypeScript source code accurately and completely represents the deployed system's design decisions and motivations.
The entire tracing of values through principles to implementations depends on this premise.

pith-pipeline@v0.9.0 · 5585 in / 1333 out tokens · 45899 ms · 2026-05-10T14:27:55.357804+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
cs.AI 2026-05 unverdicted novelty 6.0

An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...
HARBOR: Automated Harness Optimization
cs.LG 2026-04 unverdicted novelty 6.0

HARBOR formalizes harness optimization as constrained noisy Bayesian optimization over mixed-variable spaces and reports a case study where it outperforms manual tuning on a production coding agent.
Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification
cs.CY 2026-04 unverdicted novelty 4.0

DEMM defines four executable evidence-sufficiency categories plus a conflicting category for agentic AI decisions and rolls per-property verdicts into a five-level maturity rubric.

Reference graph

Works this paper leans on

53 extracted references · 38 canonical work pages · cited by 3 Pith papers · 23 internal anchors

[1]

Anthropic PBC, no

Bartz v. Anthropic PBC, no. 3:24-cv-05417-WHA. U.S. District Court for the Northern District of California, Order on Motion for Summary Judgment (June 23, 2025), Alsup, J. Court docket:https://www.courtlistener.com/docket/ 69058235/bartz-v-anthropic-pbc/,

2025
[2]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691,

work page internal anchor Pith review arXiv
[3]

The vibe-check protocol: Quantifying cognitive offloading in ai programming.arXiv preprint arXiv:2601.02410,

Aizierjiang Aiersilan. The vibe-check protocol: Quantifying cognitive offloading in ai programming.arXiv preprint arXiv:2601.02410,

work page arXiv
[4]

InversePrompt: Turning claude against itself, one prompt at a time

Elad Beber. InversePrompt: Turning claude against itself, one prompt at a time. https://cymulate.com/blog/ cve-2025-547954-54795-claude-inverseprompt/,

2025
[5]

CVE-2025-54794, CVE-2025-54795; updated April 6,

2025
[6]

Measuring the impact of early-2025 AI on experienced open-source developer productivity.CoRR, abs/2507.09089, 2025

Joel Becker, Nate Rush, Elizabeth Barnes, and David Rein. Measuring the impact of early-2025 ai on experienced open-source developer productivity.arXiv preprint arXiv:2507.09089,

work page arXiv 2025
[7]

International ai safety report 2026.arXiv preprint arXiv:2602.21012, 2026

Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Malcolm Murray, Rishi Bommasani, Stephen Casper, Tom Davidson, Raymond Douglas, et al. International ai safety report 2026.arXiv preprint arXiv:2602.21012,

work page arXiv 2026
[8]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,

work page internal anchor Pith review arXiv
[9]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

38 Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023.URL https://arxiv. org/abs/2307.15818, 1:2,

work page internal anchor Pith review arXiv 2023
[11]

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent llm systems fail?arXiv preprint arXiv:2503.13657,

work page internal anchor Pith review arXiv
[12]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Need help? designing proactive ai assistants for programming

Valerie Chen, Alan Zhu, Sebastian Zhao, Hussein Mozannar, David Sontag, and Ameet Talwalkar. Need help? designing proactive ai assistants for programming. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–18,

2025
[14]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413,

work page internal anchor Pith review arXiv
[15]

Caught in the hook: RCE and API token ex- filtration through Claude Code project files

Aviv Donenfeld and Oded Vanunu. Caught in the hook: RCE and API token ex- filtration through Claude Code project files. https://research.checkpoint.com/2026/ rce-and-api-token-exfiltration-through-claude-code-project-files-cve-2025-59536/ ,

2026
[16]

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch

CVE-2025-59536 (CVSS 8.7), CVE-2026-21852 (CVSS 5.3). Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. InForty-first international conference on machine learning,

2025
[17]

Towards an AI co-scientist

Paul Gauthier. Aider: AI pair programming in your terminal, 2024.https://github.com/Aider-AI/aider. Open-source software,https://aider.chat. 39 Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864,

work page internal anchor Pith review arXiv 2024
[18]

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xian- gliang Zhang. Large language model based multi-agents: A survey of progress and challenges.arXiv preprint arXiv:2402.01680,

work page internal anchor Pith review arXiv
[19]

Does ai-assisted coding deliver? a difference-in-differences study of cursor’s impact on software projects.CoRR, abs/2511.04427, 2025

Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, and Bogdan Vasilescu. Speed at the cost of quality: How cursor ai increases short-term velocity and long-term complexity in open-source projects.arXiv preprint arXiv:2511.04427,

work page arXiv
[20]

arXiv preprint arXiv:2408.08435 , year=

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems.arXiv preprint arXiv:2408.08435,

work page arXiv
[21]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of ai agents.arXiv preprint arXiv:2512.13564,

work page internal anchor Pith review arXiv
[22]

Rethinking memory mechanisms of foundation agents in the second half.arXiv preprint arXiv:2602.06052,

Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, et al. Rethinking memory mechanisms of foundation agents in the second half.arXiv preprint arXiv:2602.06052,

work page arXiv
[23]

Agents.https://huyenchip.com/2025/01/07/agents.html,

Chip Huyen. Agents.https://huyenchip.com/2025/01/07/agents.html,

2025
[24]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Kapoor and A

Sayash Kapoor, Benedikt Stroebl, Zachary S Siegel, Nitya Nadgir, and Arvind Narayanan. Ai agents that matter. arXiv preprint arXiv:2407.01502,

work page arXiv
[26]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

November 2023; popularizes the LLM-as-OS framing. Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714,

work page internal anchor Pith review arXiv 2023
[27]

Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task

Nataliya Kosmyna, Eugene Hauptmann, Ye Tong Yuan, Jessica Situ, Xian-Hao Liao, Ashly Vivian Beresnitzky, Iris Braunstein, and Pattie Maes. Your brain on chatgpt: Accumulation of cognitive debt when using an ai assistant for essay writing task.arXiv preprint arXiv:2506.08872, 4,

work page arXiv
[28]

LangGraph: Build resilient language agents as graphs, 2024.https://github.com/langchain-ai/ langgraph

LangChain, Inc. LangGraph: Build resilient language agents as graphs, 2024.https://github.com/langchain-ai/ langgraph. GitHub repository. Geonsun Lee, Min Xia, Nels Numan, Xun Qian, David Li, Yanhe Chen, Achin Kulshrestha, Ishan Chatterjee, Yinda Zhang, Dinesh Manocha, et al. Sensible agent: A framework for unobtrusive interaction with proactive ar agents...

2024
[29]

Encouraging divergent thinking in large language models through multi-agent debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 17889–17904,

2024
[30]

Proactive conversational agents with inner thoughts

Xingyu Bruce Liu, Shitao Fang, Weiyan Shi, Chien-Sheng Wu, Takeo Igarashi, and Xiang’Anthony’ Chen. Proactive conversational agents with inner thoughts. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–19,

2025
[31]

Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild

Yue Liu, Ratnadira Widyasari, Yanjie Zhao, Ivana Clairine Irsan, and David Lo. Debt behind the ai boom: A large-scale empirical study of ai-generated code in the wild.arXiv preprint arXiv:2603.28592,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292,

work page internal anchor Pith review arXiv
[33]

Agent design patterns.https://rlancemartin.github.io/2026/01/09/agent_design/,

Lance Martin. Agent design patterns.https://rlancemartin.github.io/2026/01/09/agent_design/,

2026
[34]

AI Agents Under EU Law

Luca Nannini, Adam Leon Smith, Michele Joshua Maggini, Enrico Panai, Sandra Feliciano, Aleksandr Tiulkanov, Elena Maran, James Gealy, and Piercosma Bisconti. Ai agents under eu law.arXiv preprint arXiv:2604.04604,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân V˜ u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131,

work page internal anchor Pith review arXiv
[36]

Beyond reactivity: Measuring proactive problem solving in llm agents

Gil Pasternak, Dheeraj Rajagopal, Julia White, Dhruv Atreja, Matthew Thomas, George Hurn-Maloney, and Ash Lewis. Beyond reactivity: Measuring proactive problem solving in llm agents.arXiv preprint arXiv:2510.19771,

work page arXiv
[37]

Pathak, H

Divya Pathak, Harshit Kumar, Anuska Roy, Felix George, Mudit Verma, and Pratibha Moogi. Detecting silent failures in multi-agentic ai trajectories.arXiv preprint arXiv:2511.04032,

work page arXiv
[38]

Do users write more insecure code with ai assistants? InProceedings of the 2023 ACM SIGSAC conference on computer and communications security, pages 2785–2799,

Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. Do users write more insecure code with ai assistants? InProceedings of the 2023 ACM SIGSAC conference on computer and communications security, pages 2785–2799,

2023
[39]

Assistance or disruption? exploring and evaluating the design and trade-offs of proactive ai programming support

Kevin Pu, Daniel Lazaro, Ian Arawjo, Haijun Xia, Ziang Xiao, Tovi Grossman, and Yan Chen. Assistance or disruption? exploring and evaluating the design and trade-offs of proactive ai programming support. InProceedings of the 2025 CHI conference on human factors in computing systems, pages 1–21,

2025
[40]

How to stay ahead of AI as an early-career engineer.IEEE Spectrum, 2025.https://spectrum.ieee

Gwendolyn Rak. How to stay ahead of AI as an early-career engineer.IEEE Spectrum, 2025.https://spectrum.ieee. org/ai-effect-entry-level-jobs. Charles Reis and Steven D Gribble. Isolating web programs in modern browser architectures. InProceedings of the 4th ACM European conference on Computer systems, pages 219–232,

2025
[41]

How ai impacts skill formation.arXiv preprint arXiv:2601.20245,

Judy Hanwen Shen and Alex Tamkin. How ai impacts skill formation.arXiv preprint arXiv:2601.20245,

work page arXiv
[42]

The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems

Leon Staufer, Kevin Feng, Kevin Wei, Luke Bailey, Yawen Duan, Mick Yang, A Pinar Ozisik, Stephen Casper, and Noam Kolt. The 2025 ai agent index: Documenting technical and safety features of deployed agentic ai systems. arXiv preprint arXiv:2602.17753,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Developer Productivity With and Without GitHub Copilot: A Longitudinal Mixed-Methods Case Study

Open-source multi-channel AI assistant gateway. MIT License. Viktoria Stray, Elias Goldmann Brandtzæg, Viggo Tellefsen Wivestad, Astri Barbala, and Nils Brede Moe. Devel- oper productivity with and without github copilot: A longitudinal mixed-methods case study.arXiv preprint arXiv:2509.20353,

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Act while thinking: Accelerating llm agents via pattern-aware speculative tool execution.arXiv preprint arXiv:2603.18897, 2026

Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, and Yuqing Yang. Act while thinking: Accelerating llm agents via pattern-aware speculative tool execution.arXiv preprint arXiv:2603.18897,

work page arXiv
[45]

Training proactive and personalized llm agents.arXiv preprint arXiv:2511.02208, 2025

Weiwei Sun, Xuhui Zhou, Weihua Du, Xingyao Wang, Sean Welleck, Graham Neubig, Maarten Sap, and Yiming Yang. Training proactive and personalized llm agents.arXiv preprint arXiv:2511.02208,

work page arXiv
[46]

com/atlas/ai-infrastructure-roadmap-five-frontiers-for-2026,

Bessemer Venture Partners,https://www.bvp. com/atlas/ai-infrastructure-roadmap-five-frontiers-for-2026,

2026
[47]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291,

work page internal anchor Pith review Pith/arXiv arXiv
[48]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741, 2024b. Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory.arXiv preprint arXiv...

work page internal anchor Pith review arXiv 2023
[49]

Qing Xiao, Xinlan Emily Hu, Mark E Whiting, Arvind Karunakaran, Hong Shen, and Hancheng Cao. Ai hasn’t fixed teamwork, but it shifted collaborative culture: A longitudinal study in a project-based software development organization (2023-2025).arXiv preprint arXiv:2509.10956,

work page arXiv 2023
[50]

Ai agent systems: Architectures, applications, and evaluation,

Bin Xu. Ai agent systems: Architectures, applications, and evaluation.arXiv preprint arXiv:2601.01743,

work page arXiv
[51]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110,

work page internal anchor Pith review arXiv
[52]

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan.τ-bench: A benchmark for tool-agent-user interaction in real-world domains.arXiv preprint arXiv:2406.12045,

work page internal anchor Pith review arXiv
[53]

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models.arXiv preprint arXiv:2510.04618, 2025a. Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, an...

work page internal anchor Pith review arXiv