From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

Shangding Gu

arxiv: 2605.26112 · v1 · pith:6DHAX443new · submitted 2026-05-25 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-29 21:27 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords: agentic AI, system scaling, harness design, foundation models, agent evaluation, context governance, trustworthy memory, orchestration

The pith

Agentic AI progress requires scaling the system harness around foundation models as much as improving the models themselves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current evaluation practices treat agent success as a property of the foundation model alone, but long-horizon behavior actually arises from how the model interacts with surrounding components for memory, context, skill selection, orchestration, and verification. It defines the agent harness as the structured execution layer that translates model outputs into reliable agent trajectories. Shifting focus to harness scaling means treating these components as objects of deliberate design, measurement, and optimization rather than implementation details. If this view holds, then benchmarks and research agendas must track system-level properties such as memory hygiene and verification cost in addition to task completion.

Core claim

Agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer, which together form the agent harness; therefore future advances depend on scaling this harness through better context governance, trustworthy memory, and dynamic skill routing alongside model improvements.

What carries the argument

The agent harness, the structured execution layer consisting of context governance, trustworthy memory, dynamic skill routing, orchestration, and governance mechanisms that together translate model capability into sustained agent behavior.
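
To make the component list concrete, here is a minimal sketch of how such a harness could be wired together. It is not the CheetahClaws API or the author's design: the class name Harness, the "name:arg" action format, the stub model, and the context_budget parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical harness. Component roles follow the paper's list; the API is invented."""
    model: Callable[[str], str]               # foundation model: prompt -> action string
    skills: dict[str, Callable[[str], str]]   # skill-routing table: name -> callable
    verify: Callable[[str], bool]             # verification-and-governance check
    memory: list[str] = field(default_factory=list)  # memory substrate
    context_budget: int = 3                   # max memory entries placed in the prompt

    def build_context(self, task: str) -> str:
        # Context constructor: include only the most recent memory entries.
        recent = self.memory[-self.context_budget:] if self.context_budget > 0 else []
        return "\n".join(recent + [task])

    def run(self, task: str, max_steps: int = 5) -> list[str]:
        # Orchestration loop: ask the model, route to a skill, verify, remember.
        trajectory: list[str] = []
        for _ in range(max_steps):
            action = self.model(self.build_context(task))
            trajectory.append(action)
            name, _, arg = action.partition(":")
            if name == "done":
                break
            skill = self.skills.get(name)
            if skill is None:
                self.memory.append(f"error: unknown skill {name}")
                continue
            result = skill(arg)
            # Only verified results are written to memory.
            if self.verify(result):
                self.memory.append(f"{name} -> {result}")
        return trajectory


def stub_model(prompt: str) -> str:
    # Stand-in for a foundation model: stop once a verified result is in context.
    return "done:" if "add ->" in prompt else "add:2,3"


def add_skill(arg: str) -> str:
    return str(sum(int(x) for x in arg.split(",")))


harness = Harness(model=stub_model, skills={"add": add_skill}, verify=str.isdigit)
trajectory = harness.run("compute 2+3")
print(trajectory, harness.memory)
```

In this sketch every component is a plain callable or list, so any one of them can be swapped without touching the model, which is the property the harness framing emphasizes.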

If this is right

Benchmarks must measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time rather than one-shot task success.
Research should target the three core bottlenecks of context governance, trustworthy memory, and dynamic skill routing.
Orchestration and governance mechanisms must be designed to coordinate and constrain the other harness components.
Evaluation frameworks should treat the full harness as the unit of analysis instead of isolating the foundation model.
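
As a rough illustration of what harness-level measurement could look like, the sketch below computes a few of the quantities named above from a logged trajectory. The paper does not define these metrics formally; the Step fields and the formulas here are assumed operationalizations, not the author's definitions.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One logged agent step. All fields are hypothetical annotations."""
    context_tokens: int     # tokens placed in the prompt at this step
    used_tokens: int        # tokens of that context the step actually relied on
    verifier_calls: int     # verification checks run at this step
    memory_writes: int      # entries written to memory
    bad_memory_writes: int  # writes later judged stale or incorrect
    succeeded: bool         # whether the step achieved its sub-goal


def harness_metrics(steps: list[Step]) -> dict[str, float]:
    if not steps:
        raise ValueError("trajectory is empty")
    total_ctx = sum(s.context_tokens for s in steps)
    total_writes = sum(s.memory_writes for s in steps)
    bad_writes = sum(s.bad_memory_writes for s in steps)
    return {
        # Trajectory quality, approximated as the fraction of successful steps.
        "step_success_rate": sum(s.succeeded for s in steps) / len(steps),
        # Context efficiency: share of prompt tokens that were actually used.
        "context_efficiency": sum(s.used_tokens for s in steps) / total_ctx if total_ctx else 0.0,
        # Memory hygiene: share of memory writes that were not bad.
        "memory_hygiene": 1 - bad_writes / total_writes if total_writes else 1.0,
        # Verification cost: average verifier calls per step.
        "verification_cost": sum(s.verifier_calls for s in steps) / len(steps),
    }


# Made-up log used only to exercise the function; not measurements.
example_log = [
    Step(context_tokens=1000, used_tokens=250, verifier_calls=1,
         memory_writes=2, bad_memory_writes=0, succeeded=True),
    Step(context_tokens=1000, used_tokens=750, verifier_calls=3,
         memory_writes=2, bad_memory_writes=1, succeeded=False),
]
metrics = harness_metrics(example_log)
print(metrics)
```

Two agents with the same final-task success can differ on every one of these numbers, which is why the agenda treats the whole trajectory, rather than the final answer, as the thing to score.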

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If harness interactions dominate, then two different foundation models paired with identical harness designs may produce similar agent performance.
Modular harness designs could allow components to be improved or replaced without retraining or replacing the underlying model.
Focus on harness scaling might reduce reliance on ever-larger models for achieving reliable long-horizon agent behavior in practice.

Load-bearing premise

Agent performance is primarily limited by interactions among the harness components rather than by the capabilities of the underlying foundation model.

What would settle it

A controlled comparison in which increasing only model size produces large gains in long-horizon tasks while changes to harness components produce little or no additional improvement.
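
One way to analyze such a comparison is a 2x2 factorial design that crosses model size with harness version. The sketch below is only an assumption about how the analysis could be set up: the condition labels, the decision thresholds, and the success rates are placeholders chosen to exercise the code, not results from the paper or from any experiment.

```python
Rates = dict[tuple[str, str], float]  # (model, harness) -> long-horizon success rate


def factorial_effects(rates: Rates) -> dict[str, float]:
    """Main effects and interaction for a 2x2 design.

    Expects models {"small", "large"} crossed with harnesses {"base", "improved"}.
    """
    model_gain_base = rates["large", "base"] - rates["small", "base"]
    model_gain_impr = rates["large", "improved"] - rates["small", "improved"]
    harness_gain_small = rates["small", "improved"] - rates["small", "base"]
    harness_gain_large = rates["large", "improved"] - rates["large", "base"]
    return {
        "model": (model_gain_base + model_gain_impr) / 2,
        "harness": (harness_gain_small + harness_gain_large) / 2,
        # Nonzero when the harness gain depends on which model is used.
        "interaction": harness_gain_large - harness_gain_small,
    }


def contradicts_premise(effects: dict[str, float],
                        min_model_gain: float = 0.1,
                        max_harness_gain: float = 0.02) -> bool:
    # Hypothetical decision rule: a large model effect with a negligible
    # harness effect would count against the load-bearing premise.
    return effects["model"] >= min_model_gain and abs(effects["harness"]) <= max_harness_gain


# Placeholder rates, not measurements.
toy_rates: Rates = {
    ("small", "base"): 0.25,
    ("large", "base"): 0.5,
    ("small", "improved"): 0.5,
    ("large", "improved"): 0.75,
}
effects = factorial_effects(toy_rates)
print(effects, contradicts_premise(effects))
```

Holding the task suite and evaluation budget fixed across all four cells is what would make the two main effects comparable.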

Figures

Figures reproduced from arXiv: 2605.26112 by Shangding Gu.

Original abstract

This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Although recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, evaluation remains largely model-centric, often reducing agents to final-task success while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks: context governance, trustworthy memory, and dynamic skill routing, together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws: https://github.com/SafeRL-Lab/cheetahclaws, a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Position paper pushes 'harness' framing for agent systems but asserts without data that it will matter as much as model scaling.

The key takeaway: this paper is mostly a call to pay more attention to the infrastructure around LLMs in agents, but it offers no hard numbers or experiments to back up the idea that this infrastructure is the next big scaling bottleneck.

What stands out as new is the specific framing around 'scaling the harness' and the three bottlenecks they name: context governance, trustworthy memory, and dynamic skill routing. They also ship a Python reference implementation on GitHub, which is more than many position papers do. The discussion of why current evals are inadequate for long-horizon agents is on point — final answer accuracy doesn't tell you about memory leaks or verification overhead.

The paper does well at laying out a research agenda for new kinds of benchmarks that track trajectory quality and safe evolution. That could be helpful for people designing agent systems.

The main weakness is the lack of evidence for the central assertion. The author says agent performance emerges from the interaction of harness components and that progress will depend as much on system design as on stronger models, but offers no data to show it: no controlled tests and no scaling laws for harness improvements, only the claim. The comparisons to Claude Code and OpenClaw are qualitative, so they do not show whether harness factors are load-bearing.

This paper is aimed at researchers and engineers working on agent frameworks who are frustrated with model-only thinking. It might get people thinking about better evaluation, but it doesn't deliver new results that others can build on directly.

I wouldn't send it for peer review as is. It reads more like an extended blog post or workshop position statement. If the author added actual runs or measurements from CheetahClaws, it could be worth a second look.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that the next major bottleneck in agentic AI is system scaling via the 'agent harness'—the structured execution layer around foundation models encompassing context governance, trustworthy memory, dynamic skill routing, orchestration, and governance—rather than model scaling alone. Agent performance is said to emerge from interactions among these components; model-centric evaluations are critiqued as inadequate; a research agenda for harness-level benchmarks (trajectory quality, memory hygiene, context efficiency, verification cost, safe evolution) is outlined; and CheetahClaws is presented as a Python-native reference implementation with qualitative comparisons to Claude Code and OpenClaw. The central thesis is that future progress will depend as much on system design as on stronger foundation models.

Significance. If the central claim holds, the perspective could redirect agentic AI research toward holistic system architectures and evaluations that capture long-horizon behaviors beyond one-shot task success. The reference implementation and benchmark agenda supply a concrete foundation for follow-on work. The paper offers no empirical support, however, so its influence would depend on subsequent validation.

major comments (3)

[Abstract] Abstract: the claim that 'future progress in agentic AI will depend as much on system design as on stronger foundation models' is asserted without any scaling curves, ablation results, or quantitative comparisons showing that harness modifications produce gains comparable in magnitude to those from model scaling.
[CheetahClaws comparison] Section introducing CheetahClaws and comparisons: the qualitative discussion of CheetahClaws versus Claude Code and OpenClaw contains no performance metrics, trajectory measurements, or controlled variations of harness components (context governance, memory, routing) that would allow assessment of their relative impact.
[Research agenda] Research agenda paragraph: the proposed harness-level benchmarks (trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, safe evolution) are described but never executed or illustrated with even preliminary data, leaving the assertion that model-centric evaluation is inadequate without direct support.

minor comments (1)

[Introduction / harness definition] The components of the agent harness are listed repeatedly but never accompanied by a diagram or formal interface specification that would clarify data flow and dependencies among context constructor, memory substrate, skill-routing layer, and verification layer.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed review. This is a position paper whose goal is to articulate a hypothesis and research agenda rather than to deliver empirical results. We address each major comment below and indicate where we will revise the manuscript to improve clarity of scope and framing.

Referee: [Abstract] Abstract: the claim that 'future progress in agentic AI will depend as much on system design as on stronger foundation models' is asserted without any scaling curves, ablation results, or quantitative comparisons showing that harness modifications produce gains comparable in magnitude to those from model scaling.

Authors: We agree the claim is presented without new quantitative evidence. The manuscript is a perspective piece that advances the hypothesis to motivate future work; it does not claim to have demonstrated the relative magnitude of gains. We will revise the abstract to state explicitly that the central claim is a forward-looking hypothesis whose validation is part of the proposed research agenda.

Revision: yes

Referee: [CheetahClaws comparison] Section introducing CheetahClaws and comparisons: the qualitative discussion of CheetahClaws versus Claude Code and OpenClaw contains no performance metrics, trajectory measurements, or controlled variations of harness components (context governance, memory, routing) that would allow assessment of their relative impact.

Authors: The comparisons are deliberately qualitative to illustrate differences in harness architecture and design choices. Controlled quantitative experiments that isolate harness components would constitute a separate empirical study outside the scope of this position paper. We will add an explicit statement that the section offers architectural contrast only and that systematic measurement of component impact belongs to the future benchmark agenda.

Revision: yes

Referee: [Research agenda] Research agenda paragraph: the proposed harness-level benchmarks (trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, safe evolution) are described but never executed or illustrated with even preliminary data, leaving the assertion that model-centric evaluation is inadequate without direct support.

Authors: The agenda is offered as a community roadmap rather than results obtained in this work. We acknowledge that the critique of model-centric evaluation would be stronger with even a small illustrative example drawn from existing literature. We will revise the section to reference prior observations on long-horizon failure modes and to state clearly that executing the proposed benchmarks is future work.

Revision: partial

Circularity Check

0 steps flagged

No circularity; conceptual position paper with no derivations or self-referential reductions

The paper advances a definitional claim that agent performance emerges from harness interactions and that future progress depends equally on system design, but this is presented as an agenda-setting premise rather than a result derived from equations, fitted parameters, or prior self-citations. No load-bearing steps reduce to inputs by construction; the text contains no mathematical content, no predictions from fitted subsets, and no uniqueness theorems or ansatzes imported via citation. The reference implementation is described qualitatively without ablations or scaling curves that could create fitted-input circularity. The argument is self-contained as a framing exercise and does not rely on any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that performance emerges from harness interactions and on the invented framing of the harness itself; no free parameters or external benchmarks are used.

axioms (1)

domain assumption: Agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer.
Invoked in the abstract to argue that model-centric evaluation is inadequate.

invented entities (1)

agent harness (no independent evidence)
purpose: Structured execution layer around a foundation model, treated as a first-class object of design, evaluation, and optimization.
New term introduced to organize the components listed in the abstract; no independent falsifiable handle supplied.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Demonstrating chart-plot: Closing the Last Mile of Academic Chart Generation
cs.HC 2026-06 unverdicted novelty 5.0

chart-plot is an agentic harness using style-aware code generation from venue figures, a LaTeX-aware render-and-revise loop, and structured edit handles to produce top-venue-ready academic charts.

[1] [1]

Remembering more, risking more: Longitudinal safety risks in memory-equipped llm agents, 2026

Ahmad Al-Tawaha, Shangding Gu, Peizhi Niu, Ruoxi Jia, and Ming Jin. Remembering more, risking more: Longitudinal safety risks in memory-equipped llm agents, 2026

2026

[2] [2]

Manage Claude’s Memory (Claude Code Documentation).https://docs.claude

Anthropic. Manage Claude’s Memory (Claude Code Documentation).https://docs.claude. com/en/docs/claude-code/memory. Documents CLAUDE.md instruction files and auto memory. Accessed: 2026-04-18

2026

[3] [3]

Claude Code

Anthropic. Claude Code. https://claude.com/product/claude-code, 2025. Accessed: 2026-04-02

2025

[4] [4]

Effective Context Engineering for AI Agents

Anthropic. Effective Context Engineering for AI Agents. https://www.anthropic.com/ engineering/effective-context-engineering-for-ai-agents , September 2025. Anthropic Engineering Blog. Accessed: 2026-04-18

2025

[5] [5]

Enabling Claude Code to work more autonomously

Anthropic. Enabling Claude Code to work more autonomously. https://www.anthropic. com/news/enabling-claude-code-to-work-more-autonomously , Sep 2025. Accessed: 2026-04-02

2025

[6] [6]

How we built our multi-agent research system

Anthropic. How we built our multi-agent research system. https://www.anthropic.com/ engineering/multi-agent-research-system, Jun 2025. Accessed: 2026-04-02

2025

[7] [7]

Introducing Claude Opus 4.7

Anthropic. Introducing Claude Opus 4.7. https://www.anthropic.com/news/ claude-opus-4-7, April 2026. Accessed: 2026-04-01

2026

[8] [8]

Why do multi- agent llm systems fail?Advances in Neural Information Processing Systems, 38, 2026

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi- agent llm systems fail?Advances in Neural Information Processing Systems, 38, 2026

2026

[9] [9]

Skills in OpenAI API

Emre Okcular. Skills in OpenAI API. https://developers.openai.com/cookbook/examples/skills_in_api, Feb 2026. Accessed: 2026-04-02

2026

[10] [10]

Gemini 3.1 Pro: A smarter model for your most complex tasks

Google. Gemini 3.1 Pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, February 2026. Accessed: 2026-04-01

2026

[11] [11]

Long context, less focus: A scaling gap in llms revealed through privacy and personalization

Shangding Gu. Long context, less focus: A scaling gap in llms revealed through privacy and personalization. arXiv preprint arXiv:2602.15028, 2026

2026

[12] [12]

LLMs Should Express Uncertainty Explicitly

Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, and Javad Lavaei. Llms should express uncertainty explicitly. arXiv preprint arXiv:2604.05306, 2026

2026

[13] [13]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, volume 2024, pages 23247–23275, 2024.

2024

[14] [14]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566, 2024

2024

[15] [15]

Context engineering: memory, compaction, and tool clearing

Isabella He. Context engineering: memory, compaction, and tool clearing. https://platform.claude.com/cookbook/tool-use-context-engineering-context-engineering-tools, March 2026. Accessed: 2026-04-01

2026

[16] [16]

Swe-bench: Can language models resolve real-world github issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? In International Conference on Learning Representations, volume 2024, pages 54107–54157, 2024

2024

[17] [17]

AI agents that matter

Sayash Kapoor, Benedikt Stroebl, Zachary S Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter. arXiv preprint arXiv:2407.01502, 2024

2024

[18] [18]

Using skills to accelerate OSS maintenance

Kazuhiro Sera. Using skills to accelerate OSS maintenance. https://developers.openai.com/blog/skills-agents-sdk, Mar 2026. Accessed: 2026-04-02

2026

[19] [19]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

2020

[20] [20]

Camel: Communicative agents for "mind" exploration of large language model society

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in neural information processing systems, 36:51991–52008, 2023

2023

[21] [21]

Lost in the middle: How language models use long contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics, 12:157–173, 2024

2024

[22] [22]

Agenticpay: A multi-agent llm negotiation system for buyer-seller transactions

Xianyang Liu, Shangding Gu, and Dawn Song. Agenticpay: A multi-agent llm negotiation system for buyer-seller transactions. arXiv preprint arXiv:2602.06008, 2026

2026

[23] [23]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. In International Conference on Learning Representations, volume 2024, pages 52989–53046, 2024

2024

[24] [24]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026

2026

[25] [25]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In International Conference on Learning Representations, volume 2024, pages 9025–9049, 2024

2024

[26] [26]

Code as Agent Harness

Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, Zihao Li, Yuanchen Bei, Jiaru Zou, Mengting Ai, Zhining Liu, Ting-Wei Li, et al. Code as agent harness. arXiv preprint arXiv:2605.18747, 2026

2026

[27] [27]

Introducing GPT-5.4

OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, March 2026. Accessed: 2026-04-01

2026

[28] [28]

Claude Code Understand how to integrate Claude Code into your development workflows with best practices and real-world examples

OWASP GenAI Security Project. Claude Code Understand how to integrate Claude Code into your development workflows with best practices and real-world examples. https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/, February 2025. Accessed: 2026-04-20

2025

[29] [29]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560, 2023.

2023

[30] [30]

Harness engineering: leveraging Codex in an agent-first world

Ryan Lopopolo. Harness engineering: leveraging Codex in an agent-first world. https://openai.com/index/harness-engineering/, Feb 2026. Accessed: 2026-04-02

2026

[31] [31]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023

2023

[32] [32]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023

2023

[33] [33]

Defining and characterizing reward gaming

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022

2022

[34] [34]

Openclaw — personal ai assistant.github, 2026

Openclaw Team. Openclaw — personal ai assistant.github, 2026

2026

[35] [35]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

2023

[36] [36]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022

[37] [37]

A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT

Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382, 2023

2023

[38] [38]

Autogen: Enabling next-gen llm applications via multi-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024

2024

[39] [39]

The rise and potential of large language model based agents: A survey

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101, 2025

2025

[40] [40]

Swe-agent: Agent-computer interfaces enable automated software engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

2024

[41] [41]

Agentic web: Weaving the next web with ai agents

Yingxuan Yang, Mulei Ma, Yuxuan Huang, Huacan Chai, Chenyu Gong, Haoran Geng, Yuanjian Zhou, Ying Wen, Meng Fang, Muhao Chen, et al. Agentic web: Weaving the next web with ai agents. arXiv preprint arXiv:2507.21206, 2025

2025

[42] [42]

Understanding agent scaling in llm-based multi-agent systems via diversity

Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, and Shangding Gu. Understanding agent scaling in llm-based multi-agent systems via diversity. arXiv preprint arXiv:2602.03794, 2026

2026

[43] [43]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

2024

[44] [44]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

2022

[45] [45]

X-mas: Towards building multi-agent systems with heterogeneous llms

Rui Ye, Xiangrui Liu, Qimin Wu, Xianghe Pang, Zhenfei Yin, Lei Bai, and Siheng Chen. X-mas: Towards building multi-agent systems with heterogeneous llms. arXiv preprint arXiv:2505.16997, 2025

2025

[46] [46]

Webarena: A realistic web environment for building autonomous agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, volume 2024, pages 15585–15606, 2024.

2024