pith. machine review for the scientific record.

arxiv: 2604.25850 · v3 · submitted 2026-04-28 · 💻 cs.CL · cs.SE

Recognition: unknown

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:12 UTC · model grok-4.3

classification 💻 cs.CL cs.SE
keywords Agentic Harness Engineering · coding agents · observability · automatic evolution · harness engineering · Terminal-Bench · SWE-bench · agent performance

The pith

Three observability pillars let coding-agent harnesses evolve autonomously to beat human designs and transfer across benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agentic Harness Engineering as a closed-loop process that automates improvements to the harnesses mediating how coding models use tools and environments. It addresses the problems of complex edit spaces, buried trajectory signals, and hard-to-attribute changes by creating three matched observability structures that make components explicit, distill experiences into usable layers, and link each edit to a verifiable prediction. A reader would care if this turns harness design from repeated manual tuning into a repeatable, falsifiable process that produces stronger and more reusable agent setups. The reported results include a rise in pass@1 from 69.7% to 77.0% on Terminal-Bench 2, outpacing both a human-designed baseline and prior self-evolving methods, plus successful transfer to another benchmark at lower token cost.
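
To make the "distill experiences into usable layers" step concrete, here is a minimal editorial sketch of a drill-down evidence corpus. The layout (an overview file over per-task digests over raw traces) and all names are our assumptions for illustration, not the paper's pipeline.

```python
# Editorial sketch: distill per-task traces into a layered, drill-down corpus.
# An evolving agent reads overview.md first and opens detail/<task>.md only
# when a failure needs deeper investigation; raw traces stay as a last resort.
from pathlib import Path

def write_evidence(run_dir: Path, traces: dict[str, dict]) -> None:
    detail_dir = run_dir / "analysis" / "detail"
    detail_dir.mkdir(parents=True, exist_ok=True)
    overview = ["# Overview", ""]
    for task, trace in sorted(traces.items()):
        status = "PASS" if trace["passed"] else "FAIL"
        overview.append(f"- {task}: {status} ({trace['steps']} steps)")
        # layer 2: one digest per task, consulted only when needed
        (detail_dir / f"{task}.md").write_text(
            f"# {task}\n\nOutcome: {status}\n\nKey events:\n"
            + "\n".join(f"- {event}" for event in trace["key_events"])
        )
    # layer 1: the compact entry point the evolving agent reads first
    (run_dir / "analysis" / "overview.md").write_text("\n".join(overview))
```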

Core claim

Agentic Harness Engineering turns harness evolution into an autonomous loop by giving every editable component a file-level representation, distilling raw trajectories into a drill-down evidence corpus, and pairing each edit with a self-declared prediction that is checked against later task outcomes. Ten iterations of this loop raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, exceeding the human-designed Codex-CLI harness at 71.9% and the self-evolving baselines. The resulting frozen harness transfers to SWE-bench-verified with 12% fewer tokens than the seed and delivers +5.1 to +10.1 percentage-point gains across three alternate model families on Terminal-Bench 2, while ablations localize the gains to tools, middleware, and long-term memory rather than the system prompt.

What carries the argument

The three matched observability pillars that render harness components as explicit file-level objects, compress trajectories into layered evidence, and enforce prediction-then-verification contracts on every edit.
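
Read together, the pillars suggest a loop whose skeleton can be sketched as follows. Everything here (the Edit dataclass, evolution_loop, and the run/distill/propose callables) is an editorial placeholder under the assumption that a harness is a mapping of component files, not the paper's actual code.

```python
# Editorial sketch of the closed loop the pillars enable. Assumes a harness is
# a mapping from component file paths to file contents (pillar 1), evidence is
# a distilled text corpus (pillar 2), and each edit carries a self-declared
# prediction that the next evaluation round verifies (pillar 3).
from dataclasses import dataclass, field
from typing import Callable

Outcomes = dict[str, bool]  # task name -> passed

@dataclass
class Edit:
    component: str                 # file-level target: explicit and revertible
    new_text: str
    old_text: str
    predicted_fixes: set[str]      # the edit's falsifiable contract
    predicted_risks: set[str] = field(default_factory=set)

def evolution_loop(
    harness: dict[str, str],
    run: Callable[[dict[str, str]], Outcomes],
    distill: Callable[[Outcomes], str],
    propose: Callable[[dict[str, str], str], list[Edit]],
    iterations: int = 10,
) -> dict[str, str]:
    before = run(harness)
    for _ in range(iterations):
        for edit in propose(harness, distill(before)):
            harness[edit.component] = edit.new_text
            after = run(harness)
            fixed = any(not before.get(t, False) and after.get(t, False)
                        for t in edit.predicted_fixes)
            broke = any(before.get(t, False) and not after.get(t, False)
                        for t in edit.predicted_risks)
            if fixed and not broke:
                before = after                           # prediction held: keep edit
            else:
                harness[edit.component] = edit.old_text  # falsified: revert
    return harness
```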

If this is right

  • The evolved harness can be frozen and reused on new tasks without further evolution while still showing gains.
  • Performance improvements localize to tools, middleware, and long-term memory components rather than the system prompt.
  • Cross-model gains appear on three alternate families, indicating the changes capture reusable engineering patterns.
  • Token usage drops on transferred tasks, showing efficiency as a side benefit of the evolved structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The distinction between transferable structural edits and non-transferable prompt edits suggests future evolution loops should prioritize component and memory changes over prose strategy.
  • The same observability approach could be tested on agent harnesses for non-coding domains such as data analysis or web navigation.
  • If the pillars scale to larger action spaces, they might reduce the need for human oversight in other automated agent-improvement pipelines.

Load-bearing premise

The three observability pillars give enough structure and signal that the evolution loop produces general improvements rather than noise-driven or benchmark-specific changes.

What would settle it

Apply the final evolved harness to a fresh coding benchmark family outside Terminal-Bench and SWE-bench; if pass rates show no lift over the seed harness, the claim of generalizable evolution would fail.

Figures

Figures reproduced from arXiv: 2604.25850 by Chengjun Pan, Hang Yan, Jiahang Lin, Lizhi Lin, Shichun Liu, Shihan Dou, Tao Gui, Xuanjing Huang, Zhenhua Han.

Figure 1. AHE evolves a bash-only seed past every human-designed and self-evolving baseline on Terminal-Bench 2. All three role agents share one base model, isolating the gain to harness edits rather than analyzer or editor capability. Harness design materially shifts task completion on long-horizon coding benchmarks, even with the base model held fixed [40, 42], making harness engineering a first-class lever for im…
Figure 2. The AHE pipeline links three observable surfaces into one closed loop. Components, rollout experience, and edit decisions each surface as structured artifacts another agent reads, and every edit becomes a falsifiable prediction the next round verifies. Three observability layers implement this principle. Component observability (§3.1) is realized by a decoupled, file-level harness substrate that maps each …
Figure 3. Cross-model transfer on Terminal-Bench 2, 89 tasks. The AHE workspace evolved on …
Figure 4. Cross-iteration mean precision and recall of the evolve model's self-predictions across 9 …
Figure 5. Three-column trajectory comparison for db-wal-recovery before and after chg-1. Both rollouts share the same random seed and the same first three steps S1 to S3, summarized in the banner above the columns. The left column lists the four divergence steps F1 to F4 of the failing rollout. The middle column lists the four chg-1 rules out of eight that fire on this trajectory, each annotated with the failure ste…
Figure 6. Three-column trajectory comparison for mcmc-sampling-stan before and after the two harness changes shipped at the start of iteration 6: the tool-level publish-state guard chg-1 at commit ff0cf3d and the middleware-level execution-risk hints chg-2 at commit 9651986, whose full manifest entry appears in …
Figure 7. Two change-manifest entries written in iteration 1, one editing the system prompt and one …
Figure 8. The two change-manifest entries written together at the iteration-4 boundary and shipped as …
Figure 9. The two change-manifest entries shipped as the iteration-6 harness.
Figure 10. Two change-manifest entries written together at the iteration-7 boundary and shipped …
Figure 11. Per-round fix predictions. Left: precision. Right: recall. Bars decompose each denominator …
Figure 12. Per-round regression predictions. Left: precision. Right: recall. Same encoding as Fig. …
Original abstract

Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.
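
To make "every edit into a falsifiable contract" concrete: a change-manifest entry in the spirit of Figures 7 to 10 might carry a predicted-impact field, and decision observability then reduces to scoring that prediction against the next round's outcomes. The entry shape and the score_prediction helper below are editorial assumptions, not the paper's schema.

```python
# Hypothetical change-manifest entry (field names echo the root-cause /
# targeted-fix / predicted-impact structure the figures describe, but are our
# assumptions) plus the check that turns it into a falsifiable contract.
entry = {
    "id": "chg-1",
    "component": "middleware/publish_state_guard.py",
    "root_cause": "state is published before validation completes",
    "targeted_fix": "gate publish-state behind a validation check",
    "predicted_fixes": {"db-wal-recovery"},
    "predicted_risks": set(),
}

def score_prediction(entry: dict, before: dict[str, bool], after: dict[str, bool]):
    """Precision/recall of one self-prediction against next-round outcomes."""
    actually_fixed = {t for t in after if after[t] and not before[t]}
    predicted = entry["predicted_fixes"]
    hits = predicted & actually_fixed
    precision = len(hits) / len(predicted) if predicted else 1.0
    recall = len(hits) / len(actually_fixed) if actually_fixed else 1.0
    return precision, recall
```

Figures 4, 11, and 12 report exactly this kind of per-round precision and recall for fix and regression predictions.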

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agentic Harness Engineering (AHE), a closed-loop system for automatically evolving coding-agent harnesses via three observability pillars—component observability (explicit file-level editable components), experience observability (distilled layered evidence from trajectories), and decision observability (self-predicted edits verified against outcomes). It claims that 10 AHE iterations raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0% (surpassing Codex-CLI at 71.9% and baselines ACE/TF-GRPO), with the frozen evolved harness transferring to SWE-bench-verified (higher aggregate success at 12% fewer tokens) and yielding +5.1 to +10.1pp gains across three alternate model families on Terminal-Bench 2; ablations attribute gains to tools/middleware/memory rather than prompts.

Significance. If the central claims hold under rigorous controls, the work would be a meaningful advance in automating harness design for coding agents, a currently manual process. The transfer results without re-evolution and the localization of gains to structural components (rather than prose) suggest the method can produce reusable engineering knowledge. The decision-observability mechanism for turning edits into falsifiable contracts is a conceptual strength that could generalize beyond the reported benchmarks.

major comments (3)
  1. [Empirical evaluation] The +7.3pp pass@1 improvement on Terminal-Bench 2 and the cross-model gains are presented without error bars, the number of independent evolution runs, or statistical significance tests, so it is impossible to determine whether the deltas exceed run-to-run variance of the seed harness.
  2. [Ablation studies] Gains are localized to tools, middleware, and long-term memory, yet no control condition is reported that applies an equivalent number of edits without the three observability pillars or the self-prediction verification step; without this, the causal contribution of the pillars to generalizable structure remains unproven.
  3. [Transfer experiments] The SWE-bench-verified result is described only as “tops aggregate success at 12% fewer tokens”, with no exact success-rate delta, variance, or per-task breakdown, weakening the claim that the evolved harness encodes benchmark-independent engineering experience.
minor comments (2)
  1. [Abstract and Methods] The abstract and methods could more explicitly define how an “iteration” is counted and what constitutes a single edit within the closed loop.
  2. [Introduction and Methodology] Notation for the three pillars would benefit from consistent acronym usage or a summary table to improve readability when referring back to them in later sections.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical evaluation, ablation design, and transfer results. These comments highlight areas where we can improve rigor and clarity. We respond point by point below and commit to revisions.

Point-by-point responses
  1. Referee: The +7.3pp pass@1 improvement on Terminal-Bench 2 and the cross-model gains are presented without error bars, the number of independent evolution runs, or statistical significance tests, so it is impossible to determine whether the deltas exceed run-to-run variance of the seed harness.

    Authors: We acknowledge that the primary results are reported from a single evolution run without error bars or significance tests. Each full AHE iteration incurs substantial compute for trajectory collection and evaluation across the benchmark, which constrained the initial experiments to one run. We will add error bars derived from repeated evaluations of the final harness, report results from one additional independent evolution run, and include a basic statistical comparison (a sketch of one suitable test follows these responses) in the revised manuscript to address run-to-run variance. revision: yes

  2. Referee: Gains are localized to tools, middleware, and long-term memory, yet no control condition is reported that applies an equivalent number of edits without the three observability pillars or the self-prediction verification step; without this, the causal contribution of the pillars to generalizable structure remains unproven.

    Authors: We agree that the current ablations, which remove individual pillars, do not fully isolate the contribution of the observability mechanisms from the mere act of performing edits. A control applying an equivalent number of edits without component, experience, and decision observability would strengthen the causal argument. We will add this baseline in the revision, comparing AHE-guided evolution against random or heuristic edits of matching volume, to demonstrate that the pillars are necessary for the observed gains. revision: yes

  3. Referee: The SWE-bench-verified result is described only as “tops aggregate success at 12% fewer tokens”, with no exact success-rate delta, variance, or per-task breakdown, weakening the claim that the evolved harness encodes benchmark-independent engineering experience.

    Authors: We will expand the transfer section to report the exact aggregate success rates on SWE-bench-verified for the seed and evolved harness, include variance or confidence intervals, and add a per-task breakdown table. This will quantify the improvement more precisely and better support the interpretation that the evolved components capture reusable engineering knowledge rather than benchmark-specific tuning. revision: yes
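
One concrete way to supply the uncertainty estimate discussed in response 1 is a paired bootstrap over tasks. The sketch below is an editorial illustration, not the authors' analysis; seed_runs and evolved are assumed per-task 0/1 outcome lists aligned by task.

```python
# Paired bootstrap over tasks for a pass@1 delta -- a generic recipe, not the
# paper's. seed_runs[i] and evolved[i] are 0/1 outcomes of the same task i
# under the seed harness and the evolved harness.
import random

def bootstrap_delta(seed_runs: list[int], evolved: list[int],
                    n_boot: int = 10_000, rng_seed: int = 0):
    rng = random.Random(rng_seed)
    n = len(seed_runs)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample tasks with replacement
        deltas.append(sum(evolved[i] - seed_runs[i] for i in idx) / n)
    deltas.sort()
    ci = (deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)])
    p_le_zero = sum(d <= 0 for d in deltas) / n_boot  # tail mass at or below zero
    return ci, p_le_zero
```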

Circularity Check

0 steps flagged

No significant circularity in the AHE derivation

Full rationale

The paper describes an empirical closed-loop evolution process driven by three observability pillars that convert edits into verifiable predictions against task outcomes. Performance claims rest on reported benchmark lifts (Terminal-Bench 2, SWE-bench-verified) and cross-model transfer rather than any equation or definition that reduces to its own inputs by construction. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the load-bearing steps. The method is validated against external benchmarks and does not invoke uniqueness theorems or rename known results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The approach rests on the assumption that harness components admit clean file-level representations and that distilled trajectory summaries contain sufficient signal for autonomous decision-making; no explicit free parameters are stated, and the three invented entities are conceptual constructs rather than physical ones.

axioms (2)
  • domain assumption Harness components can be represented at file level in a way that makes the action space explicit and revertible.
    Invoked in the description of component observability pillar.
  • domain assumption Millions of raw trajectory tokens can be distilled into a layered evidence corpus that an evolving agent can consume effectively.
    Core premise of experience observability.
invented entities (3)
  • Component observability pillar no independent evidence
    purpose: Makes every editable harness component a file-level representation for explicit and revertible actions.
    New conceptual construct introduced to address heterogeneous action space.
  • Experience observability pillar no independent evidence
    purpose: Distills voluminous trajectories into drill-down evidence corpus.
    New conceptual construct to handle signal burial in trajectories.
  • Decision observability pillar no independent evidence
    purpose: Pairs every edit with a self-declared prediction verified against outcomes.
    New conceptual construct to turn edits into falsifiable contracts.

pith-pipeline@v0.9.0 · 5644 in / 1729 out tokens · 71405 ms · 2026-05-07T16:12:58.518385+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

117 extracted references · 16 canonical work pages · 10 internal anchors

  1. [1] Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. In The Fourteenth Internatio...

  2. [2] Anomaly. Opencode: The open source coding agent, 2025. URL https://github.com/anomalyco/opencode

  3. [3] Anthropic. Claude-code, 2025. URL https://github.com/anthropics/claude-code

  4. [4] Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, and Xing Sun. Training-free group relative policy optimization, October 2025. URL http://arxiv.org/abs/2510.08191

  5. [5] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://ope...

  6. [6] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, April ...

  7. [7] URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

  8. [8] Xiang Deng, Jeff Da, Edwin Pan, Yannis Y. He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa R. Kundurthy, Sean M. Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-bench Pro: Can AI agents solve long-horizon software engineering tasks? October 2025. URL...

  9. [9] Google. Gemini-3-1-Flash-Lite model card, March 2026. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Flash-Lite-Model-Card.pdf

  10. [10] Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, and Tao Gui. CritiQ: Mining data quality criteria from human preferences. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational...

  11. [11] Harbor. Terminus-2, 2026. URL https://www.harborframework.com/docs/agents/terminus-2

  12. [12] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=t9U3LW7JVX

  13. [13] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination-free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=chfJJYC3iL

  14. [14] Naman Jain, Jaskirat Singh, Manish Shetty, Tianjun Zhang, Liang Zheng, Koushik Sen, and Ion Stoica. R2E-Gym: Procedural environment generation and hybrid verifiers for scaling open-weights SWE agents. In Second Conference on Language Modeling, August 2025. URL https://openreview.net/forum?id=7evvwwdo3z#discussion

  15. [15] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, October 2023. URL https://openreview.net/forum?id=VTF8yNQM66

  16. [16] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines, October 2023. URL http://arxiv.org/abs/2310.03714

  17. [17] Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-Harness: End-to-end optimization of model harnesses, March 2026. URL http://arxiv.org/abs/2603.28052

  18. [18] Lizhi Lin. Agent debugger: Understanding agent trajectory with agentic workflows - Dawning Road, February 2026. URL https://dawning-road.github.io/blog/agent-debugger

  19. [19] Ryan Lopopolo. Harness engineering: Leveraging Codex in an agent-first world, February 2026. URL https://openai.com/zh-Hans-CN/index/harness-engineering/

  20. [20] Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. SkillClaw: Let skills evolve collectively with agentic evolver, April 2026. URL http://arxiv.org/abs/2604.08377

  21. [21] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-Seventh Conference on Neural Inform...

  22. [22] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An... Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces.

  23. [23] Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. SWE-Lancer: Can frontier LLMs earn $1 million from real-world freelance software engineering? In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=xZXhFg43EI

  24. [24] Nex-AGI. NexAU (AU for Agent Universe), a general-purpose agent framework for building intelligent agents with tool capabilities, 2025. URL https://github.com/nex-agi/NexAU

  25. [25] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algor...

  26. [26] OpenAI. Codex CLI, 2025. URL https://developers.openai.com/codex/cli

  27. [27] OpenAI. Introducing GPT-5.4, March 2026. URL https://openai.com/index/introducing-gpt-5-4/

  28. [28] Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9...

  29. [29] Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with SWE-Gym. In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=Cq1BNvHx74

  30. [30] Prithvi Rajasekaran. Harness design for long-running application development, March 2026. URL https://www.anthropic.com/engineering/harness-design-long-running-apps

  31. [31] Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield, Rafi Ayub, Hannah Moran, Cal Rueb, Connor Jennings, Molly Vorwerck, Stuart Ritchie, and Maggie Vo. Effective context engineering for AI agents, September 2025. URL https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

  32. [32] Nous Research. Hermes agent — the agent that grows with you, 2026. URL https://hermes-agent.nousresearch.com/

  33. [33] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Thirty-Seventh Conference on Neural Information Processing Systems, November 2023. URL https://openreview.net/forum?id=vAElhFcKW6

  34. [34] Peter Steinberger. OpenClaw — personal AI assistant, February 2026. URL https://openclaw.ai/

  35. [35] Rich Sutton. The bitter lesson, March 2019. URL https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf

  36. [36] Kimi Team. Kimi K2.6 tech blog: Advancing open-source coding, April 2026. URL https://www.kimi.com/blog/kimi-k2-6

  37. [37] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che...

  38. [38] Nex-AGI Team, Yuxuan Cai, Lu Chen, Qiaoling Chen, Yuyang Ding, Liwen Fan, Wenjie Fu, Yufei Gao, Honglin Guo, Pinxue Guo, Zhenhua Han, Zhengfu He, Hanglei Hu, Kai Hu, Shengjia Hua, Tianyu Huai, Baodai Huang, Li Ji, Zhen Jiang, Zhikai Lei, Bufan Li, Jiahang Lin, Lizhi Lin, Jinxiu Liu, Shichun Liu, Ziming Liu, Yuchen Ni, Pengfang Qian, Yujiong Shen, Qingyun ... Nex-N1: Agentic models trained via a unified ecosystem for large-scale environment construction, December 2025.

  39. [39] Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026. URL https://qwenlm.github.io/blog/qwen3.6/

  40. [40] Xiaomi MiMo Team. MiMo-V2.5-Pro, April 2026. URL https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro

  41. [41] Vivek Trivedy. Improving deep agents with harness engineering, February 2026. URL https://www.langchain.com/blog/improving-deep-agents-with-harness-engineering

  42. [42] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, October 2023. URL http://arxiv.org/abs/2305.16291

  43. [43] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI soft...

  44. [44] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning, February 2026. URL http://arxiv.org/abs/2602.08234

  45. [45] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 technical report.

  46. [46] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, November 2024. URL https://openreview.net/forum?id=mXpq6ut8J3&referrer=%5Bthe%20profi...

  47. [47] John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. SWE-bench Multimodal: Do AI systems generalize to visual software domains? In The Thirteenth International Conference on Learning Representations, October 2024. URL ...

  48. [48] Yucheng Zeng, Shupeng Li, Daxiang Dong, Ruijie Xu, Zimo Chen, Liwei Zheng, Yuxuan Li, Zhe Zhou, Haotian Zhao, Lun Tian, Heng Xiao, Tianshu Zhu, Longkun Hao, and Jianmin Wu. SWE-Hub: A unified production system for scalable, executable software engineering tasks, February 2026. URL http://arxiv.org/abs/2603.00575

  49. [49] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=z5uVAKwmjf

  50. [50] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. In The Fourteenth International Conference on Learning Representations, October 2025. URL...

  51. [51] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners, December 2024. URL http://arxiv.org/abs/2308.10144

  52. [52] Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, and Yuchen Eleanor Jiang. Symbolic learning enables self-evolving agents, June 2024. URL http://arxiv.org/abs/2406.18532

  53. [53] Terry Yue Zhuo, Vu Minh Chien, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Biny... BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions.

  54. [54] Gregor Zunic. The bitter lesson of agent harnesses, April 2026. URL https://browser-use.com/posts/bitter-lesson-agent-harnesses


Showing the first 54 of 117 extracted references.