Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
Pith reviewed 2026-05-07 16:12 UTC · model grok-4.3
The pith
Three observability pillars let coding-agent harnesses evolve autonomously to beat human designs and transfer across benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentic Harness Engineering turns harness evolution into an autonomous loop by giving every editable component a file-level representation, distilling raw trajectories into a drill-down evidence corpus, and pairing each edit with a self-declared prediction that is checked against later task outcomes. Ten iterations of this loop raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, exceeding the human-designed Codex-CLI harness at 71.9% and the self-evolving baselines. The resulting frozen harness transfers to SWE-bench-verified with 12% fewer tokens than the seed and delivers +5.1 to +10.1 percentage-point gains across three alternate model families on Terminal-Bench 2, while ablations localize the gains to tools, middleware, and long-term memory rather than the system prompt.
What carries the argument
The three matched observability pillars that render harness components as explicit file-level objects, compress trajectories into layered evidence, and enforce prediction-then-verification contracts on every edit.
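Read concretely, the decision-observability pillar amounts to a per-edit record that can be checked mechanically against the next round's outcomes. A minimal sketch, assuming task outcomes arrive as simple pass/fail maps; `EditContract`, its fields, and `verify` are invented here for illustration, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EditContract:
    """One harness edit paired with a falsifiable prediction (illustrative only)."""
    component: str                 # file-level target, e.g. "middleware/context_trim.py"
    diff: str                      # the edit itself
    predicted_fixes: set[str] = field(default_factory=set)   # tasks expected to flip to pass
    predicted_risks: set[str] = field(default_factory=set)   # tasks that might regress

    def verify(self, before: dict[str, bool], after: dict[str, bool]) -> dict[str, float]:
        """Check the self-declared prediction against task-level outcomes."""
        fixed = {t for t in self.predicted_fixes
                 if not before.get(t, False) and after.get(t, False)}
        regressed = {t for t, ok in before.items() if ok and not after.get(t, False)}
        return {
            "fix_hit_rate": len(fixed) / max(len(self.predicted_fixes), 1),
            "unpredicted_regressions": float(len(regressed - self.predicted_risks)),
        }

contract = EditContract(
    component="middleware/context_trim.py",
    diff="...",
    predicted_fixes={"task_a", "task_b"},
    predicted_risks={"task_c"},
)
report = contract.verify(
    before={"task_a": False, "task_b": False, "task_c": True, "task_d": True},
    after={"task_a": True, "task_b": False, "task_c": True, "task_d": False},
)
```

Here the edit delivered half its predicted fixes and caused one unpredicted regression (`task_d`), exactly the kind of attributable signal a next evolution round could act on.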
If this is right
- The evolved harness can be frozen and reused on new tasks without further evolution while still showing gains.
- Performance improvements localize to tools, middleware, and long-term memory components rather than the system prompt.
- Cross-model gains appear on three alternate families, indicating the changes capture reusable engineering patterns.
- Token usage drops on transferred tasks, showing efficiency as a side benefit of the evolved structure.
Where Pith is reading between the lines
- The distinction between transferable structural edits and non-transferable prompt edits suggests future evolution loops should prioritize component and memory changes over prose strategy.
- The same observability approach could be tested on agent harnesses for non-coding domains such as data analysis or web navigation.
- If the pillars scale to larger action spaces, they might reduce the need for human oversight in other automated agent-improvement pipelines.
Load-bearing premise
The three observability pillars give enough structure and signal that the evolution loop produces general improvements rather than noise-driven or benchmark-specific changes.
What would settle it
Apply the final evolved harness to a fresh coding benchmark family outside Terminal-Bench and SWE-bench; if pass rates show no lift over the seed harness, the claim of generalizable evolution would fail.
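The comparison itself is cheap once the harness is frozen: compute pass@1 for the seed and the evolved harness on the new benchmark and inspect the delta. A sketch with hypothetical outcomes; the helper `pass_at_1` and all numbers below are illustrative, not from the paper:

```python
def pass_at_1(outcomes: list[list[bool]]) -> float:
    """pass@1: fraction of tasks whose single attempt succeeds, averaged
    over independent trials (hypothetical helper, not the paper's code)."""
    trials, tasks = len(outcomes), len(outcomes[0])
    return sum(sum(trial) for trial in outcomes) / (trials * tasks)

# Hypothetical results on a fresh benchmark: 3 trials x 5 tasks each.
seed   = [[True, False, False, True, False]] * 3   # 2/5 tasks pass
frozen = [[True, True, False, True, False]] * 3    # 3/5 tasks pass

lift = pass_at_1(frozen) - pass_at_1(seed)
```

In this toy run the lift is 0.2; the generalization claim would fail if the measured lift on a genuinely fresh benchmark family were indistinguishable from zero.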
Original abstract
Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effect is hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses these challenges through three matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting factual harness structure transfers while prose-level strategy does not.
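The experience-observability pillar, as described, reduces to writing a short overview plus per-task drill-down files that an evolving agent can actually consume. A minimal sketch, assuming trajectories arrive as per-task event lists; the file names echo the paper's appendix (`overview.md`, `detail/{task}.md`) but the distillation heuristic here is invented:

```python
import tempfile
from pathlib import Path

def distill(trajectories: dict[str, list[str]], out: Path) -> None:
    """Compress raw per-task trajectories into a layered evidence corpus:
    a one-line-per-task overview plus a drill-down file per task.
    Sketch only: real distillation would summarize far richer traces."""
    detail = out / "detail"
    detail.mkdir(parents=True, exist_ok=True)
    overview = ["# Overview", ""]
    for task, events in trajectories.items():
        errors = [e for e in events if "error" in e.lower()]
        overview.append(f"- {task}: {len(events)} events, {len(errors)} errors")
        # Drill-down layer keeps only the signal-bearing events, not the full trace.
        body = [f"# {task}", ""] + (errors or ["(no errors captured)"])
        (detail / f"{task}.md").write_text("\n".join(body))
    (out / "overview.md").write_text("\n".join(overview))

corpus = Path(tempfile.mkdtemp())
distill({"build": ["compiled ok", "Error: linker failed"], "test": ["all passed"]}, corpus)
```

The layering is the point: the agent reads `overview.md` first and descends into a task's detail file only when a failure pattern warrants it.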
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agentic Harness Engineering (AHE), a closed-loop system for automatically evolving coding-agent harnesses via three observability pillars—component observability (explicit file-level editable components), experience observability (distilled layered evidence from trajectories), and decision observability (self-predicted edits verified against outcomes). It claims that 10 AHE iterations raise pass@1 on Terminal-Bench 2 from 69.7% to 77.0% (surpassing Codex-CLI at 71.9% and baselines ACE/TF-GRPO), with the frozen evolved harness transferring to SWE-bench-verified (higher aggregate success at 12% fewer tokens) and yielding +5.1 to +10.1pp gains across three alternate model families on Terminal-Bench 2; ablations attribute gains to tools/middleware/memory rather than prompts.
Significance. If the central claims hold under rigorous controls, the work would be a meaningful advance in automating harness design for coding agents, a currently manual process. The transfer results without re-evolution and the localization of gains to structural components (rather than prose) suggest the method can produce reusable engineering knowledge. The decision-observability mechanism for turning edits into falsifiable contracts is a conceptual strength that could generalize beyond the reported benchmarks.
major comments (3)
- [Empirical evaluation] Empirical evaluation (results reporting the 7.3pp lift): the +7.3pp pass@1 improvement on Terminal-Bench 2 and the cross-model gains are presented without error bars, the number of independent evolution runs, or statistical significance tests, so it is impossible to determine whether the deltas exceed run-to-run variance of the seed harness.
- [Ablation studies] Ablation studies: gains are localized to tools, middleware, and long-term memory, yet no control condition is reported that applies an equivalent number of edits without the three observability pillars or the self-prediction verification step; without this, the causal contribution of the pillars to generalizable structure remains unproven.
- [Transfer experiments] Transfer experiments: the SWE-bench-verified result is described only as “tops aggregate success at 12% fewer tokens” with no exact success-rate delta, variance, or per-task breakdown, weakening the claim that the evolved harness encodes benchmark-independent engineering experience.
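On the first point, even a crude check would help: given task-level pass counts, a two-proportion test shows how much run-to-run variance a 7.3pp gap must clear. The task count below is hypothetical (the benchmark's size is not stated here), and a paired test such as McNemar's would be more appropriate since both harnesses run the same tasks:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1_hits: int, p2_hits: int, n: int) -> float:
    """Two-sided p-value for H0: equal pass rates across two runs of n tasks.
    Treats tasks as independent samples (a simplification)."""
    p1, p2 = p1_hits / n, p2_hits / n
    pooled = (p1_hits + p2_hits) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = (p2 - p1) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical: seed 69.7% vs evolved 77.0%, rounded onto a 100-task benchmark.
p = two_proportion_z(p1_hits=70, p2_hits=77, n=100)
```

At n = 100 the hypothetical 70-vs-77 split gives p of roughly 0.26, i.e. a single run of that size could not distinguish the lift from noise, which is precisely the referee's concern.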
minor comments (2)
- [Abstract and Methods] The abstract and methods could more explicitly define how an “iteration” is counted and what constitutes a single edit within the closed loop.
- [Introduction and Methodology] Notation for the three pillars would benefit from consistent acronym usage or a summary table to improve readability when referring back to them in later sections.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the empirical evaluation, ablation design, and transfer results. These comments highlight areas where we can improve rigor and clarity. We respond point by point below and commit to revisions.
Point-by-point responses
-
Referee: Empirical evaluation (results reporting the 7.3pp lift): the +7.3pp pass@1 improvement on Terminal-Bench 2 and the cross-model gains are presented without error bars, the number of independent evolution runs, or statistical significance tests, so it is impossible to determine whether the deltas exceed run-to-run variance of the seed harness.
Authors: We acknowledge that the primary results are reported from a single evolution run without error bars or significance tests. Each full AHE iteration incurs substantial compute for trajectory collection and evaluation across the benchmark, which constrained the initial experiments to one run. We will add error bars derived from repeated evaluations of the final harness, report results from one additional independent evolution run, and include a basic statistical comparison in the revised manuscript to address run-to-run variance. revision: yes
-
Referee: Ablation studies: gains are localized to tools, middleware, and long-term memory, yet no control condition is reported that applies an equivalent number of edits without the three observability pillars or the self-prediction verification step; without this, the causal contribution of the pillars to generalizable structure remains unproven.
Authors: We agree that the current ablations, which remove individual pillars, do not fully isolate the contribution of the observability mechanisms from the mere act of performing edits. A control applying an equivalent number of edits without component, experience, and decision observability would strengthen the causal argument. We will add this baseline in the revision, comparing AHE-guided evolution against random or heuristic edits of matching volume, to demonstrate that the pillars are necessary for the observed gains. revision: yes
-
Referee: Transfer experiments: the SWE-bench-verified result is described only as “tops aggregate success at 12% fewer tokens” with no exact success-rate delta, variance, or per-task breakdown, weakening the claim that the evolved harness encodes benchmark-independent engineering experience.
Authors: We will expand the transfer section to report the exact aggregate success rates on SWE-bench-verified for the seed and evolved harness, include variance or confidence intervals, and add a per-task breakdown table. This will quantify the improvement more precisely and better support the interpretation that the evolved components capture reusable engineering knowledge rather than benchmark-specific tuning. revision: yes
Circularity Check
No significant circularity in the AHE derivation
Full rationale
The paper describes an empirical closed-loop evolution process driven by three observability pillars that convert edits into verifiable predictions against task outcomes. Performance claims rest on reported benchmark lifts (Terminal-Bench 2, SWE-bench-verified) and cross-model transfer rather than any equation or definition that reduces to its own inputs by construction. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the load-bearing steps. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or rename known results.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Harness components can be represented at the file level in a way that makes the action space explicit and revertible.
- Domain assumption: Millions of raw trajectory tokens can be distilled into a layered evidence corpus that an evolving agent can consume effectively.
invented entities (3)
- Component observability pillar (no independent evidence)
- Experience observability pillar (no independent evidence)
- Decision observability pillar (no independent evidence)
Reference graph
Works this paper leans on
- [1] Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. Gepa: Reflective prompt evolution can outperform reinforcement learning. In The Fourteenth Internatio... 2025.
- [2] Anomaly. Opencode: The open source coding agent, 2025. URL https://github.com/anomalyco/opencode
- [3] Anthropic. Claude-code, 2025. URL https://github.com/anthropics/claude-code
- [4] Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, and Xing Sun. Training-free group relative policy optimization, October 2025. arXiv preprint arXiv:2510.08191. URL http://arxiv.org/abs/2510.08191
- [5] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. Mle-bench: Evaluating machine learning agents on machine learning engineering. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://ope...
- [6] DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, April ...
- [7] URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
- [8] Xiang Deng, Jeff Da, Edwin Pan, Yannis Y. He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa R. Kundurthy, Sean M. Hendryx, Zifan Wang, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? October 2025. URL...
- [9] Google. Gemini-3-1-flash-lite-model-card, March 2026. URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Flash-Lite-Model-Card.pdf
- [10] Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, and Tao Gui. Critiq: Mining data quality criteria from human preferences. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational...
- [11] Harbor. Terminus-2, 2026. URL https://www.harborframework.com/docs/agents/terminus-2
- [12] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=t9U3LW7JVX
- [13] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=chfJJYC3iL
- [14] Naman Jain, Jaskirat Singh, Manish Shetty, Tianjun Zhang, Liang Zheng, Koushik Sen, and Ion Stoica. R2e-gym: Procedural environment generation and hybrid verifiers for scaling open-weights swe agents. In Second Conference on Language Modeling, August 2025. URL https://openreview.net/forum?id=7evvwwdo3z#discussion
- [15] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, October 2023. URL https://openreview.net/forum?id=VTF8yNQM66
- [16] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines, October 2023. URL http://arxiv.org/abs/2310.03714
- [17] Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, March 2026. URL http://arxiv.org/abs/2603.28052
- [18] Lizhi Lin. Agent debugger: Understanding agent trajectory with agentic workflows - dawning road, February 2026. URL https://dawning-road.github.io/blog/agent-debugger
- [19] Ryan Lopopolo. Harness engineering: Leveraging codex in an agent-first world, February 2026. URL https://openai.com/zh-Hans-CN/index/harness-engineering/
- [20] Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver, April 2026. URL http://arxiv.org/abs/2604.08377
- [21] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-Seventh Conference on Neural Inform... 2023.
- [22] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An... Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026.
- [23] Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. Swe-lancer: Can frontier llms earn $1 million from real-world freelance software engineering? In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=xZXhFg43EI
- [24] Nex-AGI. Nexau (au for agent universe), a general-purpose agent framework for building intelligent agents with tool capabilities, 2025. URL https://github.com/nex-agi/NexAU
- [25] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor... 2025.
- [26] OpenAI. Codex cli, 2025. URL https://developers.openai.com/codex/cli
- [27] OpenAI. Introducing gpt-5.4, March 2026. URL https://openai.com/index/introducing-gpt-5-4/
- [28] Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9...
- [29] Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym. In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=Cq1BNvHx74
- [30] Prithvi Rajasekaran. Harness design for long-running application development, March 2026. URL https://www.anthropic.com/engineering/harness-design-long-running-apps
- [31] Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield, Rafi Ayub, Hannah Moran, Cal Rueb, Connor Jennings, Molly Vorwerck, Stuart Ritchie, and Maggie Vo. Effective context engineering for ai agents, September 2025. URL https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- [32] Nous Research. Hermes agent — the agent that grows with you, 2026. URL https://hermes-agent.nousresearch.com/
- [33] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Thirty-Seventh Conference on Neural Information Processing Systems, November 2023. URL https://openreview.net/forum?id=vAElhFcKW6
- [34] Peter Steinberger. Openclaw — personal ai assistant, February 2026. URL https://openclaw.ai/
- [35] Rich Sutton. The bitter lesson, March 2019. URL https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf
- [36] Kimi Team. Kimi k2.6 tech blog: Advancing open-source coding, April 2026. URL https://www.kimi.com/blog/kimi-k2-6
- [37] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che... 2026.
- [38] Nex-AGI Team, Yuxuan Cai, Lu Chen, Qiaoling Chen, Yuyang Ding, Liwen Fan, Wenjie Fu, Yufei Gao, Honglin Guo, Pinxue Guo, Zhenhua Han, Zhengfu He, Hanglei Hu, Kai Hu, Shengjia Hua, Tianyu Huai, Baodai Huang, Li Ji, Zhen Jiang, Zhikai Lei, Bufan Li, Jiahang Lin, Lizhi Lin, Jinxiu Liu, Shichun Liu, Ziming Liu, Yuchen Ni, Pengfang Qian, Yujiong Shen, Qingyun ...
- [39] Qwen Team. Qwen3.6-plus: Towards real world agents, April 2026. URL https://qwenlm.github.io/blog/qwen3.6/
- [40] Xiaomi MiMo Team. Mimo-v2.5-pro, April 2026. URL https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro
- [41] Vivek Trivedy. Improving deep agents with harness engineering, February 2026. URL https://www.langchain.com/blog/improving-deep-agents-with-harness-engineering
- [42] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, October 2023. URL http://arxiv.org/abs/2305.16291
- [43] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai software developers as generalist agents... 2025.
- [44] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning, February 2026. URL http://arxiv.org/abs/2602.08234
- [45] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... 2025.
- [46] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R. Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, November 2024. URL https://openreview.net/forum?id=mXpq6ut8J3&referrer=%5Bthe%20profi...
- [47] John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains? In The Thirteenth International Conference on Learning Representations, October 2024. URL ...
- [48] Yucheng Zeng, Shupeng Li, Daxiang Dong, Ruijie Xu, Zimo Chen, Liwei Zheng, Yuxuan Li, Zhe Zhou, Haotian Zhao, Lun Tian, Heng Xiao, Tianshu Zhu, Longkun Hao, and Jianmin Wu. Swe-hub: A unified production system for scalable, executable software engineering tasks, February 2026. URL http://arxiv.org/abs/2603.00575
- [49] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. Aflow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=z5uVAKwmjf
- [50] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. In The Fourteenth International Conference on Learning Representations, October 2025. URL...
- [51] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners, December 2024. URL http://arxiv.org/abs/2308.10144
- [52] Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, and Yuchen Eleanor Jiang. Symbolic learning enables self-evolving agents, June 2024. URL http://arxiv.org/abs/2406.18532
- [53] Terry Yue Zhuo, Vu Minh Chien, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, James Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Biny... Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions, 2024.
- [54] Gregor Zunic. The bitter lesson of agent harnesses, April 2026. URL https://browser-use.com/posts/bitter-lesson-agent-harnesses
A. Experimental Setup: Full Details (paper appendix fragment). This appendix expands the condensed Setup in §4.1 with the formal metric definitions and the runtime infrastructure. Seed agent: the seed configuration, denoted NexAU_0, is a simple code agent b...
Appendix excerpts: evolution-agent prompts
Prompt fragments from the paper's appendix for the evolution agent and the framework-internals skill writer. Each proposed harness edit must record four things:
- **Failure evidence** -- which tasks failed, and what specifically went wrong (from analysis reports or traces)
- **Root cause** -- why it failed, not just what failed
- **Targeted fix** -- a change that directly addresses the root cause
- **Predicted impact** -- which tasks this should fix, and which tasks might be at risk
Workflow instructions for the evolution agent:
- Read `evolution_history.md` -- understand what's been tried, what worked, what failed
- Read `runs/iteration_NNN/input/analysis/overview.md` first -- this is the primary information source
- Read `runs/iteration_NNN/input/analysis/detail/{task_name}.md` for tasks needing deeper investigation
- Only fall back to reading raw `nexau_in_memory_tracer.cleaned.json` when analysis is missing or insufficient -- this should be rare
- After creating or modifying middleware, read at least one `agent/nexau.txt` from a failed task -- it contains runtime logs (middleware init errors, warnings, crashes) that static validation cannot catch
- Group failures into **pattern classes** -- each pattern is a class of failures, not individual tasks
- For each pattern, identify the **root cause** and choose the most appropriate fix -- could be prompt, tool, middleware, or any component
- **Architecture check** -- for each failure pattern, consider whether the fix belongs at a different component level; if previous iterations already tried fixing at one level without success, try a different one
- For iteration 2+, evaluate previous changes using the Change Attribution Report: **KEEP** (working, leave as-is), **IMPROVE** (directionally correct, refine), or **ROLLBACK + PIVOT** (not working at this component level; roll back the change, then re-approach the same failure pattern from a different component level)
Instructions for the framework-internals skill writer (targeting `{{ output_skill_dir }}/nexau-framework-internals/SKILL.md`, with the NexAU framework source at `{{ nexau_path }}` as read-only input):
- **How to write middleware** -- base class, hook methods, params, registration, real examples from source
- **How to create tools** -- YAML schema, Python function signature, binding, agent_state injection
- **How to create skills** -- SKILL.md format, frontmatter, registration, loading mechanism
- **How to create sub-agents** -- config schema, registration, invocation, context isolation
- **YAML config schema** -- complete field reference with types, defaults, required/optional
- **Key runtime behaviors** -- only what's needed to write correct components
- Mandatory explore-write-refine cycle: read key files (config dataclasses, the hooks.py base class, existing middleware/tool implementations), write an initial SKILL.md with "[TODO]" placeholders even if incomplete (Phase 1); find real code examples from the source for each section and `write_file` to update SKILL.md immediately after each, in priority order Config -> Middleware -> Tools -> Skills -> Sub-Agents -> Runtime (Phase 2, iterations 16-60); fill remaining "[TODO]" sections and add copy-paste templates (Phase 3, iterations 61-80), then call `complete_task`
- Hard rules: call `write_file` for SKILL.md before iteration 20, no exceptions; update it at least every 15 iterations after that; reaching iteration 100 without calling `write_file` counts as failure; use `read_file` with offset/limit for large files; cite `file:line` references