Code as Agent Harness
Pith reviewed 2026-05-20 10:51 UTC · model grok-4.3
The pith
Code serves as the harness that turns large language models into executable, verifiable, and stateful AI agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems. Code now supports agent reasoning, acting, environment modeling, and execution-based verification. The survey examines the harness interface, mechanisms for long-horizon execution, and scaling to multi-agent settings where shared code artifacts enable coordination and review.
What carries the argument
Code as agent harness: the unified view that centers code as the basis for agent infrastructure, organized across the three layers of interface, mechanisms, and multi-agent scaling.
If this is right
- Applications in GUI/OS automation and embodied agents gain reliability through code-based execution and feedback control.
- Multi-agent systems achieve consistent shared state and verification via shared code artifacts.
- Evaluation of agents must move beyond final task success to include verification under incomplete feedback.
- Harness improvements can be made regression-free while supporting human oversight for safety-critical actions.
- The same harness structure extends to scientific discovery, personalization, DevOps, and enterprise workflows.
Where Pith is reading between the lines
- This structure implies that agent benchmarks should incorporate metrics for code executability and state consistency over time.
- Designers could test whether code-centric harnesses reduce error accumulation in long-horizon tasks compared with purely language-based approaches.
- The three-layer model suggests straightforward extensions to multimodal environments where code still manages execution and verification.
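The first point above, metrics for code executability and state consistency over time, can be sketched in a few lines. This is only an illustration under stated assumptions: the `Step` record, its `expected_state` field, and both metric names are hypothetical and are not defined by the paper or by any existing benchmark.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    code: str                                            # code the agent emitted at this step
    expected_state: dict = field(default_factory=dict)   # hypothetical: state the agent claims afterwards


def run_episode(steps):
    """Run every step in one shared namespace and score the episode.

    executability_rate: fraction of steps whose code ran without raising.
    state_consistency_rate: fraction of steps that ran AND left the namespace
    in agreement with the step's claimed state.
    """
    namespace = {}          # shared state that persists across steps
    executed = 0
    consistent = 0
    for step in steps:
        try:
            # NOTE: exec is unsafe on untrusted code; a real harness would sandbox this.
            exec(step.code, namespace)
        except Exception:
            continue        # a failing step counts against both metrics
        executed += 1
        if all(namespace.get(k) == v for k, v in step.expected_state.items()):
            consistent += 1
    n = len(steps)
    return {
        "executability_rate": executed / n if n else 0.0,
        "state_consistency_rate": consistent / n if n else 0.0,
    }
```

A step that raises halfway through may still have mutated the shared namespace; a stricter version would snapshot and roll back state, which this sketch deliberately leaves out.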
Load-bearing premise
Organizing the literature on code-based agent systems into the three specific layers of harness interface, mechanisms, and multi-agent scaling captures the essential structure without significant omissions or the need for additional dimensions.
What would settle it
A review that identifies a substantial set of code-enabled agent methods or applications that cannot be placed into any of the three layers would falsify the completeness of the proposed organization.
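The falsification test described here amounts to a coverage check over a labelled catalogue of methods. The sketch below assumes a hypothetical input format (method name mapped to a list of layer tags) and hypothetical layer identifiers; the paper itself prescribes no such procedure.

```python
# Hypothetical identifiers for the survey's three layers.
LAYERS = {"interface", "mechanisms", "multi_agent_scaling"}


def unplaced(methods):
    """Return the names of methods that fit none of the three layers.

    `methods` maps a method name to the list of tags a reviewer assigned to it.
    A method counts as placed if at least one tag is a known layer.
    """
    return sorted(
        name for name, tags in methods.items()
        if not (set(tags) & LAYERS)
    )
```

A substantial non-empty result on a representative catalogue would count against the completeness of the taxonomy; an empty one would not prove completeness, only fail to refute it.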
Original abstract
Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.
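To make the abstract's "execution-based verification" and "feedback-driven control" concrete, here is a minimal sketch of a harness loop: propose code, execute it, verify the result, and feed any failure back into the next proposal. Every name in it (`run_harness`, `stub_propose`, `check_add`) is hypothetical, and the proposer is a hard-coded stand-in for a model call, not an API from the paper or any library.

```python
from typing import Callable, Optional


def run_harness(propose: Callable[[Optional[str]], str],
                verify: Callable[[dict], bool],
                max_attempts: int = 3) -> dict:
    """Propose -> execute -> verify loop that records every attempt."""
    state = {"attempts": [], "solved": False}   # persistent harness state
    feedback = None
    for _ in range(max_attempts):
        code = propose(feedback)
        namespace = {}
        try:
            # NOTE: exec is unsafe on untrusted code; a real harness would sandbox this.
            exec(code, namespace)
            ok = verify(namespace)
            feedback = None if ok else "verification failed"
        except Exception as exc:
            ok = False
            feedback = f"{type(exc).__name__}: {exc}"
        state["attempts"].append({"code": code, "ok": ok, "feedback": feedback})
        if ok:
            state["solved"] = True
            break
    return state


def stub_propose(feedback: Optional[str]) -> str:
    # Stand-in for a model call: a buggy draft first, a fix once feedback arrives.
    if feedback is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


def check_add(namespace: dict) -> bool:
    # Execution-based check on the code the proposer produced.
    return namespace["add"](2, 3) == 5


result = run_harness(stub_propose, check_add)
```

With the stub above, the first attempt fails verification and the second succeeds, so `result["attempts"]` holds two records. The attempt log is the "stateful" part: it is the kind of artifact that another agent or a human reviewer could inspect.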
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys recent work on LLMs and agentic systems, framing code not merely as generated output but as an operational 'agent harness' substrate for reasoning, acting, environment modeling, execution-based verification, and state management. It organizes the literature into three layers—harness interface (reasoning/action/environment), harness mechanisms (planning/memory/tool use/feedback/optimization), and multi-agent scaling (coordination/review/verification via shared code)—while summarizing applications across coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows, and listing open challenges such as evaluation beyond task success, verification under incomplete feedback, and safety oversight.
Significance. If the three-layer taxonomy holds as a coherent organizing principle, the survey supplies a useful roadmap that connects LLM code capabilities with agent infrastructure, highlighting executable and verifiable systems. The paper draws on external prior work across domains rather than self-referential results, and its explicit listing of applications and challenges provides a concrete synthesis that could help researchers identify gaps in stateful, multi-agent code harnesses.
major comments (1)
- [Introduction / survey organization] Introduction and survey organization: The central claim that the three layers deliver a 'unified view' and 'unified roadmap' rests on the premise that this structure comprehensively captures code-based agent systems. However, the manuscript provides no explicit justification, comparative mapping, or discussion of why alternative dimensions (e.g., safety constraints, evaluation protocols, or regression testing) are subsumed within the layers rather than treated as orthogonal; the challenges section lists several of these topics separately without showing integration.
minor comments (2)
- [Abstract] Abstract and early sections: The phrase 'code as agent harness' is introduced as a new framing but would benefit from a concise contrast with related terms such as 'agent frameworks' or 'tool-augmented agents' to clarify novelty for readers.
- [Applications sections] Applications summary: When enumerating domains (coding assistants, embodied agents, etc.), a short table or bullet list with one representative citation per domain would improve scannability and allow readers to trace the claimed coverage.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our survey and the constructive feedback recommending minor revision. We address the major comment point by point below.
Point-by-point responses
Referee: [Introduction / survey organization] Introduction and survey organization: The central claim that the three layers deliver a 'unified view' and 'unified roadmap' rests on the premise that this structure comprehensively captures code-based agent systems. However, the manuscript provides no explicit justification, comparative mapping, or discussion of why alternative dimensions (e.g., safety constraints, evaluation protocols, or regression testing) are subsumed within the layers rather than treated as orthogonal; the challenges section lists several of these topics separately without showing integration.
Authors: We thank the referee for this observation. Our three-layer taxonomy is motivated by the distinct functional roles code plays as an agent harness: the interface layer captures how code connects agents to reasoning, action, and environment modeling; the mechanisms layer addresses the operational components (planning, memory, tool use, feedback, and optimization) that enable reliable long-horizon execution; and the multi-agent scaling layer examines how shared code artifacts support coordination, review, and verification. This decomposition provides a natural progression from foundational capabilities to complex systems. Dimensions such as safety constraints, evaluation protocols, and regression testing are treated as cross-cutting concerns that appear within the layers (e.g., verification and feedback mechanisms in layer 2, human oversight and consistent state in layer 3) and are synthesized in the challenges section. We acknowledge, however, that the introduction does not explicitly justify this choice, provide a comparative mapping to alternative organizations, or demonstrate integration of the challenges back into the layers. In the revised manuscript we will add a short subsection in the introduction that (1) states the rationale for the taxonomy, (2) briefly contrasts it with orthogonal alternatives, and (3) clarifies how the listed challenges connect to and are addressed across the three layers. This addition will strengthen the claims of a unified view and roadmap.
Revision: yes
Circularity Check
No circularity: survey organizes external literature without self-referential derivations
Full rationale
This paper is a literature survey that introduces a three-layer organizational perspective (harness interface, mechanisms, multi-agent scaling) to structure existing work on code-based agent systems. No equations, predictions, or derivations are present that could reduce to fitted inputs or self-definitions by construction. The framing is explicitly presented as a viewpoint for summarizing representative methods and applications drawn from prior external research, with open challenges listed separately. The central claim of a unified roadmap rests on this organizational synthesis rather than any load-bearing self-citation chain or ansatz that loops back to the paper's own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Code can serve as an effective operational substrate for agent reasoning, acting, environment modeling, and execution-based verification in LLM-based systems.
invented entities (1)
- Agent harness: no independent evidence