
arxiv: 2606.05238 · v1 · pith:N7JJ5YQ5new · submitted 2026-06-03 · 💻 cs.SE

DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

Pith reviewed 2026-06-28 05:32 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM agents · research artifact deployment · environment setup · benchmark · software engineering · task completion · agent evaluation · deployment tasks

The pith

LLM agents achieve pass rates of 7.8 to 51.0 percent when deploying research artifacts across 51 tasks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeployBench, a benchmark of 51 tasks that require LLM agents to set up runnable environments for research artifacts released with published papers in AI/ML, computer systems, and scientific computing. Across the task set, agents must handle multi-language toolchains, system-level dependencies such as GPU/CUDA and kernel configurations, and legacy artifact compatibility. Success is defined by a hidden pipeline that runs the paper's designated experiment and validates its outputs. Evaluating four state-of-the-art LLMs through the OpenHands framework yields pass rates from 7.8% to 51.0%. The dominant failure mode is the agent terminating after its own checks confirm a different or weaker outcome than the task requires, which accounts for 97 of 154 failures. The benchmark therefore measures the distance between current agent capabilities and autonomous research deployment.

Core claim

DeployBench consists of 51 research-artifact deployment tasks verified by hidden pipelines that execute each paper's designated experiment and check its outputs. When four state-of-the-art LLMs are evaluated with OpenHands, pass rates range from 7.8 percent to 51.0 percent. Failures are dominated by a completion-judgment problem in which 97 of 154 cases are agent-terminated self-stops that validate a different or weaker target than the paper-specific task requires.
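
The headline figures can be sanity-checked with simple arithmetic. The sketch below assumes each rate is computed over all 51 tasks with one run per model, so the endpoints would correspond to 4 and 26 passing tasks; those counts are inferred from the percentages and are not stated in the abstract.

```python
TASKS = 51  # benchmark size reported in the abstract


def pass_rate(passed: int, total: int = TASKS) -> float:
    """Percentage of tasks passed, rounded to one decimal place."""
    return round(100 * passed / total, 1)


# Assumed pass counts, inferred from the reported endpoints.
lowest = pass_rate(4)    # 4 of 51 tasks
highest = pass_rate(26)  # 26 of 51 tasks

# Share of failures that are agent-terminated self-stops (97 of 154).
self_stop_share = round(100 * 97 / 154, 1)

print(lowest, highest, self_stop_share)  # 7.8 51.0 63.0
```

Under the same single-run assumption, four models on 51 tasks give 204 runs, which would be consistent with 154 failures and 50 passes overall.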

What carries the argument

The hidden verification pipeline that executes the paper's designated experiment and checks its outputs to determine whether deployment succeeded.
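
As an illustration only, a verifier of this kind might look like the sketch below. The paper's actual pipeline is not public, so the function name `verify`, the output file `result.txt`, and the substring check are all hypothetical stand-ins for "run the designated experiment, then check its outputs".

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verify(workdir: Path, command: list[str], expected_file: str,
           must_contain: str, timeout_s: int = 600) -> bool:
    """Run the designated experiment, then check the output it should produce."""
    try:
        result = subprocess.run(command, cwd=workdir, capture_output=True,
                                text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return False  # experiment hung or could not be started
    if result.returncode != 0:
        return False  # experiment crashed
    out = workdir / expected_file
    return out.is_file() and must_contain in out.read_text()


# Hypothetical demo: an import-only smoke test versus a run that writes results.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    import_only = [sys.executable, "-c", "import json"]
    full_run = [sys.executable, "-c",
                "open('result.txt', 'w').write('accuracy=0.91')"]
    weak_ok = verify(work, import_only, "result.txt", "accuracy=")
    full_ok = verify(work, full_run, "result.txt", "accuracy=")

print(weak_ok, full_ok)  # False True
```

The demo mirrors the failure mode the paper describes: a check that exits cleanly but validates a weaker target (here, a bare import) does not satisfy a verifier that looks for the experiment's actual output.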

If this is right

  • Agents must develop more precise pre-termination checks that align with paper-specific experimental requirements rather than weaker internal criteria.
  • Successful deployment requires managing system-level dependencies such as GPU and CUDA configurations in addition to code-level setup.
  • Pass rates remain low even for current leading models, indicating that autonomous research-artifact deployment is not yet reliable across the tested domains.
  • The benchmark supplies a concrete testbed that can track progress as agent judgment and environment-handling capabilities improve.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the judgment failures can be reduced, overall success rates on similar deployment tasks could increase markedly.
  • The same self-stop pattern may limit agent performance on other multi-step benchmarks that involve hidden or paper-specific success criteria.
  • Adding tasks from additional research domains would test whether the observed failure distribution holds beyond the current 51 tasks.

Load-bearing premise

The hidden pipelines accurately reproduce the original papers' intended experiments and produce correct pass/fail signals.

What would settle it

Running each hidden pipeline on an artifact that has been manually deployed according to the original paper and confirming whether the pipeline accepts or rejects that deployment.
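
A minimal sketch of how such a check could be tallied follows. The record type, task identifiers, and verdicts are hypothetical placeholders, not data from the paper; the point is only that any rejection of a known-good manual deployment flags a pipeline that may be producing false failures.

```python
from dataclasses import dataclass


@dataclass
class ReferenceCheck:
    task_id: str
    pipeline_accepts: bool  # verdict on a manually deployed, known-good artifact


def false_rejects(checks: list[ReferenceCheck]) -> list[str]:
    """Tasks whose pipeline rejects a deployment that is known to be correct."""
    return [c.task_id for c in checks if not c.pipeline_accepts]


# Hypothetical placeholder verdicts, for illustration only.
checks = [
    ReferenceCheck("task-a", True),
    ReferenceCheck("task-b", False),
    ReferenceCheck("task-c", True),
]
rejected = false_rejects(checks)
print(rejected)  # ['task-b']
```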

Original abstract

LLM agents have made rapid progress on software engineering and ML research tasks, but these advances often assume access to a working runnable environment. For research artifacts released alongside published papers, setting up such an environment from a fresh machine remains a major bottleneck. Existing environment setup benchmarks do not cover the full scope of research artifact deployment, which involves multi-language toolchains, system-level dependencies beyond containers (e.g. GPU/CUDA and kernel configurations), and legacy artifact compatibility. We introduce DeployBench, a multi-domain benchmark of 51 research-artifact deployment tasks spanning AI/ML, computer systems, and scientific computing, covering all these dimensions. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs. Evaluating four state-of-the-art LLMs with OpenHands yields pass rates from 7.8% to 51.0%. Failures are dominated by a completion-judgment problem: 97 of 154 are agent-terminated self-stops, where the agent's pre-finish checks validate a different or weaker target than the paper-specific task requires. DeployBench highlights the gap between current agents and autonomous deployment, and offers a realistic testbed for scientific research agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces DeployBench, a benchmark of 51 research-artifact deployment tasks spanning AI/ML, computer systems, and scientific computing. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs. Evaluating four state-of-the-art LLMs with OpenHands yields pass rates from 7.8% to 51.0%. Failures are dominated by a completion-judgment problem: 97 of 154 failures are agent-terminated self-stops in which the agent validates a different or weaker target than the task requires.

Significance. If the verification pipeline faithfully reproduces the original papers' experiments and the task set is representative, DeployBench would offer a valuable, realistic testbed for assessing LLM agents on complex, multi-language deployment scenarios that existing benchmarks overlook. The reported pass rates and failure-mode breakdown could usefully highlight gaps in current agent capabilities for autonomous research-artifact setup.

major comments (3)
  1. [Abstract] All numeric results (pass rates of 7.8% to 51.0%, 97 of 154 failures being self-stops) are defined relative to judgments from an uninspectable hidden verification pipeline. No description, code, example traces, or implementation details of this pipeline are supplied, so the grounding of the central claims cannot be assessed or reproduced.
  2. [Abstract] The manuscript provides no details on task selection criteria, inclusion/exclusion rules, or domain-balance controls for the 51 tasks. Without these, it is impossible to evaluate whether the benchmark fairly represents the space of research-artifact deployment or whether the reported performance gap is an artifact of task curation.
  3. [Abstract] No information is given on the number of independent runs per task, statistical significance testing of the pass rates, or controls for agent-prompting variations. These omissions make it difficult to determine whether the observed differences across the four LLMs are robust.
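
To make the third comment concrete, the sketch below computes a 95% Wilson score interval for a single-run pass rate over 51 tasks. It is an illustration of the kind of uncertainty estimate the referee asks for, not an analysis from the paper; the pass counts 4 and 26 are inferred from the reported 7.8% and 51.0% and are assumptions.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Assumed pass counts for the weakest and strongest model (inferred, not reported).
low_lo, low_hi = wilson_interval(4, 51)
high_lo, high_hi = wilson_interval(26, 51)

print(f"weakest model:   {low_lo:.3f} to {low_hi:.3f}")
print(f"strongest model: {high_lo:.3f} to {high_hi:.3f}")
```

Under these assumptions the two endpoint intervals do not overlap, so the gap between the weakest and strongest model is unlikely to be a sampling artifact of the task set; the same cannot be concluded for the two intermediate models without their counts, and task-level sampling error is separate from run-to-run variance, which single runs cannot measure.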

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and indicate planned revisions to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [Abstract] All numeric results (pass rates of 7.8% to 51.0%, 97 of 154 failures being self-stops) are defined relative to judgments from an uninspectable hidden verification pipeline. No description, code, example traces, or implementation details of this pipeline are supplied, so the grounding of the central claims cannot be assessed or reproduced.

    Authors: We agree that the manuscript provides insufficient detail on the verification pipeline. The pipeline is kept hidden during agent runs to prevent exploitation of verification logic, but this choice limits external assessment. In revision we will add a methods subsection describing the pipeline architecture, the execution of each paper's designated experiment, output-checking logic, and example verification scripts for three representative tasks (one per domain). We will also outline a controlled-release process for the full pipeline to qualified researchers. These additions will ground the reported pass rates without exposing the benchmark to contamination. revision: yes

  2. Referee: [Abstract] The manuscript provides no details on task selection criteria, inclusion/exclusion rules, or domain-balance controls for the 51 tasks. Without these, it is impossible to evaluate whether the benchmark fairly represents the space of research-artifact deployment or whether the reported performance gap is an artifact of task curation.

    Authors: We acknowledge the omission of explicit selection criteria. The 51 tasks were assembled to span AI/ML, systems, and scientific computing while covering multi-language toolchains, non-container dependencies, and legacy compatibility. In the revision we will insert a dedicated subsection that states the sourcing process, inclusion/exclusion rules (e.g., public GitHub artifacts with runnable experiments, post-2020 papers), and domain-balance targets. This will allow readers to judge representativeness. revision: yes

  3. Referee: [Abstract] No information is given on the number of independent runs per task, statistical significance testing of the pass rates, or controls for agent-prompting variations. These omissions make it difficult to determine whether the observed differences across the four LLMs are robust.

    Authors: The current manuscript reports single-run results per model–agent pair, driven by compute cost, and does not include statistical tests or prompting-ablation controls. We will revise the evaluation section to document the exact protocol, note the single-run limitation explicitly, and add a limitations paragraph discussing robustness. Where feasible we will report any repeated runs performed during development; full multi-run statistics and prompting controls are planned for a follow-up release but cannot be retrofitted to the existing data without new experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity; benchmark results are direct empirical measurements

Full rationale

The paper introduces DeployBench as an empirical benchmark consisting of 51 tasks whose success is defined by an external hidden verification pipeline that runs each paper's designated experiment. Reported pass rates (7.8%-51.0%) and failure counts (97/154 self-stops) are presented as direct measurements against those externally specified targets. No equations, fitted parameters, predictions derived from first principles, ansatzes, or uniqueness theorems appear in the provided text. No self-citations are invoked to justify any load-bearing step. The verification pipeline is an uninspectable assumption about task construction, but this is a question of external grounding rather than any reduction of a claimed derivation to its own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; full paper may contain additional details on task curation and pipeline design.

axioms (1)
  • domain assumption: The 51 selected tasks adequately represent the full scope of research artifact deployment challenges across the three domains.
    The benchmark's value rests on this representativeness claim stated in the abstract.


