arxiv: 2606.28436 · v1 · pith:F6LKGV45new · submitted 2026-06-26 · 💻 cs.SE · cs.AI

Dockerless: Environment-Free Program Verifier for Coding Agents

Pith reviewed 2026-06-30 01:30 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords program verifier · coding agents · patch verification · environment-free · agentic exploration · SWE-bench · post-training · reinforcement learning

The pith

Dockerless judges whether code patches are correct through agentic repository exploration, without test execution or Docker environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Dockerless as a program verifier that determines whether generated code patches are correct by collecting evidence through agentic exploration of the target repository. This verifier is then used to filter trajectories for supervised fine-tuning and to supply rewards during reinforcement learning. The resulting pipeline trains coding agents without any per-repository environment setup or test execution. On SWE-bench Verified, Multilingual, and Pro the trained model reaches 62.0 percent, 50.0 percent, and 35.2 percent resolve rate, exceeding the Qwen3.5-9B baseline while equaling the performance of environment-based training. A reader would care because the method removes the dominant cost of maintaining Docker images for large-scale agent post-training.

Core claim

Dockerless is an environment-free agentic patch verifier that evaluates generated code patches without executing them or matching them to references; instead it judges correctness from evidence gathered through agentic repository exploration. When the same verifier is applied both as the SFT trajectory filter and as the RL reward signal, it produces a fully environment-free post-training pipeline whose model reaches 62.0 percent, 50.0 percent, and 35.2 percent resolve rate on SWE-bench Verified, Multilingual, and Pro respectively, exceeding the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points and matching the results of environment-based post-training.
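
The baseline figures are not quoted in the claim, but they follow from the reported resolve rates and gains. A minimal sketch of that arithmetic, where the implied baselines are derived here rather than taken from the paper:

```python
# Reported resolve rates (percent) and reported gains over the Qwen3.5-9B baseline.
reported = {"Verified": 62.0, "Multilingual": 50.0, "Pro": 35.2}
gains = {"Verified": 2.4, "Multilingual": 8.7, "Pro": 2.9}

# Implied baseline = reported rate minus reported gain.
# These values are derived for illustration, not quoted from the paper.
implied_baseline = {k: round(reported[k] - gains[k], 1) for k in reported}

for split, base in implied_baseline.items():
    print(f"{split}: baseline ~{base}% -> trained {reported[split]}% (+{gains[split]})")
```

The largest gain is on Multilingual, so the spread of improvements across splits is uneven even though all three are positive.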

What carries the argument

Dockerless, an agentic patch verifier that gathers evidence through repository exploration to judge patch correctness without execution.

If this is right

  • A fully environment-free post-training pipeline becomes possible for coding agents.
  • The verifier can be used simultaneously for SFT trajectory filtering and RL reward assignment.
  • Performance on SWE-bench Verified, Multilingual, and Pro matches that of execution-based training.
  • Dockerless outperforms the strongest open-source verifier by 14.3 AUC points on a dedicated verifier benchmark.
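
The dual use of one verifier can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the `Trajectory` type, the score range, the `threshold` value, and the `toy_verify` stand-in are all hypothetical, and the abstract does not say whether the reward is binary or continuous.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    issue: str  # task description
    patch: str  # candidate diff produced by the agent


# Hypothetical verifier interface: a score in [0, 1], higher means more likely correct.
Verifier = Callable[[Trajectory], float]


def filter_for_sft(trajs: List[Trajectory], verify: Verifier,
                   threshold: float = 0.5) -> List[Trajectory]:
    """Keep only trajectories the verifier scores at or above the (assumed) threshold."""
    return [t for t in trajs if verify(t) >= threshold]


def rl_reward(traj: Trajectory, verify: Verifier) -> float:
    """Reuse the same verifier score as the scalar RL reward."""
    return verify(traj)


def toy_verify(t: Trajectory) -> float:
    """Toy stand-in only; a real verifier would explore the repository for evidence."""
    return 0.9 if "fix" in t.patch else 0.1


trajs = [Trajectory("bug A", "fix null check"), Trajectory("bug B", "add print")]
kept = filter_for_sft(trajs, toy_verify)
rewards = [rl_reward(t, toy_verify) for t in trajs]
```

Because both functions call the same `verify`, any systematic verifier error affects the SFT data and the RL signal in the same direction, which is why the referee report focuses on verifier reliability.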

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same exploration-based judgment approach could be tested on verification tasks outside software, such as formal proof checking.
  • If exploration evidence proves reliable, training pipelines could drop unit-test requirements entirely for some domains.
  • Larger-scale repository exploration might improve verification accuracy on very large codebases where test coverage is incomplete.
  • The method opens the possibility of mixing Dockerless signals with lightweight static analysis to reduce any remaining judgment errors.

Load-bearing premise

Evidence gathered by agentic repository exploration is sufficient to determine patch correctness without any execution or test running.

What would settle it

A large set of patches whose correctness judgments from Dockerless disagree with the outcomes of actual unit-test execution on the same patches.
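
Such a comparison could be tabulated roughly as below. This is a hedged sketch: the function name, the category labels, and the toy data are assumptions for illustration, and no real verdicts from the paper are used.

```python
from typing import Dict, Iterable, Tuple


def compare_verdicts(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Tally (verifier_says_correct, tests_pass) pairs and the overall disagreement rate."""
    counts = {"agree_accept": 0, "agree_reject": 0, "false_accept": 0, "false_reject": 0}
    for verifier_ok, tests_ok in pairs:
        if verifier_ok and tests_ok:
            counts["agree_accept"] += 1
        elif not verifier_ok and not tests_ok:
            counts["agree_reject"] += 1
        elif verifier_ok:
            counts["false_accept"] += 1  # verifier accepts a patch that fails tests
        else:
            counts["false_reject"] += 1  # verifier rejects a patch that passes tests
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdict pairs supplied")
    result: Dict[str, float] = dict(counts)
    result["disagreement_rate"] = (counts["false_accept"] + counts["false_reject"]) / total
    return result


# Toy data only: five hypothetical patches.
toy_pairs = [(True, True), (True, False), (False, False), (False, True), (True, True)]
report = compare_verdicts(toy_pairs)
```

Separating false accepts from false rejects matters here: false accepts admit wrong patches into training, while false rejects only discard usable data.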

Original abstract

Program verifiers play a central role in training coding agents, including selecting trajectories for supervised fine-tuning (SFT) and providing rewards for reinforcement learning (RL). Standard execution-based verification requires running unit tests inside per-repository environments such as Docker images, incurring substantial environment setup costs. We propose Dockerless, an environment-free agentic patch verifier that evaluates generated code patches without executing them. Rather than simply matching candidate patches to references, Dockerless judges patch correctness using evidence gathered through agentic repository exploration. On a verifier evaluation benchmark, Dockerless outperforms the strongest open-source verifier by 14.3 AUC points. Using Dockerless as both the SFT trajectory filter and the RL reward enables a fully environment-free post-training pipeline. The resulting model reaches 62.0%, 50.0%, and 35.2% resolve rate on SWE-bench Verified, Multilingual, and Pro, respectively. It surpasses the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points, matching environment-based post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Dockerless, an agentic, environment-free program verifier for code patches. It judges patch correctness via evidence from repository exploration rather than execution or test running. Dockerless is reported to outperform the strongest open-source verifier by 14.3 AUC points on a verifier benchmark. Using it for both SFT trajectory filtering and RL rewards produces a post-training pipeline that yields 62.0%, 50.0%, and 35.2% resolve rates on SWE-bench Verified, Multilingual, and Pro, surpassing the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points, respectively, and matching environment-based post-training.

Significance. If the verifier's judgments prove reliable proxies for actual patch correctness, the work would be significant for scaling coding-agent post-training. Removing per-repository Docker setup and execution costs could lower barriers to large-scale SFT/RL pipelines while preserving performance, which is a practical advance for the field.

major comments (2)
  1. [Dockerless verifier description and post-training experiments] The central claim that Dockerless enables environment-free post-training matching execution-based results rests on the assumption that agentic repository exploration alone suffices to determine patch correctness. The manuscript provides no analysis or ablation showing that this approach captures failure modes (e.g., runtime behavior, unvisited paths, or environment-specific interactions) that execution-based verification would detect; without such evidence the reported resolve-rate gains could reflect verifier artifacts rather than genuine quality.
  2. [Verifier evaluation section] The verifier benchmark result (14.3 AUC gain) is presented without accompanying details on dataset construction, baseline implementations, controls for exploration depth, or error analysis of false positives/negatives. This makes it impossible to determine whether the AUC improvement generalizes or is load-bearing for the downstream SFT/RL claims.
minor comments (2)
  1. [Abstract] The abstract states concrete performance numbers (AUC gain, resolve rates) with no accompanying method or dataset summary; this should be expanded for readability even if details appear later.
  2. [Method] Notation for the agentic exploration process (e.g., how evidence is aggregated into a correctness judgment) is introduced without a clear algorithmic pseudocode or formal definition.
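
For readers unfamiliar with the metric behind the 14.3-point claim, AUC can be computed from verifier scores and execution-based labels as a pairwise ranking statistic. The sketch below uses toy numbers, not data from the paper, and it assumes that "AUC points" means percentage points, so a 14.3-point gain would be 0.143 on the 0-to-1 scale.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a randomly chosen correct patch outscores an incorrect one.

    Ties count as half a win (the Mann-Whitney formulation of ROC AUC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: label 1 means the patch passes its held-out tests.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)  # 3 of 4 positive/negative pairs are ranked correctly
```

AUC is threshold-free, so it does not by itself show how the verifier behaves at the particular cutoff used for SFT filtering; that is one reason the requested false-positive and false-negative breakdown is a separate ask.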

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing to revisions that strengthen the presentation of evidence while defending the empirical results as reported.

  1. Referee: [Dockerless verifier description and post-training experiments] The central claim that Dockerless enables environment-free post-training matching execution-based results rests on the assumption that agentic repository exploration alone suffices to determine patch correctness. The manuscript provides no analysis or ablation showing that this approach captures failure modes (e.g., runtime behavior, unvisited paths, or environment-specific interactions) that execution-based verification would detect; without such evidence the reported resolve-rate gains could reflect verifier artifacts rather than genuine quality.

    Authors: We acknowledge that the manuscript does not contain explicit ablations isolating failure modes such as runtime behavior, unvisited paths, or environment-specific interactions. The reported resolve rates (62.0/50.0/35.2%) are shown to match environment-based post-training on the three SWE-bench splits, which provides indirect support for the verifier's utility, but we agree this does not substitute for targeted analysis. In the revision we will add a dedicated limitations subsection with qualitative examples of cases where repository exploration may miss execution-dependent errors, while retaining the empirical matching results as evidence of practical equivalence for the evaluated benchmarks. revision: partial

  2. Referee: [Verifier evaluation section] The verifier benchmark result (14.3 AUC gain) is presented without accompanying details on dataset construction, baseline implementations, controls for exploration depth, or error analysis of false positives/negatives. This makes it impossible to determine whether the AUC improvement generalizes or is load-bearing for the downstream SFT/RL claims.

    Authors: We agree that the verifier evaluation section requires additional methodological detail to support the 14.3 AUC claim and its connection to the post-training results. The revised manuscript will expand this section to describe the benchmark dataset construction process, the exact baseline verifier implementations and hyperparameters, controls and sensitivity analysis for exploration depth, and a breakdown of false-positive and false-negative cases with representative examples. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external benchmarks

The paper describes an agentic verifier (Dockerless) whose correctness judgments are produced by repository exploration rather than execution. It reports an AUC improvement on a verifier benchmark and downstream resolve rates on SWE-bench variants that are measured by the standard execution-based protocol. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described method. The central claim (environment-free training matches environment-based results) is an empirical comparison against external benchmarks and baselines, not a reduction to the verifier's own outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is based solely on the abstract; no free parameters, invented entities, or explicit axioms are stated. The central assumption that agentic exploration yields reliable correctness signals is treated as a domain assumption.

axioms (1)
  • domain assumption: Agentic repository exploration can gather sufficient evidence to judge patch correctness without execution.
    This premise is required for the environment-free claim to hold; it is invoked by the description of how Dockerless works.
