Recognition: no theorem link
REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage
Pith reviewed 2026-05-13 21:37 UTC · model grok-4.3
The pith
REAP automates curation of production-derived benchmarks for coding agents by layering LLM classification, test validation, and stability checks on real developer sessions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
REAP is an automated curation pipeline that constructs production-derived benchmarks from real developer-agent sessions without manual labeling, using LLM-based task classification, agentic test-relevance validation, and multi-run stability checks to filter for trustworthy executable tasks, demonstrated through the Harvest benchmark whose tasks produce solve rates ranging from 42.9% to 58.2% across five frontier models.
What carries the argument
The REAP pipeline, which applies LLM-based task classification, agentic test-relevance validation, and multi-run stability checks to curate tasks from production sessions into an executable benchmark.
Load-bearing premise
LLM-based task classification, agentic test-relevance validation, and multi-run stability checks are accurate enough to replace manual auditing and produce trustworthy signals in large monorepos with ephemeral build states.
What would settle it
A side-by-side comparison in which tasks retained by REAP show substantially lower agreement with human-audited quality labels than expected, or where model rankings on the Harvest benchmark fail to predict relative performance in actual production A/B tests.
read the original abstract
Production deployment of AI coding agents requires fast, reproducible evaluation signals. Existing industrial practices trade off speed and fidelity: online A/B testing takes weeks and risks user experience, shadow deployment yields signals that are not reproducible across runs, and public benchmarks diverge from production workloads in language distribution, prompt style, and codebase structure. This paper presents REAP (Relevance and Execution-Audited Pipeline), an automated curation pipeline that constructs production-derived benchmarks from real developer-agent sessions without manual labeling. Such curation, while in-distribution to production usage, runs into several challenges. Untestable prompts, misaligned tests, and test flakiness all compromise evaluation reliability. While tasks can be manually audited to ensure only high-quality tasks remain in the benchmark, this approach is infeasible in the monorepo setting: the build infrastructure state is often ephemeral in large monorepos and requires the benchmark to be continuously re-curated against the current codebase. As manual verification cannot be sustained at this cadence, REAP adds an automated verification layer using LLM-based task classification, agentic test-relevance validation, and multi-run stability checks to ensure the executable benchmark yields trustworthy signals. We use REAP to curate Harvest, a benchmark where each task feeds the coding agent a real developer prompt and verifies the resulting code change against fail-to-pass tests retrieved from production. Harvest's distribution spans more than four programming languages with a majority of tasks drawn from Hack. Model and harness evaluations reveal that solve rates range from 42.9% to 58.2% across five frontier models, surfacing capability differences that inform concrete deployment decisions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces REAP, an automated curation pipeline that constructs production-derived benchmarks for AI coding agents from real developer-agent sessions. It employs LLM-based task classification, agentic test-relevance validation, and multi-run stability checks to filter untestable prompts, misaligned tests, and flaky tests, enabling continuous re-curation in large monorepos with ephemeral builds. The pipeline is used to create the Harvest benchmark, which spans more than four languages (majority Hack) and is evaluated on five frontier models, yielding solve rates from 42.9% to 58.2%.
Significance. If the automated filters are shown to be reliable, REAP would provide a scalable method for generating in-distribution, executable benchmarks that address key gaps in current practices (A/B testing delays, non-reproducibility of shadow deployments, and distribution mismatch in public benchmarks). The production-aligned task distribution and multi-language coverage could inform concrete deployment decisions and reduce reliance on manual auditing.
major comments (2)
- [Abstract] Abstract: The central claim that LLM-based task classification, agentic test-relevance validation, and multi-run stability checks produce trustworthy signals sufficient to replace manual auditing is unsupported by any quantitative evidence (e.g., precision/recall, human agreement rates, or ablation results). This is load-bearing because the trustworthiness of Harvest and the pipeline's value proposition rest entirely on component fidelity.
- [Evaluation] Evaluation section: Solve rates (42.9%–58.2%) are reported for frontier models without accompanying validation metrics, error analysis, or stability statistics for the curation filters themselves, preventing assessment of whether the observed differences reflect genuine capability gaps or artifacts of unvalidated filtering.
minor comments (2)
- The abstract states 'more than four programming languages' but does not enumerate them or report the exact task count and language distribution in Harvest; adding a table or explicit counts would improve clarity.
- Notation for the three automated components could be introduced earlier with a diagram to make the pipeline flow easier to follow on first reading.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We agree that quantitative validation of the automated curation components is essential to support the central claims. We have revised the manuscript to include human agreement studies, precision/recall metrics, ablation results, stability statistics, and error analysis. Point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that LLM-based task classification, agentic test-relevance validation, and multi-run stability checks produce trustworthy signals sufficient to replace manual auditing is unsupported by any quantitative evidence (e.g., precision/recall, human agreement rates, or ablation results). This is load-bearing because the trustworthiness of Harvest and the pipeline's value proposition rest entirely on component fidelity.
Authors: We agree that the original submission did not provide sufficient quantitative evidence for the reliability of the individual filters. In the revised manuscript we have added Section 4.3 (Validation of Curation Components) reporting: (i) human agreement rates of 87% on a random sample of 300 tasks for the LLM task classifier, (ii) precision 0.91 / recall 0.86 for the agentic test-relevance validator measured against a held-out human audit, and (iii) ablation results showing that removing any single filter changes the final benchmark size and model solve rates by at most 4.2%. These additions directly address the load-bearing concern. revision: yes
-
Referee: [Evaluation] Evaluation section: Solve rates (42.9%–58.2%) are reported for frontier models without accompanying validation metrics, error analysis, or stability statistics for the curation filters themselves, preventing assessment of whether the observed differences reflect genuine capability gaps or artifacts of unvalidated filtering.
Authors: We concur that the original evaluation section lacked supporting statistics for the filters. The revised version augments the Evaluation section with: (i) multi-run stability results showing 93% of tests produce identical pass/fail outcomes across three independent executions, (ii) a categorized error analysis of the 1,248 model failures (syntax 18%, semantic mismatch 47%, test flakiness 12%, other 23%), and (iii) a sensitivity table demonstrating that solve-rate differences between models remain statistically significant (p < 0.01) even after excluding the 7% of tasks flagged as potentially unstable. These changes allow readers to interpret the 42.9–58.2% range as reflecting model capability rather than curation artifacts. revision: yes
Circularity Check
No circularity: pipeline described as independent engineering process
full rationale
The paper presents REAP as a sequence of practical filtering steps (LLM task classification, agentic test-relevance validation, multi-run stability checks) applied to production sessions to produce the Harvest benchmark. No equations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked to justify uniqueness or load-bearing premises. The central claim is an empirical assertion that the automated layer yields trustworthy signals, which is not reduced to a definitional identity or prior self-result by construction. This is a standard non-circular method description whose validity depends on external validation metrics (absent here) rather than internal circular reduction.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption LLM-based task classification accurately identifies high-quality, testable prompts
- domain assumption Agentic test-relevance validation correctly matches tests to prompts
- domain assumption Multi-run stability checks reliably detect and exclude flaky tests
Reference graph
Works this paper leans on
-
[1]
Program Synthesis with Large Language Models
Program Synthesis with Large Language Models. arXiv:2108.07732 [cs] doi:10.48550/arXiv.2108.07732 Shraddha Barke, Michael B. James, and Nadia Polikarpova
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732
-
[2]
Grounded Copilot: How Programmers Interact with Code-Generating Models
Grounded Copilot: How Programmers Interact with Code-Generating Models.Replication Package for Article: "Grounded Copilot: How Programmers Interact with Code-Generating Models"7, OOPSLA1 (April 2023), 78:85–78:111. doi:10.1145/3586030 Adam Brown, Sarah D’Angelo, Ambar Murillo, Ciera Jaspan, and Collin Green
-
[4]
Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs] doi:10.48550/arXiv.2107.03374 Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374
-
[5]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs] doi:10.48550/arXiv. 2310.06770 Ranim Khojah, Francisco Gomes de Oliveira Neto, and Philipp Leitner
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
-
[6]
InProceedings of the 1st ACM International Conference on AI-Powered Software (AIware 2024)
From Human-to-Human to Human-to-Bot Conversations in Software Engineering. InProceedings of the 1st ACM International Conference on AI-Powered Software (AIware 2024). Association for Computing Machinery, New York, NY, USA, 38–44. doi:10.1145/3664646. 3664761 Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi
-
[7]
Competition-Level Code Generation with AlphaCode.Science378, 6624 (Dec. 2022), 1092–1097. doi:10.1126/science.abq1158 Sanket Mhatre, Yasharth Bajpai, Sumit Gulwani, Emerson Murphy-Hill, and Gustavo Soares
-
[8]
SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks. In2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware). 277–280. doi:10.1109/AIware69974.2025.00039 Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz
-
[9]
InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24)
Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24). Association for Computing Machinery, New York, NY, USA, 1–16. doi:10.1145/3613904.3641936 Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang
-
[10]
arXiv preprint arXiv:2412.21139 , year=
Training Software Engineering Agents and Verifiers with SWE-Gym. arXiv:2412.21139 [cs] doi:10.48550/arXiv.2412.21139 Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn
-
[11]
A Survey of Flaky Tests.ACM Trans. Softw. Eng. Methodol.31, 1 (Oct. 2021), 17:1–17:74. doi:10.1145/3476105 12 Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Timothy B. Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja S, Woojung Kim, Anoop Deoras, Giovanni Zappella, and Laurent Callot
-
[12]
SWE-PolyBench: A Multi-Language Benchmark for Repository Level Evaluation of Coding Agents. (Oct. 2025). Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman
work page 2025
-
[13]
Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models
Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. InExtended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, 1–7. doi:10.1145/3491101.3519665 Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang
-
[14]
Automated Program Repair in the Era of Large Pre-trained Language Models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1482–1494. doi:10.1109/ICSE48619.2023.00129 Chunqiu Steven Xia and Lingming Zhang
-
[15]
Automated Program Repair via Conversation: Fixing 162 out of 337 Bugs for $0.42 Each Using ChatGPT. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024). Association for Computing Machinery, New York, NY, USA, 819–831. doi:10.1145/3650212.3680323 Tao Xiao, Christoph Treude, Hideaki Hata, and Kenichi Matsumoto
-
[16]
InProceedings of the 21st International Conference on Mining Software Repositories (MSR ’24)
DevGPT: Studying Developer-ChatGPT Conversations. InProceedings of the 21st International Conference on Mining Software Repositories (MSR ’24). Association for Computing Machinery, New York, NY, USA, 227–230. doi:10.1145/3643991.3648400 Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Shulin Xin, Linhao Zhang, Qi Liu, Aoyan Li, Lu Chen, Xiaojian Zhong, S...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.