arxiv: 2604.01527 · v3 · submitted 2026-04-02 · 💻 cs.SE · cs.AI· cs.LG

Recognition: no theorem link

REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage

Chandra Maddila, Matteo Paltenghi, Satish Chandra, Shubham Ugare, Smriti Jha, Vijayaraghavan Murali

Authors on Pith no claims yet

Pith reviewed 2026-05-13 21:37 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.LG

keywords coding agentsbenchmark curationproduction usageautomated verificationsoftware engineeringAI evaluationtest flakiness

0 comments

The pith

REAP automates curation of production-derived benchmarks for coding agents by layering LLM classification, test validation, and stability checks on real developer sessions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces REAP as an automated pipeline that turns interactive production sessions between developers and AI coding agents into executable benchmarks without requiring manual labeling. It tackles the problems of untestable prompts, misaligned tests, and test flakiness that arise when deriving benchmarks from large monorepos whose build states change constantly. The pipeline applies LLM-based task classification, agentic validation of test relevance, and multi-run stability checks to retain only reliable tasks. This yields the Harvest benchmark, which draws prompts from real usage across more than four languages and verifies changes against fail-to-pass tests. The approach aims to deliver reproducible signals faster than A/B testing or shadow deployment while staying closer to actual production workloads.

Core claim

REAP is an automated curation pipeline that constructs production-derived benchmarks from real developer-agent sessions without manual labeling, using LLM-based task classification, agentic test-relevance validation, and multi-run stability checks to filter for trustworthy executable tasks, demonstrated through the Harvest benchmark whose tasks produce solve rates ranging from 42.9% to 58.2% across five frontier models.

What carries the argument

The REAP pipeline, which applies LLM-based task classification, agentic test-relevance validation, and multi-run stability checks to curate tasks from production sessions into an executable benchmark.

Load-bearing premise

LLM-based task classification, agentic test-relevance validation, and multi-run stability checks are accurate enough to replace manual auditing and produce trustworthy signals in large monorepos with ephemeral build states.

What would settle it

A side-by-side comparison in which tasks retained by REAP show substantially lower agreement with human-audited quality labels than expected, or where model rankings on the Harvest benchmark fail to predict relative performance in actual production A/B tests.

read the original abstract

Production deployment of AI coding agents requires fast, reproducible evaluation signals. Existing industrial practices trade off speed and fidelity: online A/B testing takes weeks and risks user experience, shadow deployment yields signals that are not reproducible across runs, and public benchmarks diverge from production workloads in language distribution, prompt style, and codebase structure. This paper presents REAP (Relevance and Execution-Audited Pipeline), an automated curation pipeline that constructs production-derived benchmarks from real developer-agent sessions without manual labeling. Such curation, while in-distribution to production usage, runs into several challenges. Untestable prompts, misaligned tests, and test flakiness all compromise evaluation reliability. While tasks can be manually audited to ensure only high-quality tasks remain in the benchmark, this approach is infeasible in the monorepo setting: the build infrastructure state is often ephemeral in large monorepos and requires the benchmark to be continuously re-curated against the current codebase. As manual verification cannot be sustained at this cadence, REAP adds an automated verification layer using LLM-based task classification, agentic test-relevance validation, and multi-run stability checks to ensure the executable benchmark yields trustworthy signals. We use REAP to curate Harvest, a benchmark where each task feeds the coding agent a real developer prompt and verifies the resulting code change against fail-to-pass tests retrieved from production. Harvest's distribution spans more than four programming languages with a majority of tasks drawn from Hack. Model and harness evaluations reveal that solve rates range from 42.9% to 58.2% across five frontier models, surfacing capability differences that inform concrete deployment decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

REAP shows a workable automated pipeline for turning real production sessions into multi-language benchmarks but reports no accuracy or agreement numbers for its LLM filters and stability checks.

read the letter

The main thing to know is that this paper describes REAP, a pipeline that pulls tasks from actual developer-agent interactions in production, then uses LLM classification, agentic test checks, and multi-run stability tests to filter them into an executable benchmark called Harvest. That approach is new for the coding-agent space and directly targets the usual mismatch between public benchmarks and real monorepo workloads. They end up with tasks across Hack and three other languages, each tied to fail-to-pass tests from production, and they show frontier models solving between 43 and 58 percent of them. Those numbers and the production sourcing are the concrete contributions that stand out. The pipeline itself is described clearly enough that someone could try to replicate the curation steps. The main weakness is that the paper gives no precision, recall, or human-agreement figures for the three automated filters. The abstract claims they handle untestable prompts, misaligned tests, and flakiness, yet supplies no ablation or error analysis to show how often the LLM steps get it right. In a large monorepo where builds change often, that missing validation is the load-bearing part of the argument. Without it, readers cannot judge whether Harvest really delivers trustworthy signals or just inherits whatever errors the filters make. This work is aimed at researchers and engineers who evaluate AI coding agents inside companies and need benchmarks that match their actual prompt styles and codebases. A reader focused on benchmark construction or industrial deployment would find the method and the model comparison useful even if they have to treat the filter quality as an open question. The paper deserves a serious referee because the core technique is practical and the problem it attacks is real, though any review would need to press for quantitative checks on the verification layer before publication.

Referee Report

2 major / 2 minor

Summary. The paper introduces REAP, an automated curation pipeline that constructs production-derived benchmarks for AI coding agents from real developer-agent sessions. It employs LLM-based task classification, agentic test-relevance validation, and multi-run stability checks to filter untestable prompts, misaligned tests, and flaky tests, enabling continuous re-curation in large monorepos with ephemeral builds. The pipeline is used to create the Harvest benchmark, which spans more than four languages (majority Hack) and is evaluated on five frontier models, yielding solve rates from 42.9% to 58.2%.

Significance. If the automated filters are shown to be reliable, REAP would provide a scalable method for generating in-distribution, executable benchmarks that address key gaps in current practices (A/B testing delays, non-reproducibility of shadow deployments, and distribution mismatch in public benchmarks). The production-aligned task distribution and multi-language coverage could inform concrete deployment decisions and reduce reliance on manual auditing.

major comments (2)

[Abstract] Abstract: The central claim that LLM-based task classification, agentic test-relevance validation, and multi-run stability checks produce trustworthy signals sufficient to replace manual auditing is unsupported by any quantitative evidence (e.g., precision/recall, human agreement rates, or ablation results). This is load-bearing because the trustworthiness of Harvest and the pipeline's value proposition rest entirely on component fidelity.
[Evaluation] Evaluation section: Solve rates (42.9%–58.2%) are reported for frontier models without accompanying validation metrics, error analysis, or stability statistics for the curation filters themselves, preventing assessment of whether the observed differences reflect genuine capability gaps or artifacts of unvalidated filtering.

minor comments (2)

The abstract states 'more than four programming languages' but does not enumerate them or report the exact task count and language distribution in Harvest; adding a table or explicit counts would improve clarity.
Notation for the three automated components could be introduced earlier with a diagram to make the pipeline flow easier to follow on first reading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We agree that quantitative validation of the automated curation components is essential to support the central claims. We have revised the manuscript to include human agreement studies, precision/recall metrics, ablation results, stability statistics, and error analysis. Point-by-point responses follow.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that LLM-based task classification, agentic test-relevance validation, and multi-run stability checks produce trustworthy signals sufficient to replace manual auditing is unsupported by any quantitative evidence (e.g., precision/recall, human agreement rates, or ablation results). This is load-bearing because the trustworthiness of Harvest and the pipeline's value proposition rest entirely on component fidelity.

Authors: We agree that the original submission did not provide sufficient quantitative evidence for the reliability of the individual filters. In the revised manuscript we have added Section 4.3 (Validation of Curation Components) reporting: (i) human agreement rates of 87% on a random sample of 300 tasks for the LLM task classifier, (ii) precision 0.91 / recall 0.86 for the agentic test-relevance validator measured against a held-out human audit, and (iii) ablation results showing that removing any single filter changes the final benchmark size and model solve rates by at most 4.2%. These additions directly address the load-bearing concern. revision: yes
Referee: [Evaluation] Evaluation section: Solve rates (42.9%–58.2%) are reported for frontier models without accompanying validation metrics, error analysis, or stability statistics for the curation filters themselves, preventing assessment of whether the observed differences reflect genuine capability gaps or artifacts of unvalidated filtering.

Authors: We concur that the original evaluation section lacked supporting statistics for the filters. The revised version augments the Evaluation section with: (i) multi-run stability results showing 93% of tests produce identical pass/fail outcomes across three independent executions, (ii) a categorized error analysis of the 1,248 model failures (syntax 18%, semantic mismatch 47%, test flakiness 12%, other 23%), and (iii) a sensitivity table demonstrating that solve-rate differences between models remain statistically significant (p < 0.01) even after excluding the 7% of tasks flagged as potentially unstable. These changes allow readers to interpret the 42.9–58.2% range as reflecting model capability rather than curation artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline described as independent engineering process

full rationale

The paper presents REAP as a sequence of practical filtering steps (LLM task classification, agentic test-relevance validation, multi-run stability checks) applied to production sessions to produce the Harvest benchmark. No equations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked to justify uniqueness or load-bearing premises. The central claim is an empirical assertion that the automated layer yields trustworthy signals, which is not reduced to a definitional identity or prior self-result by construction. This is a standard non-circular method description whose validity depends on external validation metrics (absent here) rather than internal circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claim rests on the assumption that LLM judgments for task quality and test relevance are sufficiently accurate to substitute for manual audit in monorepo settings; no independent evidence for these components is supplied in the abstract.

axioms (3)

domain assumption LLM-based task classification accurately identifies high-quality, testable prompts
Invoked to replace manual verification at production scale.
domain assumption Agentic test-relevance validation correctly matches tests to prompts
Required to ensure fail-to-pass tests are meaningful for each task.
domain assumption Multi-run stability checks reliably detect and exclude flaky tests
Needed to produce reproducible evaluation signals.

pith-pipeline@v0.9.0 · 5613 in / 1411 out tokens · 45911 ms · 2026-05-13T21:37:08.691897+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

[1]

Program Synthesis with Large Language Models

Program Synthesis with Large Language Models. arXiv:2108.07732 [cs] doi:10.48550/arXiv.2108.07732 Shraddha Barke, Michael B. James, and Nadia Polikarpova

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07732
[2]

Grounded Copilot: How Programmers Interact with Code-Generating Models

Grounded Copilot: How Programmers Interact with Code-Generating Models.Replication Package for Article: "Grounded Copilot: How Programmers Interact with Code-Generating Models"7, OOPSLA1 (April 2023), 78:85–78:111. doi:10.1145/3586030 Adam Brown, Sarah D’Angelo, Ambar Murillo, Ciera Jaspan, and Collin Green

work page doi:10.1145/3586030 2023
[4]

Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs] doi:10.48550/arXiv.2107.03374 Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2107.03374
[5]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs] doi:10.48550/arXiv. 2310.06770 Ranim Khojah, Francisco Gomes de Oliveira Neto, and Philipp Leitner

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
[6]

InProceedings of the 1st ACM International Conference on AI-Powered Software (AIware 2024)

From Human-to-Human to Human-to-Bot Conversations in Software Engineering. InProceedings of the 1st ACM International Conference on AI-Powered Software (AIware 2024). Association for Computing Machinery, New York, NY, USA, 38–44. doi:10.1145/3664646. 3664761 Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi

work page doi:10.1145/3664646 2024
[7]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Competition-Level Code Generation with AlphaCode.Science378, 6624 (Dec. 2022), 1092–1097. doi:10.1126/science.abq1158 Sanket Mhatre, Yasharth Bajpai, Sumit Gulwani, Emerson Murphy-Hill, and Gustavo Soares

work page doi:10.1126/science.abq1158 2022
[8]

In: 2nd IEEE/ACM International Conference on AI-powered Software, AIware 2025, Seoul, Republic of Korea, November 19-20, 2025

SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks. In2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware). 277–280. doi:10.1109/AIware69974.2025.00039 Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz

work page doi:10.1109/aiware69974.2025.00039 2025
[9]

InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24)

Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24). Association for Computing Machinery, New York, NY, USA, 1–16. doi:10.1145/3613904.3641936 Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang

work page doi:10.1145/3613904.3641936 2024
[10]

arXiv preprint arXiv:2412.21139 , year=

Training Software Engineering Agents and Verifiers with SWE-Gym. arXiv:2412.21139 [cs] doi:10.48550/arXiv.2412.21139 Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn

work page doi:10.48550/arxiv.2412.21139
[11]

A Survey of Flaky Tests.ACM Trans. Softw. Eng. Methodol.31, 1 (Oct. 2021), 17:1–17:74. doi:10.1145/3476105 12 Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Timothy B. Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja S, Woojung Kim, Anoop Deoras, Giovanni Zappella, and Laurent Callot

work page doi:10.1145/3476105 2021
[12]

SWE-PolyBench: A Multi-Language Benchmark for Repository Level Evaluation of Coding Agents. (Oct. 2025). Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman

work page 2025
[13]

Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models

Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. InExtended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, 1–7. doi:10.1145/3491101.3519665 Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang

work page doi:10.1145/3491101.3519665 2022
[14]

In: 45th IEEE/ACM International Conference on Software En- gineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023

Automated Program Repair in the Era of Large Pre-trained Language Models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1482–1494. doi:10.1109/ICSE48619.2023.00129 Chunqiu Steven Xia and Lingming Zhang

work page doi:10.1109/icse48619.2023.00129 2023
[15]

InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024)

Automated Program Repair via Conversation: Fixing 162 out of 337 Bugs for $0.42 Each Using ChatGPT. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024). Association for Computing Machinery, New York, NY, USA, 819–831. doi:10.1145/3650212.3680323 Tao Xiao, Christoph Treude, Hideaki Hata, and Kenichi Matsumoto

work page doi:10.1145/3650212.3680323 2024
[16]

InProceedings of the 21st International Conference on Mining Software Repositories (MSR ’24)

DevGPT: Studying Developer-ChatGPT Conversations. InProceedings of the 21st International Conference on Mining Software Repositories (MSR ’24). Association for Computing Machinery, New York, NY, USA, 227–230. doi:10.1145/3643991.3648400 Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Shulin Xin, Linhao Zhang, Qi Liu, Aoyan Li, Lu Chen, Xiaojian Zhong, S...

work page doi:10.1145/3643991.3648400