Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
Pith reviewed 2026-05-08 08:59 UTC · model grok-4.3
The pith
Coding agents reach a shared ceiling of 45–49% F1 when identifying which tests need updates after real code commits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TEBench shows that all evaluated configurations converge on an identification F1 between 45.7% and 49.4%, exposing a performance ceiling independent of framework or base model, while Test-Stale cases average only around 36% F1 because agents follow a reactive execute-fail-fix loop that cannot address tests that still pass yet no longer validate the intended behavior.
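To make the diagnosed failure mode concrete, here is a minimal sketch (not from the paper) of the reactive loop the trajectory analysis describes: the agent only acts on tests that fail when executed, so stale tests (which still pass) and missing tests (which do not exist yet) never enter its work queue. The names run_test_suite, propose_fix, and TestResult are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool

def run_test_suite(repo_path: str) -> list[TestResult]:
    """Hypothetical stand-in for executing the project's test suite."""
    raise NotImplementedError

def propose_fix(repo_path: str, failing_test: str) -> None:
    """Hypothetical stand-in for the agent editing one failing test."""
    raise NotImplementedError

def reactive_execute_fail_fix(repo_path: str, max_rounds: int = 5) -> None:
    """Reactive loop driven purely by execution-failure signals.

    Test-Breaking cases surface here because they fail; Test-Stale tests
    keep passing and Test-Missing tests never run, so neither is ever
    selected for modification by this loop.
    """
    for _ in range(max_rounds):
        failures = [r for r in run_test_suite(repo_path) if not r.passed]
        if not failures:  # all green: loop stops; stale/missing work stays untouched
            return
        for result in failures:
            propose_fix(repo_path, result.name)
```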
What carries the argument
The four-stage pipeline that selects Defects4J commits, extracts project-level task instances, annotates each with one or more of Test-Breaking, Test-Stale, or Test-Missing labels, and attaches developer-written ground-truth patches.
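As a rough illustration only, one task instance produced by that pipeline could be represented as below; the field names are assumptions for this sketch, not the released dataset's schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one TEBench task instance (illustrative only).
@dataclass
class TaskInstance:
    project: str               # e.g. a Defects4J project id such as "Lang"
    commit_sha: str            # code-changing commit given to the agent
    evolution_types: set[str]  # subset of {"Test-Breaking", "Test-Stale", "Test-Missing"}
    affected_tests: list[str]  # fully qualified test methods developers modified
    ground_truth_patch: str    # developer-written test diff used as reference

example = TaskInstance(
    project="Lang",
    commit_sha="abc123",
    evolution_types={"Test-Breaking", "Test-Missing"},
    affected_tests=["org.apache.commons.lang3.StringUtilsTest#testAbbreviate"],
    ground_truth_patch="--- a/src/test/... (diff elided)",
)
```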
If this is right
- Reactive loops based on test execution will continue to fail on stale and missing tests even if models improve.
- Benchmarks limited to method-level or pre-paired inputs hide the location and new-test problems that dominate real maintenance.
- Surface-executable patches do not guarantee semantic alignment with how developers actually update tests.
- Closing the gap on Test-Stale will require explicit mechanisms for reasoning about code intent rather than failure signals.
Where Pith is reading between the lines
- Future agents could benefit from separate passes that first summarize changed behavior before attempting test edits.
- Extending TEBench to multi-commit histories might reveal whether the observed ceiling persists or compounds over time.
- The divergence between executable patches and ground truth suggests that test-generation metrics should also measure intent preservation.
Load-bearing premise
The 314 curated tasks from ten Defects4J projects are representative of typical test-evolution needs and carry accurate, unbiased ground-truth annotations.
What would settle it
An agent that exceeds 60% F1 on the same TEBench identification tasks while producing patches whose surface form and intent match developer ground truth more closely than current outputs would falsify the shared-ceiling claim.
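For reference, a minimal sketch of how per-instance identification F1 could be computed as set overlap between the tests an agent flags and the tests developers actually modified; the exact scoring protocol is defined by TEBench, not by this sketch.

```python
def identification_f1(predicted: set[str], ground_truth: set[str]) -> float:
    """Set-overlap F1 between predicted and developer-modified test methods.

    Assumes tests are compared by fully qualified name; TEBench's own
    scoring protocol may differ in matching granularity.
    """
    if not predicted and not ground_truth:
        return 1.0
    true_positives = len(predicted & ground_truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)

# Example: two of three flagged tests match the three developer-modified tests.
print(identification_f1({"TestA#m1", "TestB#m2", "TestC#m3"},
                        {"TestA#m1", "TestB#m2", "TestD#m4"}))  # ~0.667
```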
read the original abstract
As production code evolves, the test suite must co-evolve to remain effective. Existing benchmarks for test evolution operate at method-level granularity with pre-paired inputs, bypassing the task of locating affected tests from the full project and excluding the need for new tests entirely. We present TEBench, the first project-level benchmark for test evolution. Given a project repository and a code-changing commit, TEBench requires systems to autonomously identify tests requiring modification, determine where new tests are needed, and produce the corresponding test patch. We construct TEBench through a four-stage pipeline over Defects4J projects, curating 314 task instances from 10 projects with developer-written ground truth. Each instance is annotated with one or more of three evolution types: Test-Breaking (tests that fail), Test-Stale (tests that pass but no longer meaningfully validate updated behavior), and Test-Missing (new tests needed for introduced behavior). We evaluate seven configurations spanning three industrial agent frameworks (Claude Code, Codex CLI, OpenCode) and six base models, alongside a heuristic baseline. All seven configurations converge on an identification F1 of 45.7% to 49.4%, revealing a shared performance ceiling across both frameworks and base models. Test-Stale is the most challenging type, averaging F1 around 36%, since configurations rely on execution failure signals and lack proactive semantic reasoning. On the update task, configurations produce highly executable test modifications whose surface form diverges substantially from ground truth. Trajectory analysis reveals a reactive "execute-fail-fix" loop that succeeds for breaking tests but structurally cannot address stale or missing tests. TEBench is available at https://github.com/iSEngLab/TEBench with a leaderboard at https://tebench-leadership.vercel.app.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TEBench, the first project-level benchmark for test evolution. It uses a four-stage pipeline over 10 Defects4J projects to curate 314 task instances, each tied to a code-changing commit and annotated with developer-written ground truth for one or more of three types: Test-Breaking (failing tests), Test-Stale (passing but semantically outdated tests), and Test-Missing (new tests required for added behavior). Seven agent configurations spanning three frameworks and six base models are evaluated on identifying tests to update and generating patches; all converge to identification F1 scores of 45.7–49.4%, with Test-Stale hardest (~36% F1). The work attributes the ceiling to agents' reactive execute-fail-fix behavior, which cannot address semantic cases, and releases the benchmark with a leaderboard.
Significance. If the ground-truth annotations hold, the results establish a reproducible performance ceiling for current coding agents on project-level test co-evolution and isolate the specific failure mode for semantic (non-execution) cases. The public release of TEBench, the 314-instance dataset, and the leaderboard constitute a concrete, falsifiable resource that can drive targeted improvements in agent semantic reasoning and proactive test maintenance. These strengths outweigh the current empirical gaps.
major comments (1)
- Benchmark construction (four-stage pipeline over Defects4J): The central claim of a shared 45.7–49.4% identification F1 ceiling and the diagnosis that Test-Stale is hardest (~36% F1) because agents lack semantic reasoning both rest on the accuracy and lack of bias in the developer-written labels for Test-Stale and Test-Missing. The manuscript supplies no inter-annotator agreement, label-distribution statistics, or independent validation that the semantic judgments (e.g., “no longer meaningfully validates updated behavior”) are consistent and representative. If the curation stages systematically over- or under-label stale/missing cases, the observed type-specific gap and the “reactive loop” conclusion become artifacts of the benchmark rather than intrinsic agent limitations.
minor comments (2)
- Abstract and evaluation setup: The abstract reports convergence across seven configurations and names the three frameworks (Claude Code, Codex CLI, OpenCode), but the six base models are left unnamed; readers must reach the experimental section to learn which models were evaluated. Adding a compact table or parenthetical list of models in the abstract would improve accessibility.
- Trajectory analysis: The claim that agents follow an “execute-fail-fix” loop is supported by qualitative examples, but no quantitative breakdown (e.g., percentage of trajectories that terminate without addressing stale/missing tests) is provided. A small table summarizing loop statistics per type would strengthen the evidence.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript introducing TEBench. We address the major comment regarding benchmark construction below, providing clarifications and committing to revisions where appropriate.
read point-by-point responses
Referee: Benchmark construction (four-stage pipeline over Defects4J): The central claim of a shared 45.7–49.4% identification F1 ceiling and the diagnosis that Test-Stale is hardest (~36% F1) because agents lack semantic reasoning both rest on the accuracy and lack of bias in the developer-written labels for Test-Stale and Test-Missing. The manuscript supplies no inter-annotator agreement, label-distribution statistics, or independent validation that the semantic judgments (e.g., “no longer meaningfully validates updated behavior”) are consistent and representative. If the curation stages systematically over- or under-label stale/missing cases, the observed type-specific gap and the “reactive loop” conclusion become artifacts of the benchmark rather than intrinsic agent limitations.
Authors: We appreciate the referee's emphasis on the reliability of our ground-truth annotations. The labels in TEBench are derived from actual developer-written changes in the Defects4J commit history: Test-Breaking tests are those that fail after the code change and were updated in the commit; Test-Stale tests pass execution but were nevertheless modified by developers, reflecting semantic staleness; and Test-Missing cases involve new tests added for the introduced behavior. This approach grounds the annotations in observable developer actions rather than subjective post-hoc judgments. We will revise the manuscript to include label-distribution statistics (e.g., the proportion of each type across the 314 instances) and expand the description of the four-stage pipeline to include details on how potential biases were mitigated during curation. Regarding inter-annotator agreement, because the ground truth is extracted directly from commit diffs and developer modifications rather than independent multi-annotator labeling, traditional IAA metrics do not apply. We will clarify this in the revised version and note it as a characteristic of our benchmark construction.
Revision: partial
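The labeling rule described in this response can be summarized in a small sketch (our paraphrase of the rebuttal; the input sets and their names are hypothetical, not the pipeline's actual interface):

```python
def label_evolution_types(changed_tests: set[str],
                          added_tests: set[str],
                          failing_after_change: set[str]) -> set[str]:
    """Paraphrase of the labeling rule described in the rebuttal.

    Hypothetical inputs: existing tests developers modified in the commit,
    tests they newly added, and tests that fail when the updated code is
    run against the pre-commit test suite.
    """
    labels = set()
    if changed_tests & failing_after_change:
        labels.add("Test-Breaking")  # modified tests that also fail after the code change
    if changed_tests - failing_after_change:
        labels.add("Test-Stale")     # modified tests that still pass, i.e. semantically outdated
    if added_tests:
        labels.add("Test-Missing")   # developers wrote new tests for the introduced behavior
    return labels
```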
Circularity Check
No circularity: the results are direct empirical measurements on an independently curated benchmark.
full rationale
The paper's core results (agent F1 scores of 45.7–49.4% with Test-Stale at ~36%) are obtained by executing seven agent configurations on the 314 TEBench instances and comparing outputs against the developer-written ground-truth annotations produced by the four-stage Defects4J pipeline. No equations, parameter fitting, self-definitional loops, or load-bearing self-citations are invoked to derive these numbers; they are straightforward evaluation metrics. The benchmark construction itself is a one-time curation step whose validity is an external assumption, not a derivation that reduces to the reported F1 values by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Defects4J projects and commits provide representative and diverse examples of real-world code changes requiring test evolution.
Reference graph
Works this paper leans on
- [1] Ahmed, T., Hirzel, M., Pan, R., Shinnar, A., and Sinha, S. TDD-Bench Verified: Can LLMs generate tests for issues before they get resolved? arXiv preprint arXiv:2412.02883.
- [2] Chowdhury, N., Aider, J., Cassano, F., Zhuo, J., Liu, Q., Jimenez, C. E., Narasimhan, K., and Press, O. SWE-Bench Verified: A stricter evaluation for AI software engineering. arXiv preprint arXiv:2406.12952.
- [3] DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models. CoRR, abs/2512.02556.
- [4] Deng, X., Da, J., Pan, E., He, Y. Y., Ide, C., Garg, K., Lauffer, N., Park, A., Pasari, N., Rane, C., et al. SWE-Bench Pro: Can AI agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941.
- [5] GLM. GLM-5: From vibe coding to agentic engineering. CoRR, abs/2602.15763.
- [6] Just, R., Jalali, D., and Ernst, M. D. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA), pp. 437–440, 2014.
- [7] Team, K. Kimi K2.5: Visual agentic intelligence. CoRR, abs/2602.02276.
- [8] Team, Q. Qwen3 technical report. CoRR, abs/2505.09388.
- [9] Thai, M. V., Le, T., Manh, D. N., Nhat, H. P., and Bui, N. D. SWE-Evo: Benchmarking coding agents in long-horizon software evolution scenarios. arXiv preprint arXiv:2512.18470.
- [10] Wang, W., Yang, C., Wang, Z., Huang, Y., Chu, Z., Song, D., Zhang, L., Chen, A. R., and Ma, L. TestEval: Benchmarking large language models for test case generation. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 3547–3562, 2025a. Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Si...
- [11] Yang, C., Chen, J., Lin, B., Zhou, J., and Wang, Z. Enhancing LLM-based test generation for hard-to-cover branches via program analysis. arXiv preprint arXiv:2404.04966, 2024a. Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. SWE-Agent: Agent-computer interfaces enable automated software engineering. Advances in ...
- [12] Zhang, Y., Yang, Z., Pan, S., and Liu, Z. Unit test update through LLM-driven context collection and error-type-aware refinement. arXiv preprint arXiv:2509.24419.