pith. machine review for the scientific record.

arxiv: 2605.06125 · v1 · submitted 2026-05-07 · 💻 cs.SE

Recognition: unknown

Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:59 UTC · model grok-4.3

classification 💻 cs.SE
keywords: test evolution · coding agents · project-level benchmark · software testing · test maintenance · Defects4J · AI agents

The pith

Coding agents hit a shared ceiling of 45.7-49.4% F1 when identifying which tests need updates after real code commits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

As code changes, tests must be located, revised, or added to keep covering the new behavior, yet prior benchmarks sidestepped these steps by using pre-paired method-level examples. TEBench supplies 314 project-level tasks drawn from Defects4J commits, each requiring an agent to scan the full repository, classify tests into breaking, stale, or missing categories, and emit a patch. Seven agent configurations built on three frameworks and six base models all converge on nearly identical identification accuracy, with stale tests proving hardest because agents depend on runtime failures rather than understanding what the updated code should do. The resulting test patches execute well but diverge from developer ground truth in structure and intent.

Core claim

TEBench shows that all evaluated configurations converge on an identification F1 between 45.7% and 49.4%, exposing a performance ceiling independent of framework or base model, while Test-Stale cases average only around 36% F1 because agents follow a reactive execute-fail-fix loop that cannot address tests that still pass yet no longer validate the intended behavior.

What carries the argument

The four-stage pipeline that selects Defects4J commits, extracts project-level task instances, annotates each with one or more of Test-Breaking, Test-Stale, or Test-Missing labels, and attaches developer-written ground-truth patches.

If this is right

  • Reactive loops based on test execution will continue to fail on stale and missing tests even if models improve.
  • Benchmarks limited to method-level or pre-paired inputs hide the location and new-test problems that dominate real maintenance.
  • Surface-executable patches do not guarantee semantic alignment with how developers actually update tests.
  • Closing the gap on Test-Stale will require explicit mechanisms for reasoning about code intent rather than failure signals.
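The reactive loop blamed for these failure modes can be made concrete with a short sketch. This is not the paper's agent code: `run_tests` and `propose_fix` are hypothetical stand-ins. The structural point is that the loop's only signal is a test failure, so a stale test, which still passes, can never enter the repair queue.

```python
# Hedged sketch of the "execute-fail-fix" loop described above.
# run_tests() returns the set of currently failing test names;
# propose_fix(test) attempts a repair. Both are hypothetical hooks.
def execute_fail_fix(run_tests, propose_fix, max_rounds=5):
    touched = set()
    for _ in range(max_rounds):
        failures = run_tests()      # only failing tests are visible to the agent
        if not failures:
            return touched          # stale tests never appear in `failures`,
        for test in failures:       # so they are never selected for revision
            propose_fix(test)
            touched.add(test)
    return touched
```

Under this structure, improving the base model sharpens each fix but leaves the selection blind spot intact, which is exactly what the convergent F1 ceiling suggests.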

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future agents could benefit from separate passes that first summarize changed behavior before attempting test edits.
  • Extending TEBench to multi-commit histories might reveal whether the observed ceiling persists or compounds over time.
  • The divergence between executable patches and ground truth suggests that test-generation metrics should also measure intent preservation.

Load-bearing premise

The 314 curated tasks from ten Defects4J projects are representative of typical test-evolution needs and carry accurate, unbiased ground-truth annotations.

What would settle it

An agent that exceeds 60% F1 on the same TEBench identification tasks while producing patches whose surface form and intent match developer ground truth more closely than current outputs would falsify the shared-ceiling claim.
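For concreteness, identification F1 can be read as precision/recall over (test, label) pairs. The scoring unit in the paper's harness may differ; this is an illustrative sketch of the metric, not the official evaluation code.

```python
# Hedged sketch: identification F1 over sets of (test, evolution-label) pairs.
# A plausible reading of the metric behind the 45.7-49.4% ceiling.
def identification_f1(predicted: set, ground_truth: set) -> float:
    if not predicted and not ground_truth:
        return 1.0  # vacuously perfect when nothing needs evolving
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

An agent clearing 60% on this measure across all 314 tasks would directly contradict the shared-ceiling claim.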

Figures

Figures reproduced from arXiv: 2605.06125 by Chunrong Fang, Haichuan Hu, Liang Xiao, Quanjun Zhang, Ye Shang, Zhenyu Chen.

Figure 1. Motivating example from JSOUP: a change to isInlineable() in Element.java impacts multiple test files across different packages, exhibiting three types of test evolution.
Figure 2. Task construction pipeline. Numbers indicate the remaining commits after each stage.
Figure 3. Three-version structure and its dual role in classification and evaluation.
Figure 4. Unified task prompt provided to all configurations.
Figure 5. Case study on Task 293 (jsoup): GT changes (top) versus actual modifications produced by each configuration (bottom).
Original abstract

As production code evolves, the test suite must co-evolve to remain effective. Existing benchmarks for test evolution operate at method-level granularity with pre-paired inputs, bypassing the task of locating affected tests from the full project and excluding the need for new tests entirely. We present TEBench, the first project-level benchmark for test evolution. Given a project repository and a code-changing commit, TEBench requires systems to autonomously identify tests requiring modification, determine where new tests are needed, and produce the corresponding test patch. We construct TEBench through a four-stage pipeline over Defects4J projects, curating 314 task instances from 10 projects with developer-written ground truth. Each instance is annotated with one or more of three evolution types: Test-Breaking (tests that fail), Test-Stale (tests that pass but no longer meaningfully validate updated behavior), and Test-Missing (new tests needed for introduced behavior). We evaluate seven configurations spanning three industrial agent frameworks (Claude Code, Codex CLI, OpenCode) and six base models, alongside a heuristic baseline. All seven configurations converge on an identification F1 of 45.7% to 49.4%, revealing a shared performance ceiling across both frameworks and base models. Test-Stale is the most challenging type, averaging F1 around 36%, since configurations rely on execution failure signals and lack proactive semantic reasoning. On the update task, configurations produce highly executable test modifications whose surface form diverges substantially from ground truth. Trajectory analysis reveals a reactive "execute-fail-fix" loop that succeeds for breaking tests but structurally cannot address stale or missing tests. TEBench is available at https://github.com/iSEngLab/TEBench with a leaderboard at https://tebench-leadership.vercel.app.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces TEBench, the first project-level benchmark for test evolution. It uses a four-stage pipeline over 10 Defects4J projects to curate 314 task instances, each tied to a code-changing commit and annotated with developer-written ground truth for one or more of three types: Test-Breaking (failing tests), Test-Stale (passing but semantically outdated tests), and Test-Missing (new tests required for added behavior). Seven agent configurations spanning three frameworks and six base models are evaluated on identifying tests to update and generating patches; all converge to identification F1 scores of 45.7–49.4%, with Test-Stale hardest (~36% F1). The work attributes the ceiling to agents' reactive execute-fail-fix behavior, which cannot address semantic cases, and releases the benchmark with a leaderboard.

Significance. If the ground-truth annotations hold, the results establish a reproducible performance ceiling for current coding agents on project-level test co-evolution and isolate the specific failure mode for semantic (non-execution) cases. The public release of TEBench, the 314-instance dataset, and the leaderboard constitute a concrete, falsifiable resource that can drive targeted improvements in agent semantic reasoning and proactive test maintenance. These strengths outweigh the current empirical gaps.

major comments (1)
  1. Benchmark construction (four-stage pipeline over Defects4J): The central claim of a shared 45.7–49.4% identification F1 ceiling and the diagnosis that Test-Stale is hardest (~36% F1) because agents lack semantic reasoning both rest on the accuracy and lack of bias in the developer-written labels for Test-Stale and Test-Missing. The manuscript supplies no inter-annotator agreement, label-distribution statistics, or independent validation that the semantic judgments (e.g., “no longer meaningfully validates updated behavior”) are consistent and representative. If the curation stages systematically over- or under-label stale/missing cases, the observed type-specific gap and the “reactive loop” conclusion become artifacts of the benchmark rather than intrinsic agent limitations.
minor comments (2)
  1. Abstract and evaluation setup: The abstract reports convergence across “seven configurations” and “six base models” without naming them; readers must reach the experimental section to learn the exact frameworks (Claude Code, Codex CLI, OpenCode) and models. Adding a compact table or parenthetical list in the abstract would improve accessibility.
  2. Trajectory analysis: The claim that agents follow an “execute-fail-fix” loop is supported by qualitative examples, but no quantitative breakdown (e.g., percentage of trajectories that terminate without addressing stale/missing tests) is provided. A small table summarizing loop statistics per type would strengthen the evidence.
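The loop-statistics table requested in minor comment 2 would be a small aggregation. A hedged sketch of what it could compute, per evolution type, is the share of agent trajectories that terminate without editing any of the labeled tests. The trajectory schema here is hypothetical.

```python
# Hedged sketch of the requested per-type loop statistics.
# trajectories: iterable of (evolution_type, edited_labeled_test) pairs,
# where edited_labeled_test is True if the agent touched any ground-truth test.
from collections import defaultdict

def unaddressed_rate(trajectories):
    totals = defaultdict(int)
    unaddressed = defaultdict(int)
    for etype, edited in trajectories:
        totals[etype] += 1
        if not edited:
            unaddressed[etype] += 1
    return {t: unaddressed[t] / totals[t] for t in totals}
```

A high unaddressed rate for Test-Stale alongside a near-zero rate for Test-Breaking would quantitatively confirm the execute-fail-fix diagnosis.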

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript introducing TEBench. We address the major comment regarding benchmark construction below, providing clarifications and committing to revisions where appropriate.

read point-by-point responses
  1. Referee: Benchmark construction (four-stage pipeline over Defects4J): The central claim of a shared 45.7–49.4% identification F1 ceiling and the diagnosis that Test-Stale is hardest (~36% F1) because agents lack semantic reasoning both rest on the accuracy and lack of bias in the developer-written labels for Test-Stale and Test-Missing. The manuscript supplies no inter-annotator agreement, label-distribution statistics, or independent validation that the semantic judgments (e.g., “no longer meaningfully validates updated behavior”) are consistent and representative. If the curation stages systematically over- or under-label stale/missing cases, the observed type-specific gap and the “reactive loop” conclusion become artifacts of the benchmark rather than intrinsic agent limitations.

    Authors: We appreciate the referee's emphasis on the reliability of our ground-truth annotations. The labels in TEBench are derived from actual developer-written changes in the Defects4J commit history: Test-Breaking tests are those that fail after the code change and were updated in the commit; Test-Stale tests pass execution but were nevertheless modified by developers, reflecting semantic staleness; and Test-Missing cases involve new tests added for the introduced behavior. This approach grounds the annotations in observable developer actions rather than subjective post-hoc judgments. We will revise the manuscript to include label-distribution statistics (e.g., the proportion of each type across the 314 instances) and expand the description of the four-stage pipeline to include details on how potential biases were mitigated during curation. Regarding inter-annotator agreement, because the ground truth is extracted directly from commit diffs and developer modifications rather than independent multi-annotator labeling, traditional IAA metrics do not apply. We will clarify this in the revised version and note it as a characteristic of our benchmark construction.

    revision: partial
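The labeling rule the rebuttal spells out is mechanical enough to sketch. The class and field names below are illustrative, not the paper's actual pipeline code; the logic follows the rebuttal's stated derivation from commit diffs and test execution.

```python
# Hedged sketch of TEBench-style label derivation from one commit,
# per the rebuttal: fails + edited -> Breaking; passes + edited -> Stale;
# newly added -> Missing. Schema is hypothetical.
from dataclasses import dataclass

@dataclass
class TestChange:
    name: str
    fails_after_code_change: bool  # old test run against new production code
    modified_in_commit: bool       # developer edited this test in the commit
    added_in_commit: bool          # test is new in the commit

def evolution_labels(tests):
    labels = {"Test-Breaking": set(), "Test-Stale": set(), "Test-Missing": set()}
    for t in tests:
        if t.added_in_commit:
            labels["Test-Missing"].add(t.name)    # new behavior needed new tests
        elif t.fails_after_code_change and t.modified_in_commit:
            labels["Test-Breaking"].add(t.name)   # fails, and developers fixed it
        elif t.modified_in_commit:
            labels["Test-Stale"].add(t.name)      # passes, yet developers changed it
    return labels
```

Note how this grounds Test-Stale in an observable developer action (an edit to a passing test) rather than a subjective semantic judgment, which is the crux of the rebuttal.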

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements on an independently curated benchmark.

full rationale

The paper's core results (agent F1 scores of 45.7–49.4% with Test-Stale at ~36%) are obtained by executing seven agent configurations on the 314 TEBench instances and comparing outputs against the developer-written ground-truth annotations produced by the four-stage Defects4J pipeline. No equations, parameter fitting, self-definitional loops, or load-bearing self-citations are invoked to derive these numbers; they are straightforward evaluation metrics. The benchmark construction itself is a one-time curation step whose validity is an external assumption, not a derivation that reduces to the reported F1 values by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the four-stage curation pipeline and annotations from Defects4J projects, which depend on assumptions about representativeness of selected projects and completeness of developer-written ground truth.

axioms (1)
  • domain assumption Defects4J projects and commits provide representative and diverse examples of real-world code changes requiring test evolution
    The benchmark is constructed through a pipeline over Defects4J projects, assuming these capture sufficient realism for the three evolution types.

pith-pipeline@v0.9.0 · 5642 in / 1401 out tokens · 75095 ms · 2026-05-08T08:59:01.836133+00:00 · methodology

discussion (0)

