Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset
Pith reviewed 2026-05-07 13:05 UTC · model grok-4.3
The pith
Over 28% of Defects4J defects either fail the strict reproducibility checks needed for reliable automated program repair experiments or have obviously under-specified test suites.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When executing the test suites with strict requirements for reproducibility in APR settings beyond merely reproducing the defect via test cases, 180 (21.6 %) of the defects are not suitable for evaluation experiments. Further, an additional 59 (7.1 %) defects have test suites that are obviously under-specified, as deleting a single statement from the code base makes all test cases pass, although the human-written patch does not only delete code.
What carries the argument
The authors' set of strict reproducibility requirements for APR test execution together with the single-statement deletion test that detects obviously under-specified test suites.
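For concreteness, a minimal sketch (not the authors' implementation) of how the single-statement deletion heuristic could be re-run on top of the Defects4J command line. It assumes the documented `defects4j checkout`, `compile`, and `test` subcommands, that `defects4j test` reports "Failing tests: 0" when the suite is green, and that the statement spans to try come from some external Java parser; `deletion_makes_tests_pass`, `java_file`, and `statement_spans` are hypothetical names.

```python
import subprocess
import tempfile
from pathlib import Path

def run(cmd, cwd):
    """Run a command, returning (exit code, captured stdout)."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return proc.returncode, proc.stdout

def tests_all_pass(workdir):
    """Compile and run the Defects4J test suite; True only if nothing fails."""
    if run(["defects4j", "compile"], workdir)[0] != 0:
        return False                               # mutated variant does not even build
    _, out = run(["defects4j", "test"], workdir)
    return "Failing tests: 0" in out               # assumption about the CLI's summary line

def deletion_makes_tests_pass(project, bug_id, java_file, statement_spans):
    """Flag an under-specified suite: some single-statement deletion in
    `java_file` (spans as 1-based (start_line, end_line)) passes all tests."""
    with tempfile.TemporaryDirectory() as workdir:
        run(["defects4j", "checkout", "-p", project,
             "-v", f"{bug_id}b", "-w", workdir], cwd=".")
        target = Path(workdir) / java_file
        original = target.read_text().splitlines(keepends=True)
        for start, end in statement_spans:          # try one deletion at a time
            target.write_text("".join(original[:start - 1] + original[end:]))
            if tests_all_pass(workdir):
                return True                          # suite accepts a pure deletion
            target.write_text("".join(original))     # restore before the next attempt
    return False
```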
If this is right
- APR tool comparisons performed on the unfiltered Defects4J dataset risk including results from defects whose test suites do not reliably indicate repair success.
- The supplied evaluation framework can automatically flag flaky tests and other hidden problems that standard repeatability checks miss (a sketch of such a flakiness check follows this list).
- Researchers should filter Defects4J or apply similar additional checks before running APR experiments to avoid invalid conclusions.
- New defect benchmarks for APR need explicit criteria for test-suite adequacy beyond basic defect reproduction.
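As an illustration of the kind of automated flakiness check referred to above, a hedged sketch that runs the same Defects4J test suite several times and reports tests whose verdict is not stable across runs. The parsing of `defects4j test` output (a "Failing tests:" summary followed by one "  - Class::method" line per failure) is an assumption about the CLI's format, not a documented guarantee.

```python
import re
import subprocess

def failing_tests(workdir):
    """One run of `defects4j test`; returns the set of failing test names.
    Assumes failures are listed as lines like '  - org.foo.BarTest::baz'."""
    out = subprocess.run(["defects4j", "test"], cwd=workdir,
                         capture_output=True, text=True).stdout
    return set(re.findall(r"^\s*-\s*(\S+::\S+)\s*$", out, flags=re.MULTILINE))

def flaky_tests(workdir, runs=5):
    """Tests that fail in some runs but not in others of the identical suite."""
    results = [failing_tests(workdir) for _ in range(runs)]
    always_failing = set.intersection(*results)
    ever_failing = set.union(*results)
    return ever_failing - always_failing
```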
Where Pith is reading between the lines
- The same kinds of hidden test-suite problems may exist in other widely used software engineering benchmarks and could affect conclusions in those fields as well.
- Some published APR results may reflect adaptation to dataset quirks rather than genuine repair capability that would hold on better-specified tests.
- Widespread adoption of these checks could pressure open-source projects to strengthen their test suites so they remain useful for automated repair research.
Load-bearing premise
The authors' chosen strict reproducibility requirements and single-statement deletion test are the right general criteria for deciding whether a defect dataset supports valid APR evaluation.
What would settle it
Re-running every Defects4J test suite through an independent implementation of the authors' strict checking rules and obtaining substantially different counts of unsuitable defects.
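A sketch of how such an independent recount could be driven: loop over every active bug id, apply whatever strict checks one implements (the `classify_defect` body is deliberately left as a placeholder), and compare the tally against the reported 180 of 835. It assumes the Defects4J 2.x CLI, including the `bids` subcommand for listing active bug ids; the project list follows the 17 projects in the current dataset.

```python
import subprocess
from collections import Counter

PROJECTS = ["Chart", "Cli", "Closure", "Codec", "Collections", "Compress",
            "Csv", "Gson", "JacksonCore", "JacksonDatabind", "JacksonXml",
            "Jsoup", "JxPath", "Lang", "Math", "Mockito", "Time"]

def active_bug_ids(project):
    """Active bug ids of a project, assuming `defects4j bids -p <Project>`
    prints one id per line (Defects4J 2.x)."""
    out = subprocess.run(["defects4j", "bids", "-p", project],
                         capture_output=True, text=True, check=True).stdout
    return [int(tok) for tok in out.split() if tok.isdigit()]

def classify_defect(project, bug_id):
    """Placeholder for an independent implementation of the strict checks:
    should return e.g. 'ok', 'flaky', 'build-failure', or 'under-specified'."""
    raise NotImplementedError

def recount():
    verdicts = Counter()
    for project in PROJECTS:
        for bug_id in active_bug_ids(project):
            verdicts[classify_defect(project, bug_id)] += 1
    unsuitable = sum(n for verdict, n in verdicts.items() if verdict != "ok")
    total = sum(verdicts.values())
    print(dict(verdicts))
    print(f"unsuitable: {unsuitable}/{total} (paper reports 180/835)")
```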
Original abstract
In the research of automated program repair (APR), benchmark datasets consisting of known defects in combination with test suites that indicate the defects are of high importance. They allow for an evidence-based comparison of different APR approaches. In our own work on APR we found significant challenges when working with widely used defect datasets, which go beyond mere repeatability of defects via test cases. We summarize these identified challenges and related lessons learned to bring them to the attention of the APR community and quantify the potential impact of them. In particular, we investigate the widely used benchmark Defects4J, which has according to Google Scholar over 1,800 citations. It consists of 835 defects from 17 open-source Java projects; a hand-curated collection of defects, test suites that clearly indicate the defect, and human patches where any unrelated changes are removed. We find that, when executing the test suites with strict requirements for reproducibility in APR settings (beyond merely reproducing the defect via test cases), 180 (21.6 %) of the defects are not suitable for evaluation experiments. Further, we find that an additional 59 (7.1 %) defects have test suites that are obviously under-specified, as deleting a single statement from the code base makes all test cases pass, although the human-written patch does not only delete code. Our contributions are: a systematic collection of requirements for defect datasets for APR beyond traditional reproducibility of defects, a description of practical experiences and quantitative analysis of problems with the Defects4J dataset, as well as an implementation of an evaluation framework for APR tools for Java programs. This evaluation framework does stricter checking for indications of inadequate test suites, to avoid otherwise unnoticed problems in the test suite, such as flaky tests.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates reproducibility challenges in using the Defects4J benchmark for automated program repair (APR). It reports that 180 (21.6%) of the 835 defects are unsuitable for APR evaluation experiments under strict reproducibility requirements beyond basic defect reproduction via tests, and an additional 59 (7.1%) have under-specified test suites as shown by cases where deleting a single statement makes all tests pass despite the human patch not being a pure deletion. Contributions include a systematic collection of requirements for APR defect datasets, quantitative analysis of Defects4J problems, and an open evaluation framework for Java APR tools that applies stricter checks for issues like flaky tests.
Significance. If the quantitative findings hold, the work is significant because Defects4J is a standard benchmark with over 1,800 citations; documenting these limitations and releasing a stricter evaluation framework could improve the reliability of APR experiments. The open implementation and focus on practical experiences are strengths that support reproducibility in the field.
major comments (2)
- [Quantitative analysis / results on unsuitable defects] The section on quantitative analysis of unsuitable defects provides the 180 (21.6%) figure and mentions 'strict requirements for reproducibility in APR settings' but gives no concrete details on the exact criteria used, the execution process, manual verification steps, or error rates in classification. This makes the central claim hard to verify independently.
- [Analysis of under-specified test suites] In the analysis of the 59 (7.1%) under-specified defects, the single-statement deletion test is used to flag cases where tests pass but the human patch is not a pure deletion. However, this does not consider that other minimal, semantically valid patches could also pass the tests, which would not necessarily render the defect unsuitable for APR evaluation; no validation against actual APR tool outputs or prior published uses of these defects is provided.
minor comments (1)
- [Abstract] The abstract states the contributions but does not include a link or citation to the released evaluation framework implementation, which should be added to enable immediate access.
Simulated Author's Rebuttal
Thank you for your detailed and constructive referee report on our manuscript. We appreciate the feedback aimed at improving the clarity and verifiability of our findings. We respond to each major comment below and will make the necessary revisions to the manuscript.
Point-by-point responses
- Referee: [Quantitative analysis / results on unsuitable defects] The section on quantitative analysis of unsuitable defects provides the 180 (21.6%) figure and mentions 'strict requirements for reproducibility in APR settings' but gives no concrete details on the exact criteria used, the execution process, manual verification steps, or error rates in classification. This makes the central claim hard to verify independently.
  Authors: We thank the referee for highlighting the need for greater transparency in our quantitative analysis. The criteria for determining unsuitability are outlined in detail in Section 3, which presents a systematic collection of requirements for APR defect datasets beyond basic test-based reproduction. These include checks for issues such as flaky tests, non-reproducible builds, and other factors affecting strict reproducibility. The analysis was conducted by applying our open-source evaluation framework to the Defects4J dataset, which automates most of the checks; manual verification was performed for ambiguous cases. We agree that a more explicit description of the process and of any limitations in classification would aid verification. In the revised version, we will add a dedicated subsection elaborating on the execution steps, the manual review process, and potential sources of classification uncertainty. Formal error rates (e.g., inter-rater reliability) were not computed because the process did not involve multiple annotators, but we will discuss this limitation. Revision: yes.
- Referee: [Analysis of under-specified test suites] In the analysis of the 59 (7.1%) under-specified defects, the single-statement deletion test is used to flag cases where tests pass but the human patch is not a pure deletion. However, this does not consider that other minimal, semantically valid patches could also pass the tests, which would not necessarily render the defect unsuitable for APR evaluation; no validation against actual APR tool outputs or prior published uses of these defects is provided.
  Authors: We agree that the single-statement deletion heuristic is not exhaustive and that other minimal patches could satisfy the tests without matching the human patch exactly. Nevertheless, identifying cases where a pure deletion passes all tests while the human patch involves non-deletion changes provides strong evidence of under-specification, as the test suite cannot distinguish the intended fix from an alternative, and likely incorrect, modification. This is particularly relevant for APR evaluation, where tools may exploit such weaknesses. We did not validate against specific APR tool outputs in the current manuscript, focusing instead on dataset-level properties, but we recognize the value of doing so and will add such an analysis in the revision, for example by examining the outputs of representative APR tools on these defects. We will also reference prior publications that have used these defects and discuss the potential impact on their conclusions. These additions will be made during the major revision. Revision: yes.
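The other half of the flagging condition discussed above, that the human-written patch "does not only delete code", can be checked mechanically from the developer diff. A minimal sketch over a unified diff string (e.g. obtained by diffing the buggy and fixed Defects4J revisions); treating any added non-blank line as evidence that the patch is not deletion-only is a simplification that ignores comment-only and whitespace-only additions.

```python
def is_deletion_only(unified_diff: str) -> bool:
    """True if the developer patch removes code but adds no non-blank lines."""
    added = removed = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue                      # file headers, not patch content
        if line.startswith("+") and line[1:].strip():
            added += 1
        elif line.startswith("-") and line[1:].strip():
            removed += 1
    return removed > 0 and added == 0

# A defect is flagged as under-specified only when some single-statement
# deletion makes the whole suite pass AND is_deletion_only(dev_patch) is False.
```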
Circularity Check
No significant circularity detected.
Full rationale
The paper conducts an empirical study by running Defects4J test suites under additional reproducibility constraints and applying a single-statement deletion check to flag under-specified suites. These operations are direct executions against the existing codebase and produce the reported counts (180 unsuitable, 59 under-specified) as observational outputs. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations are present that would reduce the central claims to their own inputs by construction. The stated requirements for APR datasets are presented as lessons from practice rather than derived results that presuppose the findings.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Strict reproducibility requirements that go beyond merely reproducing the defect via test cases are necessary for valid APR evaluation.
- domain assumption: A test suite is obviously under-specified if deleting some single statement causes all tests to pass while the human patch changes more than that statement.
Reference graph
Works this paper leans on
- [1] R. Just, D. Jalali, and M. D. Ernst, "Defects4J: A database of existing faults to enable controlled testing studies for Java programs," in Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA), 2014, pp. 437–440.
- [2] H.-N. Zhu, R. M. Furth, M. Pradel, and C. Rubio-González, "From bugs to benchmarks: A comprehensive survey of software defect datasets," arXiv preprint arXiv:2504.17977, 2025.
- [3] H.-N. Zhu and C. Rubio-González, "On the reproducibility of software defect datasets," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 2324–2335.
- [4] M. Baker, "Over half of psychology studies fail reproducibility test," news article in Nature, 2015, last visited: 02.05.2021. [Online]. Available: https://www.nature.com/news/over-half-of-psychology-studies-fail-reproducibility-test-1.18248
- [5] A. F. R. Cordeiro and E. Oliveira, Jr., "Reviewing reproducibility in software engineering research," in Proceedings of the 27th International Conference on Enterprise Information Systems (ICEIS), vol. 2, 2025, pp. 364–371.
- [6] J. M. González-Barahona and G. Robles, "On the reproducibility of empirical software engineering studies based on data retrieved from development repositories," Empirical Software Engineering, vol. 17, pp. 75–89, Oct. 2012.
- [7] J. M. González-Barahona and G. Robles, "Revisiting the reproducibility of empirical software engineering studies based on data retrieved from development repositories," Information and Software Technology, vol. 164, p. 107318, Dec. 2023.
- [8] Y. Qin, S. Wang, K. Liu, X. Mao, and T. F. Bissyandé, "On the impact of flaky tests in automated program repair," in Proceedings of the 28th International Conference on Software Analysis, Evolution and Reengineering (SANER'21). IEEE, Mar. 2021, pp. 295–306.
- [9] M. Gruber and G. Fraser, "Debugging flaky tests using spectrum-based fault localization," in Proceedings of the 4th International Conference on Automation of Software Test (AST'23). IEEE, May 2023, pp. 128–139.
- [10] M. Martinez, T. Durieux, R. Sommerard, J. Xuan, and M. Monperrus, "Automatic repair of real bugs in Java: a large-scale experiment on the Defects4J dataset," Empirical Software Engineering, vol. 22, no. 4, pp. 1936–1964, Aug. 2017.
- [11] M. N. Rafi, A. R. Chen, T.-H. P. Chen, and S. Wang, "Revisiting Defects4J for fault localization in diverse development scenarios," in 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). IEEE, 2025, pp. 63–75.
- [12] V. Sobreira, T. Durieux, F. Madeiral, M. Monperrus, and M. de Almeida Maia, "Dissection of a bug dataset: Anatomy of 395 patches from Defects4J," in 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2018.
- [13]
- [14] J. Y. Lee, S. Kang, J. Yoon, and S. Yoo, "The GitHub recent bugs dataset for evaluating LLM-based debugging applications," in 2024 IEEE Conference on Software Testing, Verification and Validation (ICST), 2024, pp. 442–444.
- [15] A. Silva, M. Martinez, B. Danglot, D. Ginelli, and M. Monperrus, "Flacoco: Fault localization for Java based on industry-grade coverage," 2021.
- [16] F. Long and M. Rinard, "An analysis of the search spaces for generate and validate patch generation systems," in Proceedings of the 38th International Conference on Software Engineering, ser. ICSE '16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 702–713.
- [17] R. Abreu, P. Zoeteweij, and A. J. Van Gemund, "An evaluation of similarity coefficients for software fault localization," in 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06). IEEE, 2006, pp. 39–46.
- [18] M. Martinez and M. Monperrus, "Astor: Exploring the design space of generate-and-validate program repair beyond GenProg," Journal of Systems and Software, vol. 151, pp. 65–80, May 2019.