How do Execution Features Improve Statistical Fault Localization? An Empirical Study

Andreas Zeller; Marius Smytzek

arxiv: 2606.30324 · v1 · pith:NX5E6PP2new · submitted 2026-06-29 · 💻 cs.SE

Pith reviewed 2026-06-30 05:12 UTC · model grok-4.3

classification 💻 cs.SE

keywords statistical fault localization · execution features · empirical study · software debugging · random forests · mixed-effects model · Tests4Py

The pith

Augmenting statistical fault localization with execution features improves accuracy and reduces inspection effort.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether statistical fault localization, which ranks lines by pass/fail execution spectra, can be strengthened by adding execution features such as data flow and branch conditions. It uses EFDD to extract these features across Tests4Py subjects, trains per-subject random forests to rank feature importance, maps the importances back to source lines, and blends the resulting weights into existing SFL formulas. The combined rankings are then evaluated on reference-patch accuracy, line- and function-level effort, robustness, and feasibility with a confounder-adjusted mixed-effects model plus paired tests. A reader would care because the added features supply information about why a test fails that plain coverage spectra omit, potentially shortening the search for the faulty line.
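To make "ranks lines by pass/fail execution spectra" concrete, here is a minimal sketch of one classic spectrum formula, Ochiai. The abstract does not say which SFL formulas the paper uses, so Ochiai is only an illustrative choice; the function name `ochiai` and the toy spectrum are hypothetical.

```python
import math


def ochiai(coverage, outcomes):
    """Return {line: Ochiai suspiciousness} from a pass/fail spectrum.

    coverage: one set of executed line numbers per test run.
    outcomes: parallel list of booleans, True if that test failed.
    """
    total_failed = sum(outcomes)
    lines = set().union(*coverage)
    scores = {}
    for line in lines:
        # Count failing and passing runs that executed this line.
        failed_cov = sum(1 for cov, failed in zip(coverage, outcomes)
                         if failed and line in cov)
        passed_cov = sum(1 for cov, failed in zip(coverage, outcomes)
                         if not failed and line in cov)
        denom = math.sqrt(total_failed * (failed_cov + passed_cov))
        scores[line] = failed_cov / denom if denom else 0.0
    return scores


# Toy spectrum (made up): three runs, the first and third fail.
coverage = [{1, 2, 3}, {1, 2}, {1, 3}]
outcomes = [True, False, True]
scores = ochiai(coverage, outcomes)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Line 3 is executed only by failing runs and therefore ranks first. The score says nothing about why those runs fail, which is the gap the execution features are meant to fill.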

Core claim

The study shows that execution features extracted with EFDD and weighted by random-forest importances can be mapped to lines and combined with standard SFL formulas to produce rankings that improve reference-patch accuracy, lower line- and function-level effort, increase robustness, and remain feasible under a mixed-effects analysis on the chosen subjects.

What carries the argument

The mapping of per-subject random-forest importances on EFDD execution features to source-line weights that are then added to SFL suspiciousness scores.
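A minimal sketch of this mapping-and-blending step, under stated assumptions: feature importances arrive as a plain dictionary (standing in for what a trained random forest would report), each feature is associated with the source lines it refers to, and the blend is a convex combination controlled by a hypothetical parameter `alpha`. The names `line_weights`, `scale`, and `blend`, the feature labels, and the numbers are all illustrative; the abstract does not give the paper's exact rule.

```python
def line_weights(feature_importance, feature_lines):
    """Sum each feature's importance onto the lines it refers to."""
    weights = {}
    for feature, importance in feature_importance.items():
        for line in feature_lines.get(feature, ()):
            weights[line] = weights.get(line, 0.0) + importance
    return weights


def scale(values):
    """Divide by the maximum so non-negative values fall in [0, 1]."""
    top = max(values.values(), default=0.0)
    return {k: (v / top if top > 0 else 0.0) for k, v in values.items()}


def blend(sfl_scores, weights, alpha=0.5):
    """Assumed blend: (1 - alpha) * SFL score + alpha * feature weight."""
    sfl = scale(sfl_scores)
    feat = scale(weights)
    return {line: (1 - alpha) * s + alpha * feat.get(line, 0.0)
            for line, s in sfl.items()}


# Made-up inputs: lines 10 and 11 are tied under plain SFL.
sfl_scores = {10: 0.8, 11: 0.8, 12: 0.5}
feature_importance = {"branch@11": 0.6, "value@12": 0.3, "call@11": 0.1}
feature_lines = {"branch@11": [11], "value@12": [12], "call@11": [11]}

blended = blend(sfl_scores, line_weights(feature_importance, feature_lines))
ranking = sorted(blended, key=blended.get, reverse=True)
```

In this toy case the feature weights break the tie between lines 10 and 11 in favor of line 11. Scaling both inputs to the same range before blending is one way to keep the two signals commensurable.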

If this is right

Reference-patch accuracy rises when execution-feature weights are included.
Both line-level and function-level developer effort decrease under the augmented rankings.
The gains hold after adjusting for subject-level confounders in a mixed-effects model.
The approach remains computationally feasible on the evaluated test subjects.
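The effort claims above can be illustrated with a small line-level effort measure. This is a sketch under an assumption: effort is taken as the expected number of lines a developer inspects, in descending score order, until reaching the first faulty line, with ties inspected in random order. The abstract does not define the paper's effort metric, and `inspection_effort` and the scores below are hypothetical.

```python
def inspection_effort(scores, faulty_lines):
    """Expected lines inspected until the first faulty line is reached.

    Raises ValueError if no faulty line appears in `scores`.
    """
    faulty = set(faulty_lines)
    best = max(scores[line] for line in faulty if line in scores)
    higher = sum(1 for s in scores.values() if s > best)
    tied = sum(1 for s in scores.values() if s == best)
    tied_faulty = sum(1 for line in faulty if scores.get(line) == best)
    # Expected position of the first of `tied_faulty` items in a random
    # ordering of `tied` items is (tied + 1) / (tied_faulty + 1).
    return higher + (tied + 1) / (tied_faulty + 1)


# Made-up scores for one subject whose fault is on line 11.
plain = {10: 0.8, 11: 0.8, 12: 0.5}
augmented = {10: 0.5, 11: 1.0, 12: 0.53}

effort_plain = inspection_effort(plain, {11})
effort_augmented = inspection_effort(augmented, {11})
```

A function-level variant could apply the same computation after aggregating line scores per function, for example by taking the maximum score within each function.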

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the mapping works reliably, execution features may supply causal signals that pure spectra lack.
The method could be tested on subjects outside Tests4Py without retraining per project.
Similar weighting might apply to other spectrum-based techniques beyond the formulas tested here.

Load-bearing premise

The per-subject random-forest models produce importances that can be mapped back to lines and combined with SFL formulas without introducing new biases or needing subject-specific tuning that fails to generalize.

What would settle it

A replication on new subjects where the augmented rankings show no gain in reference-patch accuracy or require equal or greater inspection effort compared with plain SFL would falsify the improvement.
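Such a replication could be checked with a paired test over subjects. The sketch below uses an exact two-sided sign test, chosen only because it needs nothing beyond the standard library; the paper mentions paired statistical tests without the abstract naming them. The effort values are made-up placeholders, not results from the paper.

```python
from math import comb


def sign_test(baseline, augmented):
    """Exact two-sided sign test on paired effort values (lower is better).

    Returns (wins, losses, p_value); tied pairs are dropped.
    """
    wins = sum(1 for b, a in zip(baseline, augmented) if a < b)
    losses = sum(1 for b, a in zip(baseline, augmented) if a > b)
    n = wins + losses
    if n == 0:
        return wins, losses, 1.0
    k = min(wins, losses)
    # Probability of k or fewer successes under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)


# Placeholder per-subject effort for plain SFL and the augmented ranking.
baseline = [12, 7, 30, 4, 18, 9, 25, 6]
augmented = [5, 7, 11, 3, 10, 8, 9, 8]
result = sign_test(baseline, augmented)
```

With these placeholder values the augmented ranking wins on six subjects and loses on one, which gives a p-value of 0.125. A replication showing no more wins than losses would count against the improvement claim.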

read the original abstract

Automated fault localization helps developers find faults in large code bases. Statistical fault localization (SFL) ranks suspicious lines from pass/fail spectra, but line execution alone misses information like data-flow, values, or branch conditions that explain why a failure occurs. This study evaluates whether augmenting SFL with execution features improves localization accuracy and developer-oriented inspection effort. We extract execution features with EFDD for all Tests4Py subjects, train per-subject random forests, map importances to source lines, and combine the resulting weights with established SFL formulas. The evaluation measures reference-patch accuracy, line- and function-level effort, robustness, and feasibility using a confounder-adjusted mixed-effects model, corroborated by paired statistical tests and outcome-neutral quality checks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This study finds execution features from EFDD improve SFL metrics on Tests4Py after mixed-effects adjustment, but the per-subject RF importance mapping risks unmeasured biases the model does not isolate.

The punchline is that this study finds improvements in statistical fault localization when execution features are added through EFDD and per-subject random forests on the Tests4Py benchmark, but the line-mapping step from feature importances risks adding unmeasured subject-specific biases.

The work is new in its use of EFDD features with random forest importance for weighting SFL formulas and in applying a mixed-effects model with confounder adjustment to the evaluation metrics of reference-patch accuracy, effort, robustness, and feasibility. It does a good job laying out a competent experimental design that includes paired statistical tests and quality checks to support the claims.

The soft spots center on the pipeline details. Training one forest per subject and then mapping importances back to source lines could create confounds if the mapping depends on subject-specific alignments, and the mixed-effects model controls for subject and test-suite size but not necessarily for that mapping process itself. The stress-test concern lands because without a description of a fixed mapping rule or an ablation study, it is difficult to attribute gains solely to the execution features. Since the abstract does not include effect sizes or exclusion details, the strength of the evidence depends on the full results.
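The confounder adjustment discussed here can be written, in one plausible form, as a random-intercept model. This is an assumed specification for illustration only; the abstract does not state the paper's model. The symbols are hypothetical: $y_{ij}$ is an outcome such as effort for observation $j$ on subject $i$, $a_{ij}$ indicates whether the augmented ranking was used, $s_i$ is the subject's test-suite size, and $u_i$ is the subject's random intercept.

```latex
y_{ij} = \beta_0 + \beta_1\, a_{ij} + \beta_2\, s_i + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Under this form, $\beta_1$ carries the effect of augmentation. Any artifact of the importance-to-line mapping is absorbed into $\beta_1$ as well, because the model contains no term that separates the mapping from the features themselves.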

This paper is aimed at software engineering researchers focused on debugging and fault localization. A reader who follows SFL literature would find the evaluation protocol and the Tests4Py application valuable for comparison, even if the core idea of richer spectra is not entirely novel.

It deserves serious peer review because the design is thoughtful and the question is relevant to tool builders, though revisions would likely be needed to clarify the mapping and report the numeric outcomes.

Referee Report

1 major / 1 minor

Summary. The paper claims that augmenting statistical fault localization (SFL) with execution features (via EFDD) improves accuracy and reduces developer inspection effort. Per-subject random forests are trained on EFDD features from Tests4Py subjects; feature importances are mapped to source lines and linearly combined with classic SFL formulas. Evaluation uses reference-patch accuracy, line- and function-level effort, robustness, and feasibility, analyzed via a confounder-adjusted mixed-effects model plus paired tests and outcome-neutral checks.

Significance. If the central results hold, the work would supply concrete empirical support for execution-feature augmentation of SFL, with direct implications for tool design. Credit is due for the confounder-adjusted mixed-effects modeling, paired statistical tests, and explicit outcome-neutral quality checks, all of which raise the evidential standard above typical SFL experiments that rely on raw rankings alone.

major comments (1)

[§4] §4 (Mapping step): the per-subject random-forest training followed by importance-to-line mapping is performed after training and is not described as using a fixed, subject-independent rule. Because the mixed-effects model in §5.3 controls only for subject and test-suite size, any subject-specific alignment artifact introduced at the mapping stage remains unisolated and could account for measured gains in reference-patch accuracy or effort metrics. A sensitivity analysis that replaces the learned mapping with a uniform rule would directly test whether the reported improvements are attributable to execution features.
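The suggested sensitivity analysis can be sketched as follows. The idea is to rank the faulty line three ways: with plain SFL scores, with learned per-line weights, and with a uniform weight on the same mapped lines. The blend form, the parameter `alpha`, the helper names, and all numbers are assumptions for illustration, not the paper's procedure.

```python
def blend(sfl, weights, alpha=0.5):
    """Assumed linear blend of SFL scores with per-line weights."""
    return {line: (1 - alpha) * s + alpha * weights.get(line, 0.0)
            for line, s in sfl.items()}


def rank_of(scores, line):
    """1-based worst-case rank of `line` in descending score order."""
    return sum(1 for s in scores.values() if s >= scores[line])


def uniform_weights(learned):
    """Replace learned weights by an equal weight on every mapped line."""
    return {line: 1.0 for line in learned}


# Made-up subject: the fault is on line 11; lines 11 and 13 are mapped.
sfl = {10: 0.8, 11: 0.8, 12: 0.5, 13: 0.8}
learned = {11: 1.0, 13: 0.2}
faulty = 11

ranks = {
    "plain": rank_of(sfl, faulty),
    "learned": rank_of(blend(sfl, learned), faulty),
    "uniform": rank_of(blend(sfl, uniform_weights(learned)), faulty),
}
```

In this toy case the ranks are 3 (plain), 1 (learned), and 2 (uniform). The step from plain to uniform reflects which lines receive any weight at all, while the step from uniform to learned reflects the learned importances. Reporting both steps separately is what would let a reader attribute the gains.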

minor comments (1)

[Abstract] Abstract: the phrase 'outcome-neutral quality checks' is used without definition; a one-sentence gloss or pointer to the relevant subsection would prevent reader uncertainty.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the mapping step. We address the major comment point-by-point below.

Referee: [§4] §4 (Mapping step): the per-subject random-forest training followed by importance-to-line mapping is performed after training and is not described as using a fixed, subject-independent rule. Because the mixed-effects model in §5.3 controls only for subject and test-suite size, any subject-specific alignment artifact introduced at the mapping stage remains unisolated and could account for measured gains in reference-patch accuracy or effort metrics. A sensitivity analysis that replaces the learned mapping with a uniform rule would directly test whether the reported improvements are attributable to execution features.

Authors: We agree that the per-subject random-forest training and subsequent importance-to-line mapping could introduce subject-specific effects that the current mixed-effects model (controlling only for subject and test-suite size) does not fully isolate. The mapping is an intentional component of our approach, as it derives line-level weights directly from the learned relevance of execution features extracted by EFDD. Nevertheless, to directly test whether observed gains in reference-patch accuracy and effort metrics are attributable to the execution features rather than the learned mapping procedure, we will add the suggested sensitivity analysis. We will replace the RF-derived importances with a uniform, subject-independent rule (e.g., equal weighting across mapped lines or a fixed heuristic independent of per-subject training) and re-evaluate all metrics using the same statistical pipeline. Results will be reported in an expanded §5 with updated tables and discussion. This revision will strengthen the causal attribution to execution features.

Revision: yes

Circularity Check

0 steps flagged

Empirical study relies on external benchmarks with no self-referential derivations

The paper describes an empirical pipeline: extracting EFDD features from Tests4Py subjects, training per-subject random forests, mapping importances to lines, combining with SFL formulas, and evaluating via confounder-adjusted mixed-effects models against reference patches. No equations, predictions, or first-principles results reduce to inputs by construction. All measurements use external data and statistical tests independent of the fitted values. The mapping step is a methodological choice whose effects are assessed externally rather than defined into the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on the assumption that EFDD correctly captures execution features and that random-forest importances can be mapped to lines without loss of meaning. No free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: EFDD extraction and random-forest importance mapping produce line-level weights that are commensurable with existing SFL formulas.
Invoked when the paper states it maps importances to source lines and combines them with SFL formulas.

pith-pipeline@v0.9.1-grok · 5646 in / 1261 out tokens · 27040 ms · 2026-06-30T05:12:50.202768+00:00 · methodology

