
arxiv: 2606.30324 · v1 · pith:NX5E6PP2new · submitted 2026-06-29 · 💻 cs.SE

How do Execution Features Improve Statistical Fault Localization? An Empirical Study

Pith reviewed 2026-06-30 05:12 UTC · model grok-4.3

classification 💻 cs.SE
keywords: statistical fault localization · execution features · empirical study · software debugging · random forests · mixed-effects model · Tests4Py

The pith

Augmenting statistical fault localization with execution features improves accuracy and reduces inspection effort.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether statistical fault localization, which ranks lines by pass/fail execution spectra, can be strengthened by adding execution features such as data flow and branch conditions. It uses EFDD to extract these features across Tests4Py subjects, trains per-subject random forests to rank feature importance, maps the importances back to source lines, and blends the resulting weights into existing SFL formulas. The combined rankings are then evaluated on reference-patch accuracy, line- and function-level effort, robustness, and feasibility with a confounder-adjusted mixed-effects model plus paired tests. A reader would care because the added features supply information about why a test fails that plain coverage spectra omit, potentially shortening the search for the faulty line.
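The importance-to-line mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `(feature_name, line_number)` record format, the feature names, and the sum-then-scale aggregation rule are all hypothetical, since the source does not say how EFDD features are keyed to lines.

```python
from collections import defaultdict


def importances_to_line_weights(features, importances):
    """Aggregate per-feature importances into per-line weights.

    features: list of (feature_name, line_number) pairs (hypothetical format).
    importances: dict feature_name -> importance, e.g. from a random forest.
    Returns dict line_number -> weight, scaled so the largest weight is 1.0.
    """
    raw = defaultdict(float)
    for name, line in features:
        # Assumed rule: a line's raw weight is the sum of its features' importances.
        raw[line] += importances.get(name, 0.0)
    top = max(raw.values(), default=0.0)
    if top == 0.0:
        return {line: 0.0 for line in raw}
    return {line: value / top for line, value in raw.items()}


# Made-up features and importances for a three-feature toy subject.
features = [("branch@6:true", 6), ("def-use:x@7", 7), ("branch@7:false", 7)]
importances = {"branch@6:true": 0.1, "def-use:x@7": 0.5, "branch@7:false": 0.3}
weights = importances_to_line_weights(features, importances)
```

Summing importances favors lines that carry many features; taking the maximum per line would be an equally plausible choice, and the two can rank lines differently.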

Core claim

The study shows that execution features extracted with EFDD and weighted by random-forest importances can be mapped to lines and combined with standard SFL formulas to produce rankings that improve reference-patch accuracy, lower line- and function-level effort, increase robustness, and remain feasible under a mixed-effects analysis on the chosen subjects.

What carries the argument

The mapping of per-subject random-forest importances on EFDD execution features to source-line weights that are then added to SFL suspiciousness scores.
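A small sketch of what adding line weights to suspiciousness scores could look like. The additive form and the mixing factor `alpha` are assumptions made for this example; the source only says that the weights are combined with SFL formulas, and all scores below are made up.

```python
def blend_scores(sfl_scores, line_weights, alpha=0.5):
    """Return combined(line) = sfl(line) + alpha * weight(line).

    `alpha` is a hypothetical mixing factor; lines missing from either
    input are treated as having a score or weight of 0.0.
    """
    lines = set(sfl_scores) | set(line_weights)
    return {
        line: sfl_scores.get(line, 0.0) + alpha * line_weights.get(line, 0.0)
        for line in lines
    }


def rank_lines(scores):
    """Sort lines by descending score; ties are broken by line number."""
    return sorted(scores, key=lambda line: (-scores[line], line))


sfl = {5: 0.5, 6: 1.0, 7: 1.0}   # plain SFL ties lines 6 and 7
weights = {6: 0.125, 7: 1.0}     # execution-feature weights per line
combined = blend_scores(sfl, weights, alpha=0.5)
ranking = rank_lines(combined)
```

In this toy input the blend separates the two tied lines and places line 7 first.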

If this is right

  • Reference-patch accuracy rises when execution-feature weights are included.
  • Both line-level and function-level developer effort decrease under the augmented rankings.
  • The gains hold after adjusting for subject-level confounders in a mixed-effects model.
  • The approach remains computationally feasible on the evaluated test subjects.
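To make the effort claims concrete, here is one possible rank-based effort measure. It is a sketch under assumptions: the paper's exact effort definitions are not given in this page, so the expected-position tie handling, the max-per-function aggregation, and the `line_to_function` map are illustrative choices.

```python
def inspection_effort(scores, faulty):
    """Expected number of items inspected until the first faulty one.

    Items are inspected in descending score order; items tied with the
    best-ranked faulty item are assumed to be inspected in random order.
    """
    ranked_faulty = [item for item in faulty if item in scores]
    if not ranked_faulty:
        raise ValueError("no faulty item appears in the ranking")
    best = max(scores[item] for item in ranked_faulty)
    higher = sum(1 for s in scores.values() if s > best)
    tied = sum(1 for s in scores.values() if s == best)
    tied_faulty = sum(1 for item in ranked_faulty if scores[item] == best)
    # Expected position of the first of k marked items among n shuffled items
    # is (n + 1) / (k + 1).
    return higher + (tied + 1) / (tied_faulty + 1)


def function_scores(scores, line_to_function):
    """Lift line scores to functions by taking the maximum per function."""
    out = {}
    for line, score in scores.items():
        fn = line_to_function[line]
        out[fn] = max(score, out.get(fn, score))
    return out


plain = {5: 0.5, 6: 1.0, 7: 1.0}
augmented = {5: 0.5, 6: 1.0625, 7: 1.5}
line_to_function = {5: "parse", 6: "evaluate", 7: "evaluate"}

plain_effort = inspection_effort(plain, {7})
augmented_effort = inspection_effort(augmented, {7})
fn_effort = inspection_effort(function_scores(plain, line_to_function), {"evaluate"})
```

With these made-up scores the line-level effort drops from 1.5 to 1.0 once the tie between lines 6 and 7 is broken, while the function-level effort is already 1.0 because both lines belong to the same function.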

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the mapping works reliably, execution features may supply causal signals that pure spectra lack.
  • The method could be tested on subjects outside Tests4Py without retraining per project.
  • Similar weighting might apply to other spectrum-based techniques beyond the formulas tested here.

Load-bearing premise

The per-subject random-forest models produce importances that can be mapped back to lines and combined with SFL formulas without introducing new biases or needing subject-specific tuning that fails to generalize.

What would settle it

A replication on new subjects where the augmented rankings show no gain in reference-patch accuracy or require equal or greater inspection effort compared with plain SFL would falsify the improvement.
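A replication of this kind needs a paired comparison per subject. The sketch below uses an exact two-sided sign test as a stand-in; the page does not name the paired tests the paper uses, so this choice and the effort values are assumptions for illustration only.

```python
from math import comb


def sign_test_p_value(baseline, augmented):
    """Two-sided exact sign test on paired effort values (lower is better).

    Pairs with equal effort are dropped, as is usual for the sign test.
    """
    diffs = [b - a for b, a in zip(baseline, augmented) if b != a]
    n = len(diffs)
    if n == 0:
        return 1.0
    wins = sum(1 for d in diffs if d > 0)  # subjects where augmentation helped
    k = min(wins, n - wins)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-subject effort values, one pair per subject.
baseline = [10, 8, 15, 7, 12, 9]
augmented = [6, 8, 9, 5, 13, 4]
p_value = sign_test_p_value(baseline, augmented)
```

Here four of five non-tied subjects improve, yet the p-value is 0.375, which shows how few subjects it takes for a no-gain outcome to remain statistically plausible.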

Figures

Figures reproduced from arXiv: 2606.30324 by Andreas Zeller, Marius Smytzek.

Figure 1: Fault Localization. SFL ranks Lines 6 (correct) and 7 (fault) equally.
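The tie in the figure caption follows directly from how spectrum-based formulas work: two lines covered by exactly the same tests receive the same score. The sketch below shows this with the Ochiai formula; the coverage data and test names are made up, and the figure itself may use a different formula.

```python
import math


def ochiai(ef, ep, total_failed):
    """Ochiai suspiciousness: ef / sqrt(total_failed * (ef + ep))."""
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0


def suspiciousness(coverage, outcomes):
    """coverage: test -> set of executed lines; outcomes: test -> True if passed."""
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    lines = set().union(*coverage.values())
    scores = {}
    for line in lines:
        # ef / ep: failing / passing tests that execute this line.
        ef = sum(1 for t, cov in coverage.items() if line in cov and not outcomes[t])
        ep = sum(1 for t, cov in coverage.items() if line in cov and outcomes[t])
        scores[line] = ochiai(ef, ep, total_failed)
    return scores


# Lines 6 and 7 are always executed together, so their spectra are identical.
coverage = {"t1": {5, 6, 7}, "t2": {5, 6, 7}, "t3": {5}}
outcomes = {"t1": False, "t2": True, "t3": True}
scores = suspiciousness(coverage, outcomes)
```

Because coverage alone cannot distinguish lines 6 and 7, any separation between them has to come from additional information such as values or branch outcomes.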
Original abstract

Automated fault localization helps developers find faults in large code bases. Statistical fault localization (SFL) ranks suspicious lines from pass/fail spectra, but line execution alone misses information like data-flow, values, or branch conditions that explain why a failure occurs. This study evaluates whether augmenting SFL with execution features improves localization accuracy and developer-oriented inspection effort. We extract execution features with EFDD for all Tests4Py subjects, train per-subject random forests, map importances to source lines, and combine the resulting weights with established SFL formulas. The evaluation measures reference-patch accuracy, line- and function-level effort, robustness, and feasibility using a confounder-adjusted mixed-effects model, corroborated by paired statistical tests and outcome-neutral quality checks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that augmenting statistical fault localization (SFL) with execution features (via EFDD) improves accuracy and reduces developer inspection effort. Per-subject random forests are trained on EFDD features from Tests4Py subjects; feature importances are mapped to source lines and linearly combined with classic SFL formulas. Evaluation uses reference-patch accuracy, line- and function-level effort, robustness, and feasibility, analyzed via a confounder-adjusted mixed-effects model plus paired tests and outcome-neutral checks.

Significance. If the central results hold, the work would supply concrete empirical support for execution-feature augmentation of SFL, with direct implications for tool design. Credit is due for the confounder-adjusted mixed-effects modeling, paired statistical tests, and explicit outcome-neutral quality checks, all of which raise the evidential standard above typical SFL experiments that rely on raw rankings alone.

major comments (1)
  1. [§4] Mapping step: the importance-to-line mapping is applied after per-subject random-forest training and is not described as following a fixed, subject-independent rule. Because the mixed-effects model in §5.3 controls only for subject and test-suite size, any subject-specific alignment artifact introduced at the mapping stage remains unisolated and could account for the measured gains in reference-patch accuracy or effort metrics. A sensitivity analysis that replaces the learned mapping with a uniform rule would directly test whether the reported improvements are attributable to execution features.
minor comments (1)
  1. [Abstract] The phrase 'outcome-neutral quality checks' is used without definition; a one-sentence gloss or pointer to the relevant subsection would prevent reader uncertainty.
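For orientation, a model that controls for subject and test-suite size, as the major comment describes, could take the following form. This is an assumed specification written for illustration; the page does not reproduce the paper's actual model, and every symbol below is hypothetical.

```latex
y_{s,t} = \beta_0 + \beta_1\,\mathrm{aug}_t + \beta_2 \log(n_s) + u_s + \varepsilon_{s,t},
\qquad u_s \sim \mathcal{N}(0, \sigma_u^2),
\quad \varepsilon_{s,t} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{s,t}$ is an outcome such as inspection effort for subject $s$ under technique $t$, $\mathrm{aug}_t$ is 1 for the augmented ranking and 0 for plain SFL, $n_s$ is the test-suite size, and $u_s$ is a random intercept per subject. Under this form, $\beta_1$ is the effect of interest, and a mapping artifact that varies by subject would be absorbed into $\beta_1$ rather than $u_s$ whenever it affects only the augmented condition, which is the concern the referee raises.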

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the mapping step. We address the major comment point-by-point below.

  1. Referee: [§4] Mapping step: the importance-to-line mapping is applied after per-subject random-forest training and is not described as following a fixed, subject-independent rule. Because the mixed-effects model in §5.3 controls only for subject and test-suite size, any subject-specific alignment artifact introduced at the mapping stage remains unisolated and could account for the measured gains in reference-patch accuracy or effort metrics. A sensitivity analysis that replaces the learned mapping with a uniform rule would directly test whether the reported improvements are attributable to execution features.

    Authors: We agree that the per-subject random-forest training and subsequent importance-to-line mapping could introduce subject-specific effects that the current mixed-effects model (controlling only for subject and test-suite size) does not fully isolate. The mapping is an intentional component of our approach, as it derives line-level weights directly from the learned relevance of execution features extracted by EFDD. Nevertheless, to directly test whether observed gains in reference-patch accuracy and effort metrics are attributable to the execution features rather than the learned mapping procedure, we will add the suggested sensitivity analysis. We will replace the RF-derived importances with a uniform, subject-independent rule (e.g., equal weighting across mapped lines or a fixed heuristic independent of per-subject training) and re-evaluate all metrics using the same statistical pipeline. Results will be reported in an expanded §5 with updated tables and discussion. This revision will strengthen the causal attribution to execution features. revision: yes
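The proposed sensitivity analysis can be sketched in a few lines. The equal-weight rule follows the rebuttal's own example; the additive combination, the `alpha` factor, and all scores are assumptions made for this illustration.

```python
def uniform_weights(learned_weights):
    """Give every mapped line the same weight, ignoring learned importances."""
    return {line: 1.0 for line in learned_weights}


def rank_of(line, sfl_scores, weights, alpha=0.5):
    """1-based worst-case rank of `line` after adding alpha * weight to each score."""
    combined = {
        other: score + alpha * weights.get(other, 0.0)
        for other, score in sfl_scores.items()
    }
    # Worst case: every line scoring at least as high is inspected first.
    return sum(1 for s in combined.values() if s >= combined[line])


sfl = {5: 0.5, 6: 1.0, 7: 1.0}   # plain SFL ties lines 6 and 7
learned = {6: 0.125, 7: 1.0}     # hypothetical RF-derived line weights

learned_rank = rank_of(7, sfl, learned)
uniform_rank = rank_of(7, sfl, uniform_weights(learned))
```

In this toy case the faulty line 7 reaches rank 1 only with the learned weights; under the uniform rule the tie with line 6 remains. Running both variants through the same statistical pipeline would show how much of the reported gain survives without the learned importances.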

Circularity Check

0 steps flagged

Empirical study relies on external benchmarks with no self-referential derivations


The paper describes an empirical pipeline: extracting EFDD features from Tests4Py subjects, training per-subject random forests, mapping importances to lines, combining with SFL formulas, and evaluating via confounder-adjusted mixed-effects models against reference patches. No equations, predictions, or first-principles results reduce to inputs by construction. All measurements use external data and statistical tests independent of the fitted values. The mapping step is a methodological choice whose effects are assessed externally rather than defined into the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on the assumption that EFDD correctly captures execution features and that random-forest importances can be mapped to lines without loss of meaning. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: EFDD extraction and random-forest importance mapping produce line-level weights that are commensurable with existing SFL formulas.
    Invoked when the paper states it maps importances to source lines and combines them with SFL formulas.
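One way to see why commensurability matters: some SFL formulas are unbounded, while random-forest importances are small fractions, so a direct sum can be dominated by one side. The sketch below rescales both to [0, 1] with min-max normalization; this is an illustrative remedy, not something the source says the paper does, and the scores are made up.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant input maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


unbounded_sfl = {5: 0.5, 6: 4.0, 7: 4.0}   # e.g. scores from an unbounded formula
raw_importances = {6: 0.02, 7: 0.16}       # hypothetical per-line importances

sfl_scaled = min_max(unbounded_sfl)
weights_scaled = min_max(raw_importances)
```

After rescaling, both inputs span the same range, so neither dominates a sum purely because of its scale.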

pith-pipeline@v0.9.1-grok · 5646 in / 1261 out tokens · 27040 ms · 2026-06-30T05:12:50.202768+00:00 · methodology

