How do Execution Features Improve Statistical Fault Localization? An Empirical Study
Pith reviewed 2026-06-30 05:12 UTC · model grok-4.3
The pith
Augmenting statistical fault localization with execution features improves accuracy and reduces inspection effort.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study shows that execution features extracted with EFDD and weighted by random-forest importances can be mapped to lines and combined with standard SFL formulas to produce rankings that improve reference-patch accuracy, lower line- and function-level effort, increase robustness, and remain feasible under a mixed-effects analysis on the chosen subjects.
What carries the argument
The mapping of per-subject random-forest importances on EFDD execution features to source-line weights that are then added to SFL suspiciousness scores.
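As a rough illustration of that mapping, the Python sketch below spreads each feature's importance over the source lines it refers to and adds the result to an SFL suspiciousness score. The feature names, the even split of importance across lines, and the `alpha` scaling factor are assumptions made for this example; the paper's actual mapping rule is not stated on this page.

```python
from collections import defaultdict

def line_weights(importances, feature_lines):
    """Spread each feature's importance evenly over the lines it refers to.

    importances:   {feature_name: importance}, e.g. from a random forest
    feature_lines: {feature_name: [line numbers]} (hypothetical mapping)
    """
    weights = defaultdict(float)
    for feature, importance in importances.items():
        lines = feature_lines.get(feature, [])
        for line in lines:
            weights[line] += importance / len(lines)
    return dict(weights)

def augment(sfl_scores, weights, alpha=1.0):
    """Add the (scaled) line weight to each SFL suspiciousness score."""
    return {line: score + alpha * weights.get(line, 0.0)
            for line, score in sfl_scores.items()}

def rank(scores):
    """Most suspicious line first; ties broken by line number."""
    return sorted(scores, key=lambda line: (-scores[line], line))

# Toy data, chosen only for illustration.
importances = {"f1": 0.5, "f2": 0.5}
feature_lines = {"f1": [12], "f2": [12, 15]}
sfl_scores = {10: 0.75, 12: 0.5, 15: 0.25}

weights = line_weights(importances, feature_lines)
augmented = augment(sfl_scores, weights)
ranking = rank(augmented)
```

In this toy case line 12 moves from second place to first once the weights are added, which is the kind of shift the augmented rankings rely on.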
If this is right
- Reference-patch accuracy rises when execution-feature weights are included.
- Both line-level and function-level developer effort decrease under the augmented rankings.
- The gains hold after adjusting for subject-level confounders in a mixed-effects model.
- The approach remains computationally feasible on the evaluated test subjects.
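To make the effort claims concrete, here is a minimal Python sketch of one common way to measure inspection effort: the expected number of elements a developer inspects before reaching the first faulty one, with ties resolved at their midpoint. The tie-breaking rule and the max-aggregation from lines to functions are assumptions for illustration and may differ from the definitions used in the paper.

```python
def inspection_effort(scores, faulty):
    """Expected number of elements inspected up to the first faulty one.

    scores: {element: suspiciousness}
    faulty: set of elements known to be faulty
    Tied elements are assumed to be inspected in random order.
    """
    candidates = [scores[e] for e in faulty if e in scores]
    if not candidates:
        raise ValueError("no faulty element appears in the ranking")
    best = max(candidates)
    higher = sum(1 for s in scores.values() if s > best)
    tied = sum(1 for s in scores.values() if s == best)
    return higher + (tied + 1) / 2

def function_scores(line_scores, line_to_function):
    """Lift line scores to functions by taking the maximum per function."""
    result = {}
    for line, score in line_scores.items():
        fn = line_to_function[line]
        result[fn] = max(result.get(fn, float("-inf")), score)
    return result

# Toy rankings for the same three lines, before and after augmentation.
plain = {10: 0.75, 12: 0.5, 15: 0.25}
augmented = {10: 0.75, 12: 1.25, 15: 0.5}
line_to_function = {10: "parse", 12: "resolve", 15: "resolve"}

effort_plain = inspection_effort(plain, {12})
effort_augmented = inspection_effort(augmented, {12})
fn_effort = inspection_effort(function_scores(augmented, line_to_function), {"resolve"})
```

A lower value means fewer elements inspected, so an improvement shows up as `effort_augmented` being smaller than `effort_plain`.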
Where Pith is reading between the lines
- If the mapping works reliably, execution features may supply causal signals that pure spectra lack.
- The method could be tested on subjects outside Tests4Py without retraining per project.
- Similar weighting might apply to other spectrum-based techniques beyond the formulas tested here.
Load-bearing premise
The per-subject random-forest models produce importances that can be mapped back to lines and combined with SFL formulas without introducing new biases or needing subject-specific tuning that fails to generalize.
What would settle it
A replication on new subjects where the augmented rankings show no gain in reference-patch accuracy or require equal or greater inspection effort compared with plain SFL would falsify the improvement.
Original abstract
Automated fault localization helps developers find faults in large code bases. Statistical fault localization (SFL) ranks suspicious lines from pass/fail spectra, but line execution alone misses information like data-flow, values, or branch conditions that explain why a failure occurs. This study evaluates whether augmenting SFL with execution features improves localization accuracy and developer-oriented inspection effort. We extract execution features with EFDD for all Tests4Py subjects, train per-subject random forests, map importances to source lines, and combine the resulting weights with established SFL formulas. The evaluation measures reference-patch accuracy, line- and function-level effort, robustness, and feasibility using a confounder-adjusted mixed-effects model, corroborated by paired statistical tests and outcome-neutral quality checks.
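For readers unfamiliar with how SFL scores lines from pass/fail spectra, the following Python sketch computes two well-known formulas, Tarantula and Ochiai, from per-test line coverage. The abstract only speaks of "established SFL formulas", so the choice of these two is an assumption, and the coverage data is invented for the example.

```python
import math

def spectrum(coverage, passed):
    """Count, per line, how many failing (ef) and passing (ep) tests ran it."""
    counts = {}
    for test, lines in coverage.items():
        for line in lines:
            ef, ep = counts.get(line, (0, 0))
            counts[line] = (ef, ep + 1) if passed[test] else (ef + 1, ep)
    return counts

def tarantula(ef, ep, total_failed, total_passed):
    """Share of failing tests covering the line, relative to both shares."""
    fail_ratio = ef / total_failed if total_failed else 0.0
    pass_ratio = ep / total_passed if total_passed else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0

def ochiai(ef, ep, total_failed):
    """ef / sqrt(total_failed * (ef + ep)), with 0 for an empty denominator."""
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0

# Toy spectrum: three tests, one failing.
coverage = {"t1": {1, 2}, "t2": {1, 3}, "t3": {1}}
passed = {"t1": True, "t2": False, "t3": True}
total_failed = sum(1 for ok in passed.values() if not ok)
total_passed = len(passed) - total_failed

counts = spectrum(coverage, passed)
scores = {line: ochiai(ef, ep, total_failed) for line, (ef, ep) in counts.items()}
```

Both formulas see only which lines ran in which tests, which is why values, data flow, and branch conditions have to come from a separate source such as execution features.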
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that augmenting statistical fault localization (SFL) with execution features (via EFDD) improves accuracy and reduces developer inspection effort. Per-subject random forests are trained on EFDD features from Tests4Py subjects; feature importances are mapped to source lines and linearly combined with classic SFL formulas. Evaluation uses reference-patch accuracy, line- and function-level effort, robustness, and feasibility, analyzed via a confounder-adjusted mixed-effects model plus paired tests and outcome-neutral checks.
Significance. If the central results hold, the work would supply concrete empirical support for execution-feature augmentation of SFL, with direct implications for tool design. Credit is due for the confounder-adjusted mixed-effects modeling, paired statistical tests, and explicit outcome-neutral quality checks, all of which raise the evidential standard above typical SFL experiments that rely on raw rankings alone.
major comments (1)
- [§4] §4 (Mapping step): the importance-to-line mapping is performed after per-subject random-forest training and is not described as using a fixed, subject-independent rule. Because the mixed-effects model in §5.3 controls only for subject and test-suite size, any subject-specific alignment artifact introduced at the mapping stage remains unisolated and could account for measured gains in reference-patch accuracy or effort metrics. A sensitivity analysis that replaces the learned mapping with a uniform rule would directly test whether the reported improvements are attributable to execution features.
minor comments (1)
- [Abstract] Abstract: the phrase 'outcome-neutral quality checks' is used without definition; a one-sentence gloss or pointer to the relevant subsection would prevent reader uncertainty.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the mapping step. We address the major comment point-by-point below.
Referee: [§4] §4 (Mapping step): the importance-to-line mapping is performed after per-subject random-forest training and is not described as using a fixed, subject-independent rule. Because the mixed-effects model in §5.3 controls only for subject and test-suite size, any subject-specific alignment artifact introduced at the mapping stage remains unisolated and could account for measured gains in reference-patch accuracy or effort metrics. A sensitivity analysis that replaces the learned mapping with a uniform rule would directly test whether the reported improvements are attributable to execution features.
Authors: We agree that the per-subject random-forest training and subsequent importance-to-line mapping could introduce subject-specific effects that the current mixed-effects model (controlling only for subject and test-suite size) does not fully isolate. The mapping is an intentional component of our approach, as it derives line-level weights directly from the learned relevance of execution features extracted by EFDD. Nevertheless, to directly test whether observed gains in reference-patch accuracy and effort metrics are attributable to the execution features rather than the learned mapping procedure, we will add the suggested sensitivity analysis. We will replace the RF-derived importances with a uniform, subject-independent rule (e.g., equal weighting across mapped lines or a fixed heuristic independent of per-subject training) and re-evaluate all metrics using the same statistical pipeline. Results will be reported in an expanded §5 with updated tables and discussion. This revision will strengthen the causal attribution to execution features.
Revision: yes
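The proposed sensitivity analysis can be sketched in Python as a three-way comparison: plain SFL, SFL plus uniform feature weights, and SFL plus learned weights. Everything here is hypothetical, including the one-line-per-feature mapping, the worst-case rank metric, and the toy numbers; it only illustrates the shape of the check, not the authors' pipeline.

```python
def weights_from(importances, feature_to_line):
    """Sum feature importances per line (one line per feature, by assumption)."""
    weights = {}
    for feature, line in feature_to_line.items():
        weights[line] = weights.get(line, 0.0) + importances.get(feature, 0.0)
    return weights

def worst_case_rank(scores, target):
    """Rank of `target` if every tied element is inspected before it."""
    return sum(1 for s in scores.values() if s >= scores[target])

def compare_mappings(sfl_scores, learned, feature_to_line, faulty_line):
    """Rank the faulty line under plain, uniform, and learned weighting."""
    # Uniform rule: every feature gets the same, subject-independent weight.
    uniform = {f: 1.0 / len(feature_to_line) for f in feature_to_line}
    ranks = {}
    for name, importances in (("plain", {}), ("uniform", uniform), ("learned", learned)):
        weights = weights_from(importances, feature_to_line)
        scores = {line: s + weights.get(line, 0.0) for line, s in sfl_scores.items()}
        ranks[name] = worst_case_rank(scores, faulty_line)
    return ranks

# Toy subject: four features, three lines, line 12 is faulty.
sfl_scores = {10: 0.75, 12: 0.5, 15: 0.25}
feature_to_line = {"f1": 12, "f2": 12, "f3": 10, "f4": 15}
learned = {"f1": 0.5, "f2": 0.25, "f3": 0.125, "f4": 0.125}

ranks = compare_mappings(sfl_scores, learned, feature_to_line, faulty_line=12)
```

If the uniform variant matched the learned one across subjects, the gains would be attributable to the presence of execution features rather than to the per-subject importances; a gap like the one in this toy case would point the other way.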
Circularity Check
Empirical study relies on external benchmarks with no self-referential derivations
The paper describes an empirical pipeline: extracting EFDD features from Tests4Py subjects, training per-subject random forests, mapping importances to lines, combining with SFL formulas, and evaluating via confounder-adjusted mixed-effects models against reference patches. No equations, predictions, or first-principles results reduce to inputs by construction. All measurements use external data and statistical tests independent of the fitted values. The mapping step is a methodological choice whose effects are assessed externally rather than defined into the outcome.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: EFDD extraction and random-forest importance mapping produce line-level weights that are commensurable with existing SFL formulas.