Are We Lost in the Woods? Detecting Silent Semantic Faults for Random Forest Classifiers with Data-informed Static Analysis

Daniel Varro; Kristian Sandahl; Louis Ohl; Willem Meijer

arxiv: 2606.07709 · v1 · pith:QICNPVM7new · submitted 2026-06-05 · 💻 cs.SE

Pith reviewed 2026-06-27 21:22 UTC · model grok-4.3

classification 💻 cs.SE

keywords: static analysis, random forest, semantic faults, machine learning, data-informed analysis, API contracts, ML pipelines, silent faults

The pith

A static analysis technique catches silent semantic faults in random forest scripts before training by checking formalized contracts on pipeline graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a data-informed static analysis method to find semantic faults such as imbalanced datasets in machine learning code that uses random forest classifiers. These faults reduce prediction quality yet produce no obvious errors during development and are usually noticed only after full training runs. The technique converts ML scripts into directed acyclic graphs and evaluates them against contracts that cover structural, data, and hyperparameter problems, relying solely on aggregated data properties rather than the raw dataset. Evaluation on real Kaggle notebooks shows the method flags relevant faults at 91 percent precision with sub-second cost. The study also reports that 12 to 18 percent of such notebooks contain these faults.

Core claim

The authors establish that formalized API contracts for the random forest classifier can be checked through static analysis of extracted pipeline graphs using only aggregated data properties, thereby detecting structural, data, and hyperparameter faults without executing the training step or accessing the original dataset.

What carries the argument

Formalized API contracts evaluated on directed acyclic graphs of ML pipelines using aggregated data properties.
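
To make this concrete, here is a minimal sketch of what a contract over aggregated properties could look like. It is not dille's implementation: the `Aggregates` and `ForestParams` types, the two checks, and the 10:1 imbalance threshold are hypothetical names and values chosen for illustration. Only `class_weight` and `min_samples_split` correspond to real `RandomForestClassifier` parameters.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Aggregates:
    """Summary statistics only; no raw rows are needed (hypothetical type)."""
    n_samples: int
    class_counts: Dict[str, int]  # label -> number of samples


@dataclass(frozen=True)
class ForestParams:
    """Subset of RandomForestClassifier hyperparameters read from the script."""
    class_weight: Optional[object] = None
    min_samples_split: int = 2  # integer form only, for simplicity


# Assumed threshold: majority/minority ratio above which the check warns.
IMBALANCE_RATIO = 10.0


def check_class_balance(agg: Aggregates, params: ForestParams) -> Optional[str]:
    """Data contract: strong imbalance should be paired with class weighting."""
    counts = agg.class_counts.values()
    ratio = max(counts) / max(min(counts), 1)
    if ratio > IMBALANCE_RATIO and params.class_weight is None:
        return f"data: class ratio {ratio:.1f} exceeds {IMBALANCE_RATIO} and class_weight is unset"
    return None


def check_min_samples_split(agg: Aggregates, params: ForestParams) -> Optional[str]:
    """Hyperparameter contract: a split threshold above n_samples blocks every split."""
    if params.min_samples_split > agg.n_samples:
        return "hyperparameter: min_samples_split exceeds n_samples, so no node can split"
    return None


Contract = Callable[[Aggregates, ForestParams], Optional[str]]
CONTRACTS: List[Contract] = [check_class_balance, check_min_samples_split]


def evaluate(agg: Aggregates, params: ForestParams) -> List[str]:
    """Run every contract and collect the violation messages."""
    return [msg for check in CONTRACTS if (msg := check(agg, params)) is not None]


agg = Aggregates(n_samples=1000, class_counts={"no": 950, "yes": 50})
violations = evaluate(agg, ForestParams())
balanced_ok = evaluate(agg, ForestParams(class_weight="balanced"))
```

Neither check trains a model or reads a raw row. Both conditions would pass silently through `fit`, which is the kind of fault the paper targets.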

If this is right

The analysis runs with sub-second overhead and can be added to integrated development environments and continuous integration pipelines.
Between 12 and 18 percent of public notebooks that use random forest contain silent semantic faults.
Fault detection works even when raw training data cannot be shared because of confidentiality rules.
Early detection can avoid wasted compute on training runs whose results would be silently degraded by undetected data or hyperparameter problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same contract-based approach could be extended to other classifier families such as gradient boosting or neural networks.
Widespread adoption might reduce the total number of training cycles needed during model development.
The reported fault rate in public notebooks suggests similar issues exist in private codebases that could benefit from automated checks.
Integration into automated agent workflows would allow scripts to be corrected before any model training begins.

Load-bearing premise

The formalized API contracts accurately capture the structural, data, and hyperparameter faults that matter in practice.

What would settle it

A hand audit of a sample of random forest notebooks that finds many faults the tool misses or flags incorrectly would show the contracts and detection logic do not match real semantic issues.

Figures

Figures reproduced from arXiv: 2606.07709 by Daniel Varro, Kristian Sandahl, Louis Ohl, Willem Meijer.

**Figure 3.** The data collection and filtering process used to create … [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]

**Figure 4.** Relationship between dataset size and speedup in data … [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

Original abstract

While machine learning (ML) software necessitates effective quality assurance, ML engineers still encounter silent semantic faults, such as imbalanced datasets, that degrade prediction performance without apparent symptoms. These faults are typically detected after expensive training cycles, causing significant resource waste. We propose a data-informed static analysis technique to detect silent semantic faults in ML scripts that use the popular random forest classifier. Our approach extracts ML pipelines into directed acyclic graphs and evaluates them against formalized API contracts to detect structural, data, and hyperparameter faults. Our analysis uses aggregated data properties, enabling fault detection even when datasets are inaccessible due to confidentiality restrictions. We implemented this technique in an open-source tool, dille, and evaluated it on real-world Kaggle notebooks that use the random forest classifier. Our results demonstrate that the tool identifies relevant semantic faults with 91% precision and sub-second runtime overhead, making it suitable for integration into integrated development environments, agentic workflows, and continuous integration pipelines. Our empirical study reveals that 12% to 18% of existing ML notebooks that use the random forest classifier are affected by silent semantic faults, highlighting the immediate practical utility of data-informed static analysis in reducing the burden of ML debugging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper builds a static checker for common random forest pipeline faults that runs on aggregated data summaries, ships an open tool, and reports 91% precision plus 12-18% prevalence on Kaggle notebooks, but the ground-truth labeling for those faults is not described in enough detail to judge whether the number holds up.

The core idea is to turn an ML script into a DAG, then check it against contracts that encode structural, data, and hyperparameter rules for random forests, using only summary statistics so the check can run even when the raw data is private. They implemented this in dille and ran it on real Kaggle notebooks.
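
The script-to-DAG step can be sketched with Python's standard `ast` and `graphlib` modules. This is an illustrative approximation, not the extraction dille performs: the example script, the `extract_dag` helper, and the choice to track only top-level assignments to plain names are all assumptions. A variable that is reassigned from itself (for example `X = X.dropna()`) would need versioned node names, which this sketch omits.

```python
import ast
from graphlib import TopologicalSorter

# Hypothetical notebook contents; the script is parsed, never executed.
SCRIPT = """
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv("train.csv")
X = df.drop(columns=["label"])
y = df["label"]
clf = RandomForestClassifier(n_estimators=100)
model = clf.fit(X, y)
"""


def extract_dag(source):
    """Map each assigned variable to the earlier pipeline variables it reads."""
    deps = {}
    for stmt in ast.parse(source).body:
        # Only simple `name = expression` statements become pipeline nodes.
        if not (isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name)):
            continue
        target = stmt.targets[0].id
        reads = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
        # Drop names such as `pd` that are not pipeline nodes themselves.
        deps[target] = {r for r in reads if r in deps}
    return deps


dag = extract_dag(SCRIPT)
# TopologicalSorter takes node -> predecessors and raises CycleError on cycles.
order = list(TopologicalSorter(dag).static_order())
```

Here `dag["model"]` is `{"clf", "X", "y"}`, so a contract attached to the `fit` call can look up which upstream nodes feed it and which aggregated properties apply to them.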

What the work actually delivers is a concrete, open-source implementation plus runtime numbers that are low enough for IDE or CI use. The prevalence finding is also useful as a rough signal that these silent issues are not rare.

The evaluation is the main soft spot. The 91% precision figure depends on how the authors decided which detected faults counted as relevant. The abstract gives no information on whether that decision was based on measured accuracy drops after fixes, author judgment alone, or some other procedure, and there is no ablation showing that the aggregated properties recover the same faults that full-data inspection would find. If either step is weak, the precision number does not yet demonstrate practical value.

The contracts themselves look plausible for the faults they target, but without more on how they were constructed or tested against real bugs, it is hard to know their coverage.

This is the sort of applied software-engineering paper that belongs in a referee process. The idea is grounded enough and the tooling is real enough that reviewers can usefully pressure the evaluation details. I would bring it to a reading group to talk through the contract design and labeling method, but I would not cite it in my own work without seeing stronger validation.

Referee Report

3 major / 1 minor

Summary. The paper presents dille, a data-informed static analysis tool that extracts ML pipelines using random forest classifiers into DAGs and checks them against formalized API contracts for structural, data, and hyperparameter faults. Contracts are evaluated using only aggregated data properties (no raw data access required). On a corpus of real-world Kaggle notebooks the tool reports 91% precision at identifying relevant silent semantic faults, sub-second overhead, and an estimated 12-18% prevalence of such faults in existing notebooks.

Significance. If the empirical claims hold, the work provides a practical, early-detection method for a class of ML faults that are otherwise discovered only after costly training. The open-source implementation and focus on confidentiality-preserving analysis (via aggregates) are concrete strengths that could support IDE and CI integration.

major comments (3)

[Evaluation] Evaluation section: the procedure used to label ground-truth 'relevant' semantic faults in the Kaggle notebooks is not described (e.g., whether faults were validated by measuring accuracy drop after correction, by blinded review, or by author judgment alone). This labeling step is load-bearing for the central 91% precision claim.
[Evaluation] Evaluation section: no statement is made about whether the reported precision was measured on held-out notebooks or whether any post-hoc filtering of results occurred; without this information the external validity of the 91% figure cannot be assessed.
[Approach] Approach section: the sufficiency of aggregated data properties for contract evaluation is asserted but no ablation comparing aggregate-based detection against full-data detection is provided; this directly affects the confidentiality claim.
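
For reference, the standard definitions behind the first two comments are given below. Here TP counts flagged faults that were labeled relevant, FP counts flagged faults that were labeled not relevant, and FN counts relevant faults that were never flagged. These symbols follow common usage and are not taken from the paper.

```latex
\[
\text{precision} = \frac{TP}{TP + FP},
\qquad
\text{recall} = \frac{TP}{TP + FN}
\]
```

Precision is computed only over what the tool flagged, so the 91% figure depends entirely on how each flagged fault was labeled and says nothing about faults the tool missed.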

minor comments (1)

[Evaluation] The abstract states '12% to 18%' prevalence; the corresponding section should clarify whether this range reflects different contract thresholds or different notebook subsets.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation and approach. The comments highlight important aspects of methodological transparency that we will address through revisions to improve clarity without altering the core claims or results.

Referee: [Evaluation] Evaluation section: the procedure used to label ground-truth 'relevant' semantic faults in the Kaggle notebooks is not described (e.g., whether faults were validated by measuring accuracy drop after correction, by blinded review, or by author judgment alone). This labeling step is load-bearing for the central 91% precision claim.

Authors: We agree that the ground-truth labeling procedure requires explicit description, as it underpins the 91% precision result. Labeling was performed via author judgment: each flagged fault was manually reviewed in the context of the notebook to assess whether it violated a contract in a way that could plausibly degrade random forest performance (e.g., severe class imbalance or invalid hyperparameter combinations), drawing on standard ML literature. No accuracy-drop measurements or blinded reviews were conducted. We will add a dedicated subsection in the revised Evaluation section describing the criteria, process, and examples.
Revision: yes

Referee: [Evaluation] Evaluation section: no statement is made about whether the reported precision was measured on held-out notebooks or whether any post-hoc filtering of results occurred; without this information the external validity of the 91% figure cannot be assessed.

Authors: The 91% precision figure was computed over the full corpus of collected Kaggle notebooks with no held-out split and no post-hoc filtering of detections. This choice reflects the study's goal of estimating real-world prevalence rather than training a predictive model. We will revise the Evaluation section to state this explicitly and add a short discussion of external-validity implications and threats to generalizability.
Revision: yes

Referee: [Approach] Approach section: the sufficiency of aggregated data properties for contract evaluation is asserted but no ablation comparing aggregate-based detection against full-data detection is provided; this directly affects the confidentiality claim.

Authors: An empirical ablation against full-data detection is not feasible for the Kaggle corpus, as many notebooks lack raw-data access, which is the exact setting our confidentiality-preserving design targets. The contracts were intentionally limited to properties (means, variances, class counts, etc.) that aggregates can supply. We will expand the Approach section with a justification of why these aggregates suffice for the targeted faults and will note the lack of an ablation study as a limitation for future work.
Revision: partial
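
The aggregates named in this response (means, variances, class counts) can be produced by the data owner and shared in place of the dataset. The sketch below assumes a hypothetical `summarize` helper and made-up toy rows. It shows one plausible shape for such a summary, not the format dille consumes.

```python
from collections import Counter
from statistics import fmean, pvariance


def summarize(rows, label_key):
    """Reduce raw rows to per-feature mean/variance plus class counts."""
    labels = Counter(row[label_key] for row in rows)
    feature_names = [key for key in rows[0] if key != label_key]
    stats = {}
    for name in feature_names:
        values = [float(row[name]) for row in rows]
        stats[name] = {"mean": fmean(values), "variance": pvariance(values)}
    return {"n_samples": len(rows), "class_counts": dict(labels), "features": stats}


# Toy rows invented for illustration; real data would stay with its owner.
rows = [
    {"age": 30, "bmi": 22.0, "label": "no"},
    {"age": 40, "bmi": 22.0, "label": "no"},
    {"age": 50, "bmi": 22.0, "label": "yes"},
]
summary = summarize(rows, "label")

# A zero-variance feature carries no signal and is visible from the summary alone.
constant = [name for name, s in summary["features"].items() if s["variance"] == 0.0]
```

Only `summary` would need to leave the confidential environment. Whether a given fault is recoverable from such a summary is the question the requested ablation would answer.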

Circularity Check

0 steps flagged

No significant circularity; empirical results on external Kaggle notebooks are independent of self-defined inputs.

The paper's derivation consists of formalizing API contracts for random forest pipelines, extracting them as DAGs, and evaluating against those contracts using aggregated data properties. The central claim of 91% precision is obtained by running the resulting tool on external real-world Kaggle notebooks, with no equations, fitted parameters, or self-citations shown to reduce the reported precision or prevalence figures to the authors' own definitions by construction. The evaluation data are external to the paper's own artifacts, which supports a self-contained, non-circular result. The labeling procedure, by contrast, is not described in the material reviewed here (see the referee report), so its independence is assumed rather than shown.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that ML pipelines can be faithfully extracted as DAGs and that API contracts can be formalized to match real semantic faults; no free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption: ML pipelines using random forest can be represented as directed acyclic graphs that capture data flow and API calls.
Stated in the approach description as the basis for extraction and contract checking.
domain assumption: Aggregated data properties are sufficient to evaluate data-related faults without raw dataset access.
Explicitly invoked to enable analysis under confidentiality restrictions.

Reference graph

Works this paper leans on

61 extracted references · 37 canonical work pages

[1]

OECD Publishing, May 2025

OECD/BCG/INSEAD,The Adoption of Artificial Intelligence in Firms: New Evidence for Policymaking. OECD Publishing, May 2025. [Online]. Available: http://dx.doi.org/10.1787/f9ef33c3-en

work page doi:10.1787/f9ef33c3-en 2025
[2]

Analyzing AI adoption in European SMEs: A study of digital capabilities, innovation, and external environment,

M. F. Arroyabe, C. F. Arranz, I. Fernandez De Arroyabe, and J. C. Fernandez de Arroyabe, “Analyzing AI adoption in European SMEs: A study of digital capabilities, innovation, and external environment,” Technology in Society, vol. 79, p. 102733, Dec. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.techsoc. 2024.102733

work page doi:10.1016/j.techsoc 2024
[3]

Maintainability challenges in ML: A systematic literature review,

K. Shivashankar and A. Martini, “Maintainability challenges in ML: A systematic literature review,” in2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, Aug. 2022, p. 60–67. [Online]. Available: http://dx.doi. org/10.1109/SEAA56994.2022.00018

work page doi:10.1109/seaa56994.2022.00018 2022
[4]

Santhanam,Quality Management of Machine Learning Systems

P. Santhanam,Quality Management of Machine Learning Systems. Springer International Publishing, 2020, p. 1–13. [Online]. Available: http://dx.doi.org/10.1007/ 978-3-030-62144-5 1

2020
[5]

Characterizing technical debt and antipatterns in AI-based systems: A systematic mapping study,

J. Bogner, R. Verdecchia, and I. Gerostathopoulos, “Characterizing technical debt and antipatterns in AI-based systems: A systematic mapping study,” in2021 IEEE/ACM International Conference on Technical Debt (TechDebt). IEEE, May 2021, p. 64–73. [Online]. Available: http://dx.doi.org/10.1109/ TechDebt52882.2021.00016

arXiv 2021
[6]

Quality issues in machine learning software systems,

P.-O. C ˆot´e, A. Nikanjam, R. Bouchoucha, I. Basta, M. Abidi, and F. Khomh, “Quality issues in machine learning software systems,”Empirical Software Engineering, vol. 29, no. 6, 2024. [Online]. Available: http://dx.doi.org/10.1007/s10664-024-10536-7

work page doi:10.1007/s10664-024-10536-7 2024
[7]

Software engineering practices for machine learning — adoption, effects, and team assessment,

A. Serban, K. van der Blom, H. Hoos, and J. Visser, “Software engineering practices for machine learning — adoption, effects, and team assessment,”Journal of Systems and Software, vol. 209, p. 111907, Mar. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2023. 111907

work page doi:10.1016/j.jss.2023 2024
[8]

Data collection and quality challenges in deep learning: a data- centric AI perspective,

S. E. Whang, Y . Roh, H. Song, and J.-G. Lee, “Data collection and quality challenges in deep learning: a data- centric AI perspective,”The VLDB Journal, vol. 32, no. 4, p. 791–813, Jan. 2023. [Online]. Available: http://dx.doi.org/10.1007/s00778-022-00775-9

work page doi:10.1007/s00778-022-00775-9 2023
[9]

Opportunities and challenges in data- centric AI,

S. Kumar, S. Datta, V . Singh, S. K. Singh, and R. Sharma, “Opportunities and challenges in data- centric AI,”IEEE Access, vol. 12, p. 33173–33189,
[10]

Available: http://dx.doi.org/10.1109/ ACCESS.2024.3369417

[Online]. Available: http://dx.doi.org/10.1109/ ACCESS.2024.3369417

arXiv 2024
[11]

Testing machine learning and deep learning systems: Achievements and challenges,

S. Albelali and M. Ahmed, “Testing machine learning and deep learning systems: Achievements and challenges,”Arabian Journal for Science and Engineering, vol. 50, no. 15, p. 11433–11484,
[12]

Available: http://dx.doi.org/10.1007/ s13369-025-10276-w

[Online]. Available: http://dx.doi.org/10.1007/ s13369-025-10276-w
[13]

Architecting ML-enabled systems: Challenges, best practices, and design decisions,

R. Nazir, A. Bucaioni, and P. Pelliccione, “Architecting ML-enabled systems: Challenges, best practices, and design decisions,”Journal of Systems and Software, vol. 207, p. 111860, Jan. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2023.111860

work page doi:10.1016/j.jss.2023.111860 2024
[14]

A checklist of quality concerns for architecting ML- intensive systems,

A. Bucaioni, R. Kazman, and P. Pelliccione, “A checklist of quality concerns for architecting ML- intensive systems,”Journal of Systems and Software, vol. 231, p. 112612, Jan. 2026. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2025.112612

work page doi:10.1016/j.jss.2025.112612 2026
[15]

Data-aware static analysis: Improving detection of semantic faults in ma- chine learning code using data characteristics,

W. Meijer, K. Sandahl, and D. Varr ´o, “Data-aware static analysis: Improving detection of semantic faults in ma- chine learning code using data characteristics,” inIn 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE-NIER ’26). ACM, 2026

2026
[16]

Taxonomy of real faults in deep learning systems,

N. Humbatova, G. Jahangirova, G. Bavota, V . Riccio, A. Stocco, and P. Tonella, “Taxonomy of real faults in deep learning systems,” inProceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ser. ICSE ’20. ACM, 2020, p. 1110–1121. [Online]. Available: http://dx.doi.org/10.1145/3377811.3380395

work page doi:10.1145/3377811.3380395 2020
[17]

Bug analysis in Jupyter notebook projects: An empirical study,

T. L. De Santana, P. A. D. M. S. Neto, E. S. De Almeida, and I. Ahmed, “Bug analysis in Jupyter notebook projects: An empirical study,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 4, p. 1–34, Apr. 2024. [Online]. Available: http://dx.doi.org/10.1145/3641539

work page doi:10.1145/3641539 2024
[18]

What kinds of contracts do ML APIs need?

S. S. Khairunnesa, S. Ahmed, S. M. Imtiaz, H. Rajan, and G. T. Leavens, “What kinds of contracts do ML APIs need?”Empirical Software Engineering, vol. 28, no. 6, Oct. 2023. [Online]. Available: http://dx.doi.org/10.1007/s10664-023-10320-z

work page doi:10.1007/s10664-023-10320-z 2023
[19]

Comparative analysis of real issues in open- source machine learning projects,

T. D. Lai, A. Simmons, S. Barnett, J.-G. Schneider, and R. Vasa, “Comparative analysis of real issues in open- source machine learning projects,”Empirical Software Engineering, vol. 29, no. 3, May 2024. [Online]. Avail- able: http://dx.doi.org/10.1007/s10664-024-10467-3

work page doi:10.1007/s10664-024-10467-3 2024
[20]

Bug characterization in machine learning-based systems,

M. M. Morovati, A. Nikanjam, F. Tambon, F. Khomh, and Z. M. Jiang, “Bug characterization in machine learning-based systems,”Empirical Software Engineer- ing, vol. 29, no. 1, Dec. 2023. [Online]. Available: http://dx.doi.org/10.1007/s10664-023-10400-0

work page doi:10.1007/s10664-023-10400-0 2023
[21]

Refty: refinement types for valid deep learning models,

Y . Gao, Z. Li, H. Lin, H. Zhang, M. Wu, and M. Yang, “Refty: refinement types for valid deep learning models,” inProceedings of the 44th International Conference on Software Engineering, ser. ICSE ’22. ACM, May 2022, p. 1843–1855. [Online]. Available: http://dx.doi.org/10.1145/3510003.3510077

work page doi:10.1145/3510003.3510077 2022
[22]

Safe- DS: A domain specific language to make data science safe,

L. Reimann and G. Kniesel-W ¨unsche, “Safe- DS: A domain specific language to make data science safe,” in2023 IEEE/ACM 45th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, May 2023, p. 72–77. [Online]. Available: http://dx.doi.org/10.1109/ICSE-NIER58687.2023.00019

work page doi:10.1109/icse-nier58687.2023.00019 2023
[23]

MLScent: A tool for anti-pattern detection in ML projects,

K. Shivashankar and A. Martini, “MLScent: A tool for anti-pattern detection in ML projects,” in 2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI (CAIN). IEEE, Apr. 2025, p. 150–160. [Online]. Available: http://dx.doi.org/10.1109/CAIN66642.2025.00026

work page doi:10.1109/cain66642.2025.00026 2025
[24]

Design by contract for deep learning APIs,

S. Ahmed, S. M. Imtiaz, S. S. Khairunnesa, B. D. Cruz, and H. Rajan, “Design by contract for deep learning APIs,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’23. ACM, Nov. 2023, p. 94–106. [Online]. Available: http://dx.doi.org/10.1145/3611643.3616247

work page doi:10.1145/3611643.3616247 2023
[25]

2025 40th

A. Turcotte and N. N. Mehta, “The fault in our stats,” in2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, Nov. 2025, p. 2491–2503. [Online]. Available: http: //dx.doi.org/10.1109/ASE63991.2025.00205

work page doi:10.1109/ase63991.2025.00205 2025
[26]

Breiman, Random forests, Mach

L. Breiman, “Random forests,”Machine Learning, vol. 45, no. 1, p. 5–32, Oct. 2001. [Online]. Available: http://dx.doi.org/10.1023/A:1010933404324

work page doi:10.1023/a:1010933404324 2001
[27]

Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology,

E. W. Fox, R. A. Hill, S. G. Leibowitz, A. R. Olsen, D. J. Thornbrugh, and M. H. Weber, “Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology,”Environmental Monitoring and Assessment, vol. 189, no. 7, 2017. [Online]. Available: http://dx.doi.org/10.1007/s10661-017-6025-0

work page doi:10.1007/s10661-017-6025-0 2017
[28]

Random forest and feature importance measures for discriminating the most influential environmental factors in predicting cardiovascular and respiratory diseases,

F. Cappelli, G. Castronuovo, S. Grimaldi, and V . Telesca, “Random forest and feature importance measures for discriminating the most influential environmental factors in predicting cardiovascular and respiratory diseases,” International Journal of Environmental Research and Public Health, vol. 21, no. 7, p. 867, 2024. [Online]. Available: http://dx.doi.o...

work page doi:10.3390/ijerph21070867 2024
[29]

Why do tree-based models still outperform deep learning on typical tabular data?

L. Grinsztajn, E. Oyallon, and G. Varoquaux, “Why do tree-based models still outperform deep learning on typical tabular data?” inAdvances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 507–520. [Online]. Available: https: //proceedings.neurips....

2022
[30]

Replication package for

W. Meijer, K. Sandahl, and D. Varr ´o, “Replication package for ”are we lost in the woods? detecting silent semantic faults for random forest classifiers with data- informed static analysis”,” 2026. [Online]. Available: https://doi.org/10.5281/zenodo.19344519

work page doi:10.5281/zenodo.19344519 2026
[31]

sklearn.ensemble.RandomForestClassifier — scikit-learn 1.8 documentation,

Scikit-learn, “sklearn.ensemble.RandomForestClassifier — scikit-learn 1.8 documentation,” https: //scikit-learn.org/1.8/modules/generated/sklearn. ensemble.RandomForestClassifier.html, 2025

2025
[32]

Hyperparameters and tuning strategies for random forest,

P. Probst, M. N. Wright, and A. Boulesteix, “Hyperparameters and tuning strategies for random forest,”WIREs Data Mining and Knowledge Discovery, vol. 9, no. 3, Jan. 2019. [Online]. Available: http://dx.doi.org/10.1002/widm.1301

work page doi:10.1002/widm.1301 2019
[33]

G. M. Weiss,Mining with Rare Cases. Springer US, 2009, p. 747–757. [Online]. Available: http: //dx.doi.org/10.1007/978-0-387-09823-4 38

work page doi:10.1007/978-0-387-09823-4 2009
[34]

D. A. Cieslak and N. V . Chawla,Learning Decision Trees for Unbalanced Data. Springer Berlin Heidelberg, p. 241–256. [Online]. Available: http://dx.doi.org/10.1007/ 978-3-540-87479-9 34
[35]

Using random forest to learn imbalanced data,

C. Chen, A. Liaw, and L. Breiman, “Using random forest to learn imbalanced data,”University of California, Berkeley, vol. 110, no. 1-12, p. 24, 2004

2004
[36]

The behaviour of random forest permutation- based variable importance measures under predictor correlation,

K. K. Nicodemus, J. D. Malley, C. Strobl, and A. Ziegler, “The behaviour of random forest permutation- based variable importance measures under predictor correlation,”BMC Bioinformatics, vol. 11, no. 1, Feb. 2010. [Online]. Available: http://dx.doi.org/10. 1186/1471-2105-11-110

2010
[37]

Prediction, estimation, and attribution,

B. Efron, “Prediction, estimation, and attribution,” International Statistical Review, vol. 88, no. S1, Dec
[38]

Available: http://dx.doi.org/10.1111/insr

[Online]. Available: http://dx.doi.org/10.1111/insr. 12409

work page doi:10.1111/insr
[39]

sklearn.preprocessing.LabelEncoder — scikit-learn 1.8 documentation,

Scikit-learn, “sklearn.preprocessing.LabelEncoder — scikit-learn 1.8 documentation,” https: //scikit-learn.org/1.8/modules/generated/sklearn. preprocessing.LabelEncoder.html, 2025

2025
[40]

Breiman, J

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone,Classification And Regression Trees. Routledge, Oct. 2017. [Online]. Available: http://dx.doi.org/10.1201/ 9781315139470

2017
[41]

sklearn.tree.DecisionTreeClassifier — scikit-learn 1.8 documentation,

Scikit-learn, “sklearn.tree.DecisionTreeClassifier — scikit-learn 1.8 documentation,” https: //scikit-learn.org/1.8/modules/generated/sklearn.tree. DecisionTreeClassifier.html, 2025

2025
[42]

Kaggle notebook — health insurance prediction 94%,

M. Wiryaseputra, “Kaggle notebook — health insurance prediction 94%,” Kaggle. https://www.kaggle.com/code/ michaelwiryaseputra/health-insurance-prediction-94, 2023

2023
[43]

24 765:2017, 2017

ISO/IEC/IEEE,Systems and Software Engineering – Vo- cabulary, Std. 24 765:2017, 2017

2017
[44]

Scikit-learn: Machine learning in Python,

F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourget al., “Scikit-learn: Machine learning in Python,”the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011

2011
[45]

PYRA: A high-level linter for data science software,

G. Dolcetti, V . Arceri, A. Mensi, E. Zaffanella, C. Urban, and A. Cortesi, “PYRA: A high-level linter for data science software,”Knowledge-Based Systems, vol. 337, p. 115412, Mar. 2026. [Online]. Available: http://dx.doi.org/10.1016/j.knosys.2026.115412

work page doi:10.1016/j.knosys.2026.115412 2026
[46]

Expressing and checking statistical assumptions,

A. Turcotte and Z. Wu, “Expressing and checking statistical assumptions,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, p. 2735–2758,
[47]

Available: http://dx.doi.org/10.1145/ 3729391

[Online]. Available: http://dx.doi.org/10.1145/ 3729391
[48]

Investigating and detecting silent bugs in PyTorch programs,

S. Hong, H. Sun, X. Gao, and S. H. Tan, “Investigating and detecting silent bugs in PyTorch programs,” in2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, Mar. 2024, p. 272–283. [Online]. Available: http: //dx.doi.org/10.1109/SANER60148.2024.00035

work page doi:10.1109/saner60148.2024.00035 2024
[49]

Towards understanding fine-grained programming mistakes and fixing patterns in data science,

W.-H. Chen, J. L. Cheoh, M. Keim, S. Brunswicker, and T. Zhang, “Towards understanding fine-grained programming mistakes and fixing patterns in data science,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, p. 1824–1846, 2025. [Online]. Available: http://dx.doi.org/10.1145/3729352

work page doi:10.1145/3729352 2025
[50]

Why do machine learning notebooks crash? an empirical study on public Python Jupyter notebooks,

Y . Wang, W. Meijer, J. A. H. L ´opez, U. Nilsson, and D. Varr ´o, “Why do machine learning notebooks crash? an empirical study on public Python Jupyter notebooks,”IEEE Transactions on Software Engineering, vol. 51, no. 7, p. 2181–2196, 2025. [Online]. Available: http://dx.doi.org/10.1109/TSE.2025.3574500

work page doi:10.1109/tse.2025.3574500 2025
[51]

An empirical study on TensorFlow program bugs,

Y . Zhang, Y . Chen, S.-C. Cheung, Y . Xiong, and L. Zhang, “An empirical study on TensorFlow program bugs,” inProceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’18. ACM, 2018, p. 129–140. [Online]. Available: http://dx.doi.org/10.1145/3213846. 3213866

work page doi:10.1145/3213846 2018
[52]

A comprehensive study on deep learning bug characteristics,

M. J. Islam, G. Nguyen, R. Pan, and H. Rajan, “A comprehensive study on deep learning bug characteristics,” inProceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’19. ACM, Aug. 2019, p. 510–520. [Online]. Available: http://dx.doi.org/10.1145/3338...

work page doi:10.1145/3338906.3338955 2019
[53]

Empirical review of automated analysis tools on 47,587 ethereum smart contracts,

R. Zhang, W. Xiao, H. Zhang, Y . Liu, H. Lin, and M. Yang, “An empirical study on program failures of deep learning jobs,” inProceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ser. ICSE ’20. ACM, 2020, p. 1159–1170. [Online]. Available: http://dx.doi.org/10.1145/3377811.3380362

work page doi:10.1145/3377811.3380362 2020
[54]

B. Johnson, Y. Song, E. Murphy-Hill, and R. Bowdidge, “Why don’t software developers use static analysis tools to find bugs?” in 2013 35th International Conference on Software Engineering (ICSE). IEEE, May 2013, pp. 672–681. [Online]. Available: http://dx.doi.org/10.1109/ICSE.2013.6606613
[55]

J. Wang, L. Li, and A. Zeller, “Better code, better sharing: on the need of analyzing Jupyter notebooks,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results, ser. ICSE ’20. ACM, 2020, pp. 53–56. [Online]. Available: http://dx.doi.org/10.1145/3377816.3381724
[56]

J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire, “A large-scale study about quality and reproducibility of Jupyter notebooks,” in 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE, May 2019, pp. 507–517. [Online]. Available: http://dx.doi.org/10.1109/MSR.2019.00077
[57]

A. P. Koenzen, N. A. Ernst, and M.-A. D. Storey, “Code duplication and reuse in Jupyter notebooks,” in 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, Aug. 2020, pp. 1–9. [Online]. Available: http://dx.doi.org/10.1109/VL/HCC50065.2020.9127202
[58]

K. Benoit, “Linear regression models with logarithmic transformations,” London School of Economics, London, vol. 22, no. 1, pp. 23–36, 2011
[59]

J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960
[60]

M. Altalhan, A. Algarni, and M. Turki-Hadj Alouane, “Imbalanced data problem in machine learning: A review,” IEEE Access, vol. 13, pp. 13686–13699, 2025. [Online]. Available: http://dx.doi.org/10.1109/ACCESS.2025.3531662
[61]

S. Kapoor and A. Narayanan, “Leakage and the reproducibility crisis in machine-learning-based science,” Patterns, vol. 4, no. 9, p. 100804, 2023. [Online]. Available: http://dx.doi.org/10.1016/j.patter.2023.100804
