pith. sign in

arxiv: 2606.07709 · v1 · pith:QICNPVM7new · submitted 2026-06-05 · 💻 cs.SE

Are We Lost in the Woods? Detecting Silent Semantic Faults for Random Forest Classifiers with Data-informed Static Analysis

Pith reviewed 2026-06-27 21:22 UTC · model grok-4.3

classification 💻 cs.SE
keywords static analysisrandom forestsemantic faultsmachine learningdata-informed analysisAPI contractsML pipelinessilent faults
0
0 comments X

The pith

A static analysis technique catches silent semantic faults in random forest scripts before training by checking formalized contracts on pipeline graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a data-informed static analysis method to find semantic faults such as imbalanced datasets in machine learning code that uses random forest classifiers. These faults reduce prediction quality yet produce no obvious errors during development and are usually noticed only after full training runs. The technique converts ML scripts into directed acyclic graphs and evaluates them against contracts that cover structural, data, and hyperparameter problems, relying solely on aggregated data properties rather than the raw dataset. Evaluation on real Kaggle notebooks shows the method flags relevant faults at 91 percent precision with sub-second cost. The study also reports that 12 to 18 percent of such notebooks contain these faults.

Core claim

The authors establish that formalized API contracts for the random forest classifier can be checked through static analysis of extracted pipeline graphs using only aggregated data properties, thereby detecting structural, data, and hyperparameter faults without executing the training step or accessing the original dataset.

What carries the argument

Formalized API contracts evaluated on directed acyclic graphs of ML pipelines using aggregated data properties.

If this is right

  • The analysis runs with sub-second overhead and can be added to integrated development environments and continuous integration pipelines.
  • Between 12 and 18 percent of public notebooks that use random forest contain silent semantic faults.
  • Fault detection works even when raw training data cannot be shared because of confidentiality rules.
  • Early detection prevents wasted compute on training runs that would otherwise fail due to undetected data or hyperparameter problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contract-based approach could be extended to other classifier families such as gradient boosting or neural networks.
  • Widespread adoption might reduce the total number of training cycles needed during model development.
  • The reported fault rate in public notebooks suggests similar issues exist in private codebases that could benefit from automated checks.
  • Integration into automated agent workflows would allow scripts to be corrected before any model training begins.

Load-bearing premise

The formalized API contracts accurately capture the structural, data, and hyperparameter faults that matter in practice.

What would settle it

A hand audit of a sample of random forest notebooks that finds many faults the tool misses or flags incorrectly would show the contracts and detection logic do not match real semantic issues.

Figures

Figures reproduced from arXiv: 2606.07709 by Daniel Varro, Kristian Sandahl, Louis Ohl, Willem Meijer.

Figure 1
Figure 1. Figure 1: The DAG corresponding to a slice of Listing 1 [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: The data collection and filtering process used to create [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Relationship between dataset size and speedup in data [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

While machine learning (ML) software necessitates effective quality assurance, ML engineers still encounter silent semantic faults, such as imbalanced datasets, that degrade prediction performance without apparent symptoms. These faults are typically detected after expensive training cycles, causing significant resource waste. We propose a data-informed static analysis technique to detect silent semantic faults in ML scripts that use the popular random forest classifier. Our approach extracts ML pipelines into directed acyclic graphs and evaluates them against formalized API contracts to detect structural, data, and hyperparameter faults. Our analysis uses aggregated data properties, enabling fault detection even when datasets are inaccessible due to confidentiality restrictions. We implemented this technique in an open-source tool, dille, and evaluated it on real-world Kaggle notebooks that use the random forest classifier. Our results demonstrate that the tool identifies relevant semantic faults with 91% precision and sub-second runtime overhead, making it suitable for integration into integrated development environments, agentic workflows, and continuous integration pipelines. Our empirical study reveals that 12% to 18% of existing ML notebooks that use the random forest classifier are affected by silent semantic faults, highlighting the immediate practical utility of data-informed static analysis in reducing the burden of ML debugging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents dille, a data-informed static analysis tool that extracts ML pipelines using random forest classifiers into DAGs and checks them against formalized API contracts for structural, data, and hyperparameter faults. Contracts are evaluated using only aggregated data properties (no raw data access required). On a corpus of real-world Kaggle notebooks the tool reports 91% precision at identifying relevant silent semantic faults, sub-second overhead, and an estimated 12-18% prevalence of such faults in existing notebooks.

Significance. If the empirical claims hold, the work provides a practical, early-detection method for a class of ML faults that are otherwise discovered only after costly training. The open-source implementation and focus on confidentiality-preserving analysis (via aggregates) are concrete strengths that could support IDE and CI integration.

major comments (3)
  1. [Evaluation] Evaluation section: the procedure used to label ground-truth 'relevant' semantic faults in the Kaggle notebooks is not described (e.g., whether faults were validated by measuring accuracy drop after correction, by blinded review, or by author judgment alone). This labeling step is load-bearing for the central 91% precision claim.
  2. [Evaluation] Evaluation section: no statement is made about whether the reported precision was measured on held-out notebooks or whether any post-hoc filtering of results occurred; without this information the external validity of the 91% figure cannot be assessed.
  3. [Approach] Approach section: the sufficiency of aggregated data properties for contract evaluation is asserted but no ablation comparing aggregate-based detection against full-data detection is provided; this directly affects the confidentiality claim.
minor comments (1)
  1. [Evaluation] The abstract states '12% to 18%' prevalence; the corresponding section should clarify whether this range reflects different contract thresholds or different notebook subsets.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation and approach. The comments highlight important aspects of methodological transparency that we will address through revisions to improve clarity without altering the core claims or results.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the procedure used to label ground-truth 'relevant' semantic faults in the Kaggle notebooks is not described (e.g., whether faults were validated by measuring accuracy drop after correction, by blinded review, or by author judgment alone). This labeling step is load-bearing for the central 91% precision claim.

    Authors: We agree that the ground-truth labeling procedure requires explicit description, as it underpins the 91% precision result. Labeling was performed via author judgment: each flagged fault was manually reviewed in context of the notebook to assess whether it violated a contract in a way that could plausibly degrade random forest performance (e.g., severe class imbalance or invalid hyperparameter combinations), drawing on standard ML literature. No accuracy-drop measurements or blinded reviews were conducted. We will add a dedicated subsection in the revised Evaluation section describing the criteria, process, and examples. revision: yes

  2. Referee: [Evaluation] Evaluation section: no statement is made about whether the reported precision was measured on held-out notebooks or whether any post-hoc filtering of results occurred; without this information the external validity of the 91% figure cannot be assessed.

    Authors: The 91% precision figure was computed over the full corpus of collected Kaggle notebooks with no held-out split and no post-hoc filtering of detections. This choice reflects the study's goal of estimating real-world prevalence rather than training a predictive model. We will revise the Evaluation section to state this explicitly and add a short discussion of external-validity implications and threats to generalizability. revision: yes

  3. Referee: [Approach] Approach section: the sufficiency of aggregated data properties for contract evaluation is asserted but no ablation comparing aggregate-based detection against full-data detection is provided; this directly affects the confidentiality claim.

    Authors: An empirical ablation against full-data detection is not feasible for the Kaggle corpus, as many notebooks lack raw-data access—the exact setting our confidentiality-preserving design targets. The contracts were intentionally limited to properties (means, variances, class counts, etc.) that aggregates can supply. We will expand the Approach section with a justification of why these aggregates suffice for the targeted faults and will note the lack of an ablation study as a limitation for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical results on external Kaggle notebooks are independent of self-defined inputs.

full rationale

The paper's derivation consists of formalizing API contracts for random forest pipelines, extracting them as DAGs, and evaluating against those contracts using aggregated data properties. The central claim of 91% precision is obtained by running the resulting tool on external real-world Kaggle notebooks, with no equations, fitted parameters, or self-citations shown to reduce the reported precision or prevalence figures to the authors' own definitions by construction. The evaluation data and labeling procedure are external to the paper's own artifacts, satisfying the criteria for a self-contained, non-circular result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that ML pipelines can be faithfully extracted as DAGs and that API contracts can be formalized to match real semantic faults; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption ML pipelines using random forest can be represented as directed acyclic graphs that capture data flow and API calls
    Stated in the approach description as the basis for extraction and contract checking
  • domain assumption Aggregated data properties are sufficient to evaluate data-related faults without raw dataset access
    Explicitly invoked to enable analysis under confidentiality restrictions

pith-pipeline@v0.9.1-grok · 5752 in / 1378 out tokens · 15892 ms · 2026-06-27T21:22:07.567516+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 37 canonical work pages

  1. [1]

    OECD Publishing, May 2025

    OECD/BCG/INSEAD,The Adoption of Artificial Intelligence in Firms: New Evidence for Policymaking. OECD Publishing, May 2025. [Online]. Available: http://dx.doi.org/10.1787/f9ef33c3-en

  2. [2]

    Analyzing AI adoption in European SMEs: A study of digital capabilities, innovation, and external environment,

    M. F. Arroyabe, C. F. Arranz, I. Fernandez De Arroyabe, and J. C. Fernandez de Arroyabe, “Analyzing AI adoption in European SMEs: A study of digital capabilities, innovation, and external environment,” Technology in Society, vol. 79, p. 102733, Dec. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.techsoc. 2024.102733

  3. [3]

    Maintainability challenges in ML: A systematic literature review,

    K. Shivashankar and A. Martini, “Maintainability challenges in ML: A systematic literature review,” in2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, Aug. 2022, p. 60–67. [Online]. Available: http://dx.doi. org/10.1109/SEAA56994.2022.00018

  4. [4]

    Santhanam,Quality Management of Machine Learning Systems

    P. Santhanam,Quality Management of Machine Learning Systems. Springer International Publishing, 2020, p. 1–13. [Online]. Available: http://dx.doi.org/10.1007/ 978-3-030-62144-5 1

  5. [5]

    Characterizing technical debt and antipatterns in AI-based systems: A systematic mapping study,

    J. Bogner, R. Verdecchia, and I. Gerostathopoulos, “Characterizing technical debt and antipatterns in AI-based systems: A systematic mapping study,” in2021 IEEE/ACM International Conference on Technical Debt (TechDebt). IEEE, May 2021, p. 64–73. [Online]. Available: http://dx.doi.org/10.1109/ TechDebt52882.2021.00016

  6. [6]

    Quality issues in machine learning software systems,

    P.-O. C ˆot´e, A. Nikanjam, R. Bouchoucha, I. Basta, M. Abidi, and F. Khomh, “Quality issues in machine learning software systems,”Empirical Software Engineering, vol. 29, no. 6, 2024. [Online]. Available: http://dx.doi.org/10.1007/s10664-024-10536-7

  7. [7]

    Software engineering practices for machine learning — adoption, effects, and team assessment,

    A. Serban, K. van der Blom, H. Hoos, and J. Visser, “Software engineering practices for machine learning — adoption, effects, and team assessment,”Journal of Systems and Software, vol. 209, p. 111907, Mar. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2023. 111907

  8. [8]

    Data collection and quality challenges in deep learning: a data- centric AI perspective,

    S. E. Whang, Y . Roh, H. Song, and J.-G. Lee, “Data collection and quality challenges in deep learning: a data- centric AI perspective,”The VLDB Journal, vol. 32, no. 4, p. 791–813, Jan. 2023. [Online]. Available: http://dx.doi.org/10.1007/s00778-022-00775-9

  9. [9]

    Opportunities and challenges in data- centric AI,

    S. Kumar, S. Datta, V . Singh, S. K. Singh, and R. Sharma, “Opportunities and challenges in data- centric AI,”IEEE Access, vol. 12, p. 33173–33189,

  10. [10]

    Available: http://dx.doi.org/10.1109/ ACCESS.2024.3369417

    [Online]. Available: http://dx.doi.org/10.1109/ ACCESS.2024.3369417

  11. [11]

    Testing machine learning and deep learning systems: Achievements and challenges,

    S. Albelali and M. Ahmed, “Testing machine learning and deep learning systems: Achievements and challenges,”Arabian Journal for Science and Engineering, vol. 50, no. 15, p. 11433–11484,

  12. [12]

    Available: http://dx.doi.org/10.1007/ s13369-025-10276-w

    [Online]. Available: http://dx.doi.org/10.1007/ s13369-025-10276-w

  13. [13]

    Architecting ML-enabled systems: Challenges, best practices, and design decisions,

    R. Nazir, A. Bucaioni, and P. Pelliccione, “Architecting ML-enabled systems: Challenges, best practices, and design decisions,”Journal of Systems and Software, vol. 207, p. 111860, Jan. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2023.111860

  14. [14]

    A checklist of quality concerns for architecting ML- intensive systems,

    A. Bucaioni, R. Kazman, and P. Pelliccione, “A checklist of quality concerns for architecting ML- intensive systems,”Journal of Systems and Software, vol. 231, p. 112612, Jan. 2026. [Online]. Available: http://dx.doi.org/10.1016/j.jss.2025.112612

  15. [15]

    Data-aware static analysis: Improving detection of semantic faults in ma- chine learning code using data characteristics,

    W. Meijer, K. Sandahl, and D. Varr ´o, “Data-aware static analysis: Improving detection of semantic faults in ma- chine learning code using data characteristics,” inIn 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE-NIER ’26). ACM, 2026

  16. [16]

    Taxonomy of real faults in deep learning systems,

    N. Humbatova, G. Jahangirova, G. Bavota, V . Riccio, A. Stocco, and P. Tonella, “Taxonomy of real faults in deep learning systems,” inProceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ser. ICSE ’20. ACM, 2020, p. 1110–1121. [Online]. Available: http://dx.doi.org/10.1145/3377811.3380395

  17. [17]

    Bug analysis in Jupyter notebook projects: An empirical study,

    T. L. De Santana, P. A. D. M. S. Neto, E. S. De Almeida, and I. Ahmed, “Bug analysis in Jupyter notebook projects: An empirical study,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 4, p. 1–34, Apr. 2024. [Online]. Available: http://dx.doi.org/10.1145/3641539

  18. [18]

    What kinds of contracts do ML APIs need?

    S. S. Khairunnesa, S. Ahmed, S. M. Imtiaz, H. Rajan, and G. T. Leavens, “What kinds of contracts do ML APIs need?”Empirical Software Engineering, vol. 28, no. 6, Oct. 2023. [Online]. Available: http://dx.doi.org/10.1007/s10664-023-10320-z

  19. [19]

    Comparative analysis of real issues in open- source machine learning projects,

    T. D. Lai, A. Simmons, S. Barnett, J.-G. Schneider, and R. Vasa, “Comparative analysis of real issues in open- source machine learning projects,”Empirical Software Engineering, vol. 29, no. 3, May 2024. [Online]. Avail- able: http://dx.doi.org/10.1007/s10664-024-10467-3

  20. [20]

    Bug characterization in machine learning-based systems,

    M. M. Morovati, A. Nikanjam, F. Tambon, F. Khomh, and Z. M. Jiang, “Bug characterization in machine learning-based systems,”Empirical Software Engineer- ing, vol. 29, no. 1, Dec. 2023. [Online]. Available: http://dx.doi.org/10.1007/s10664-023-10400-0

  21. [21]

    Refty: refinement types for valid deep learning models,

    Y . Gao, Z. Li, H. Lin, H. Zhang, M. Wu, and M. Yang, “Refty: refinement types for valid deep learning models,” inProceedings of the 44th International Conference on Software Engineering, ser. ICSE ’22. ACM, May 2022, p. 1843–1855. [Online]. Available: http://dx.doi.org/10.1145/3510003.3510077

  22. [22]

    Safe- DS: A domain specific language to make data science safe,

    L. Reimann and G. Kniesel-W ¨unsche, “Safe- DS: A domain specific language to make data science safe,” in2023 IEEE/ACM 45th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, May 2023, p. 72–77. [Online]. Available: http://dx.doi.org/10.1109/ICSE-NIER58687.2023.00019

  23. [23]

    MLScent: A tool for anti-pattern detection in ML projects,

    K. Shivashankar and A. Martini, “MLScent: A tool for anti-pattern detection in ML projects,” in 2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI (CAIN). IEEE, Apr. 2025, p. 150–160. [Online]. Available: http://dx.doi.org/10.1109/CAIN66642.2025.00026

  24. [24]

    Design by contract for deep learning APIs,

    S. Ahmed, S. M. Imtiaz, S. S. Khairunnesa, B. D. Cruz, and H. Rajan, “Design by contract for deep learning APIs,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’23. ACM, Nov. 2023, p. 94–106. [Online]. Available: http://dx.doi.org/10.1145/3611643.3616247

  25. [25]

    2025 40th

    A. Turcotte and N. N. Mehta, “The fault in our stats,” in2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, Nov. 2025, p. 2491–2503. [Online]. Available: http: //dx.doi.org/10.1109/ASE63991.2025.00205

  26. [26]

    Breiman, Random forests, Mach

    L. Breiman, “Random forests,”Machine Learning, vol. 45, no. 1, p. 5–32, Oct. 2001. [Online]. Available: http://dx.doi.org/10.1023/A:1010933404324

  27. [27]

    Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology,

    E. W. Fox, R. A. Hill, S. G. Leibowitz, A. R. Olsen, D. J. Thornbrugh, and M. H. Weber, “Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology,”Environmental Monitoring and Assessment, vol. 189, no. 7, 2017. [Online]. Available: http://dx.doi.org/10.1007/s10661-017-6025-0

  28. [28]

    Random forest and feature importance measures for discriminating the most influential environmental factors in predicting cardiovascular and respiratory diseases,

    F. Cappelli, G. Castronuovo, S. Grimaldi, and V . Telesca, “Random forest and feature importance measures for discriminating the most influential environmental factors in predicting cardiovascular and respiratory diseases,” International Journal of Environmental Research and Public Health, vol. 21, no. 7, p. 867, 2024. [Online]. Available: http://dx.doi.o...

  29. [29]

    Why do tree-based models still outperform deep learning on typical tabular data?

    L. Grinsztajn, E. Oyallon, and G. Varoquaux, “Why do tree-based models still outperform deep learning on typical tabular data?” inAdvances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 507–520. [Online]. Available: https: //proceedings.neurips....

  30. [30]

    Replication package for

    W. Meijer, K. Sandahl, and D. Varr ´o, “Replication package for ”are we lost in the woods? detecting silent semantic faults for random forest classifiers with data- informed static analysis”,” 2026. [Online]. Available: https://doi.org/10.5281/zenodo.19344519

  31. [31]

    sklearn.ensemble.RandomForestClassifier — scikit-learn 1.8 documentation,

    Scikit-learn, “sklearn.ensemble.RandomForestClassifier — scikit-learn 1.8 documentation,” https: //scikit-learn.org/1.8/modules/generated/sklearn. ensemble.RandomForestClassifier.html, 2025

  32. [32]

    Hyperparameters and tuning strategies for random forest,

    P. Probst, M. N. Wright, and A. Boulesteix, “Hyperparameters and tuning strategies for random forest,”WIREs Data Mining and Knowledge Discovery, vol. 9, no. 3, Jan. 2019. [Online]. Available: http://dx.doi.org/10.1002/widm.1301

  33. [33]

    G. M. Weiss,Mining with Rare Cases. Springer US, 2009, p. 747–757. [Online]. Available: http: //dx.doi.org/10.1007/978-0-387-09823-4 38

  34. [34]

    D. A. Cieslak and N. V . Chawla,Learning Decision Trees for Unbalanced Data. Springer Berlin Heidelberg, p. 241–256. [Online]. Available: http://dx.doi.org/10.1007/ 978-3-540-87479-9 34

  35. [35]

    Using random forest to learn imbalanced data,

    C. Chen, A. Liaw, and L. Breiman, “Using random forest to learn imbalanced data,”University of California, Berkeley, vol. 110, no. 1-12, p. 24, 2004

  36. [36]

    The behaviour of random forest permutation- based variable importance measures under predictor correlation,

    K. K. Nicodemus, J. D. Malley, C. Strobl, and A. Ziegler, “The behaviour of random forest permutation- based variable importance measures under predictor correlation,”BMC Bioinformatics, vol. 11, no. 1, Feb. 2010. [Online]. Available: http://dx.doi.org/10. 1186/1471-2105-11-110

  37. [37]

    Prediction, estimation, and attribution,

    B. Efron, “Prediction, estimation, and attribution,” International Statistical Review, vol. 88, no. S1, Dec

  38. [38]

    Available: http://dx.doi.org/10.1111/insr

    [Online]. Available: http://dx.doi.org/10.1111/insr. 12409

  39. [39]

    sklearn.preprocessing.LabelEncoder — scikit-learn 1.8 documentation,

    Scikit-learn, “sklearn.preprocessing.LabelEncoder — scikit-learn 1.8 documentation,” https: //scikit-learn.org/1.8/modules/generated/sklearn. preprocessing.LabelEncoder.html, 2025

  40. [40]

    Breiman, J

    L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone,Classification And Regression Trees. Routledge, Oct. 2017. [Online]. Available: http://dx.doi.org/10.1201/ 9781315139470

  41. [41]

    sklearn.tree.DecisionTreeClassifier — scikit-learn 1.8 documentation,

    Scikit-learn, “sklearn.tree.DecisionTreeClassifier — scikit-learn 1.8 documentation,” https: //scikit-learn.org/1.8/modules/generated/sklearn.tree. DecisionTreeClassifier.html, 2025

  42. [42]

    Kaggle notebook — health insurance prediction 94%,

    M. Wiryaseputra, “Kaggle notebook — health insurance prediction 94%,” Kaggle. https://www.kaggle.com/code/ michaelwiryaseputra/health-insurance-prediction-94, 2023

  43. [43]

    24 765:2017, 2017

    ISO/IEC/IEEE,Systems and Software Engineering – Vo- cabulary, Std. 24 765:2017, 2017

  44. [44]

    Scikit-learn: Machine learning in Python,

    F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourget al., “Scikit-learn: Machine learning in Python,”the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011

  45. [45]

    PYRA: A high-level linter for data science software,

    G. Dolcetti, V . Arceri, A. Mensi, E. Zaffanella, C. Urban, and A. Cortesi, “PYRA: A high-level linter for data science software,”Knowledge-Based Systems, vol. 337, p. 115412, Mar. 2026. [Online]. Available: http://dx.doi.org/10.1016/j.knosys.2026.115412

  46. [46]

    Expressing and checking statistical assumptions,

    A. Turcotte and Z. Wu, “Expressing and checking statistical assumptions,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, p. 2735–2758,

  47. [47]

    Available: http://dx.doi.org/10.1145/ 3729391

    [Online]. Available: http://dx.doi.org/10.1145/ 3729391

  48. [48]

    Investigating and detecting silent bugs in PyTorch programs,

    S. Hong, H. Sun, X. Gao, and S. H. Tan, “Investigating and detecting silent bugs in PyTorch programs,” in2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, Mar. 2024, p. 272–283. [Online]. Available: http: //dx.doi.org/10.1109/SANER60148.2024.00035

  49. [49]

    Towards understanding fine-grained programming mistakes and fixing patterns in data science,

    W.-H. Chen, J. L. Cheoh, M. Keim, S. Brunswicker, and T. Zhang, “Towards understanding fine-grained programming mistakes and fixing patterns in data science,”Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, p. 1824–1846, 2025. [Online]. Available: http://dx.doi.org/10.1145/3729352

  50. [50]

    Why do machine learning notebooks crash? an empirical study on public Python Jupyter notebooks,

    Y . Wang, W. Meijer, J. A. H. L ´opez, U. Nilsson, and D. Varr ´o, “Why do machine learning notebooks crash? an empirical study on public Python Jupyter notebooks,”IEEE Transactions on Software Engineering, vol. 51, no. 7, p. 2181–2196, 2025. [Online]. Available: http://dx.doi.org/10.1109/TSE.2025.3574500

  51. [51]

    An empirical study on TensorFlow program bugs,

    Y . Zhang, Y . Chen, S.-C. Cheung, Y . Xiong, and L. Zhang, “An empirical study on TensorFlow program bugs,” inProceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’18. ACM, 2018, p. 129–140. [Online]. Available: http://dx.doi.org/10.1145/3213846. 3213866

  52. [52]

    A comprehensive study on deep learning bug characteristics,

    M. J. Islam, G. Nguyen, R. Pan, and H. Rajan, “A comprehensive study on deep learning bug characteristics,” inProceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE ’19. ACM, Aug. 2019, p. 510–520. [Online]. Available: http://dx.doi.org/10.1145/3338...

  53. [53]

    Empirical review of automated analysis tools on 47,587 ethereum smart contracts,

    R. Zhang, W. Xiao, H. Zhang, Y . Liu, H. Lin, and M. Yang, “An empirical study on program failures of deep learning jobs,” inProceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ser. ICSE ’20. ACM, 2020, p. 1159–1170. [Online]. Available: http://dx.doi.org/10.1145/3377811.3380362

  54. [54]

    Why don’t software developers use static analysis tools to find bugs?

    B. Johnson, Y . Song, E. Murphy-Hill, and R. Bowdidge, “Why don’t software developers use static analysis tools to find bugs?” in2013 35th International Conference on Software Engineering (ICSE). IEEE, May 2013, p. 672–681. [Online]. Available: http: //dx.doi.org/10.1109/ICSE.2013.6606613

  55. [55]

    Better code, better sharing: on the need of analyzing Jupyter notebooks,

    J. Wang, L. Li, and A. Zeller, “Better code, better sharing: on the need of analyzing Jupyter notebooks,” inProceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results, ser. ICSE ’20. ACM, 2020, p. 53–56. [Online]. Available: http://dx.doi.org/10.1145/3377816. 3381724

  56. [56]

    A large-scale study about quality and reproducibility of Jupyter notebooks,

    J. F. Pimentel, L. Murta, V . Braganholo, and J. Freire, “A large-scale study about quality and reproducibility of Jupyter notebooks,” in2019 IEEE/ACM 16th Inter- national Conference on Mining Software Repositories (MSR). IEEE, May 2019, p. 507–517. [Online]. Available: http://dx.doi.org/10.1109/MSR.2019.00077

  57. [57]

    Code duplication and reuse in Jupyter notebooks,

    A. P. Koenzen, N. A. Ernst, and M.-A. D. Storey, “Code duplication and reuse in Jupyter notebooks,” in2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, Aug. 2020, p. 1–9. [Online]. Available: http://dx.doi.org/10. 1109/VL/HCC50065.2020.9127202

  58. [58]

    Linear regression models with logarithmic transformations,

    K. Benoit, “Linear regression models with logarithmic transformations,”London School of Economics, London, vol. 22, no. 1, pp. 23–36, 2011

  59. [59]

    A coefficient of agreement for nominal scales,

    J. Cohen, “A coefficient of agreement for nominal scales,”Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960

  60. [60]

    Decoding the mystery: How can LLMs turn text into Cypher in complex knowledge graphs?IEEE Access, 13:80981–81001, 2025

    M. Altalhan, A. Algarni, and M. Turki-Hadj Alouane, “Imbalanced data problem in machine learning: A review,”IEEE Access, vol. 13, p. 13686–13699, 2025. [Online]. Available: http://dx.doi.org/10.1109/ACCESS. 2025.3531662

  61. [61]

    Leakage and the reproducibility crisis in machine-learning-based science,

    S. Kapoor and A. Narayanan, “Leakage and the reproducibility crisis in machine-learning-based science,” Patterns, vol. 4, no. 9, p. 100804, 2023. [Online]. Available: http://dx.doi.org/10.1016/j.patter.2023.100804