pith. machine review for the scientific record. sign in

arxiv: 2604.23342 · v1 · submitted 2026-04-25 · 💻 cs.SE

Recognition: unknown

Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts

Authors on Pith no claims yet

Pith reviewed 2026-05-08 08:00 UTC · model grok-4.3

classification 💻 cs.SE
keywords test selection metricsdeep learning testingout-of-distribution scenariosempirical studyfault detectionperformance estimationsoftware quality
0
0 comments X

The pith

Test selection metrics for deep learning show inconsistent performance depending on objectives, OOD shifts, and data modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior evaluations of test selection metrics for deep learning systems have been limited to narrow objectives such as fault detection, mostly image data, and restricted out-of-distribution scenarios. This paper addresses the gap by performing a broad empirical comparison of 15 metrics across three testing objectives, five OOD scenario types, three data modalities, and 13 models in a total of 1,640 scenarios, accompanied by statistical analysis. A sympathetic reader cares because poor metric choice in safety-critical applications like autonomous driving or malware detection can leave unexpected system behaviors untested. The resulting insights help practitioners match metrics to their specific contexts rather than relying on incomplete prior guidance.

Core claim

We conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.

What carries the argument

The multi-objective multi-scenario empirical benchmark that varies testing goals, distribution shifts, modalities, and models to compare the 15 metrics.

If this is right

  • Metric effectiveness for fault detection does not necessarily carry over to performance estimation or retraining guidance.
  • Results obtained on image data may not transfer to text or Android package modalities.
  • Metrics suited to adversarial or corrupted shifts may behave differently under natural or label shifts.
  • Practitioners gain evidence to select metrics matched to their particular objective and expected shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be extended with newer metrics or larger-scale models to test whether current patterns persist.
  • Safety-critical testing pipelines might benefit from objective-specific selection logic rather than a single default metric.
  • The study suggests opportunities to design hybrid metrics that adapt when multiple objectives or shift types are present simultaneously.

Load-bearing premise

The chosen 15 metrics, three objectives, five OOD types, three modalities, and 13 models are sufficiently representative that the resulting rankings and statistical findings generalize to other DL systems and real-world deployment contexts.

What would settle it

A replication study that uses a fresh set of models or additional real-world OOD cases and finds substantially different performance orderings among the metrics for the same objectives.

Figures

Figures reproduced from arXiv: 2604.23342 by Fan Wang, Jacky Keung, Jingyu Zhang, Lei Ma, Yan Xiao, Yihan Liao.

Figure 1
Figure 1. Figure 1: Overview of our study: 1. Test Input Preparation, which prepares a testing set that contains either view at source ↗
Figure 2
Figure 2. Figure 2: Feature visualization of samples in a high-quality cluster (‘best’, Udacity: ResNet-50, UMAP, DBSCAN; view at source ↗
Figure 3
Figure 3. Figure 3: The rankings of each test selection metric across different budgets for each studied criterion (#Mis., view at source ↗
read the original abstract

Deep learning (DL)-based systems can exhibit unexpected behavior when exposed to out-of-distribution (OOD) scenarios, posing serious risks in safety-critical domains such as malware detection and autonomous driving. This underscores the importance of thoroughly testing such systems before deployment. To this end, researchers have proposed a wide range of test selection metrics designed to effectively select inputs. However, prior evaluations of metrics reveal three key limitations: (1) narrow testing objectives, for example, many studies assess metrics only for fault detection, leaving their effectiveness for performance estimation unclear; (2) limited coverage of OOD scenarios, with natural and label shifts are rarely considered; (3) Biased dataset selection, where most work focuses on image data while other modalities remain underexplored. Consequently, a unified benchmark that examines how these metrics perform under multiple testing objectives, diverse OOD scenarios, and different data modalities is still lacking. This leaves practitioners uncertain about which test selection metrics are most suitable for their specific objectives and contexts. To address this gap, we conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that prior evaluations of test selection metrics for deep learning systems suffer from narrow testing objectives, limited OOD scenario coverage, and biased dataset selection. To address this gap, the authors conduct a large-scale empirical study evaluating 15 existing metrics under three testing objectives (fault detection, performance estimation, retraining guidance), five OOD scenarios (corrupted, adversarial, temporal, natural, label shifts), three data modalities (image, text, Android packages), and 13 DL models, for a total of 1,640 experimental scenarios, accompanied by statistical analysis to guide metric selection.

Significance. If the empirical findings hold and generalize, this work would provide substantial practical value by delivering a unified benchmark that clarifies which test selection metrics are effective under varying objectives and distribution shifts in safety-critical DL applications. The scale of 1,640 scenarios across multiple modalities is a notable strength that could support more informed practitioner decisions than narrower prior studies.

major comments (2)
  1. [Abstract] Abstract: The description of the experimental design provides no details on the statistical methods used for analysis, multiple-testing corrections applied, or the precise implementation of the 15 metrics. Without this information, the reliability of the claimed 'comprehensive evaluation and statistical analysis' across 1,640 scenarios cannot be assessed.
  2. [Study design] Study design (as described in the abstract and implied experimental setup): The central claim of offering generalizable insights from a unified benchmark rests on the assumption that the chosen 15 metrics, 13 models, five OOD types, and three modalities are representative. The manuscript does not appear to include sensitivity analyses (e.g., swapping in additional architectures such as Vision Transformers or varying dataset sizes) to verify stability of the metric rankings, which is load-bearing for the generalization of the findings.
minor comments (1)
  1. [Abstract] Abstract: The sentence fragment 'with natural and label shifts are rarely considered' contains a grammatical error and should be rephrased for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the scale and potential practical value of our 1,640-scenario benchmark. We address each major comment below with honest revisions where the manuscript can be strengthened without misrepresenting our work.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The description of the experimental design provides no details on the statistical methods used for analysis, multiple-testing corrections applied, or the precise implementation of the 15 metrics. Without this information, the reliability of the claimed 'comprehensive evaluation and statistical analysis' across 1,640 scenarios cannot be assessed.

    Authors: We agree the abstract's brevity omits these specifics. The full manuscript details the statistical approach in Section 3 (non-parametric tests including Wilcoxon signed-rank with Bonferroni correction for multiple comparisons across objectives and shifts) and metric implementations in Section 4 plus Appendix A (following original papers with our noted adaptations for each modality). In revision we will expand the abstract with a concise clause referencing the statistical framework and directing readers to those sections. revision: yes

  2. Referee: [Study design] Study design (as described in the abstract and implied experimental setup): The central claim of offering generalizable insights from a unified benchmark rests on the assumption that the chosen 15 metrics, 13 models, five OOD types, and three modalities are representative. The manuscript does not appear to include sensitivity analyses (e.g., swapping in additional architectures such as Vision Transformers or varying dataset sizes) to verify stability of the metric rankings, which is load-bearing for the generalization of the findings.

    Authors: Our 13 models were selected for diversity across modalities and architectures (CNNs, RNNs, and transformers where available in each domain) to reflect common practice; the five OOD types and 15 metrics likewise follow prevalence in prior work. The manuscript does not contain explicit sensitivity analyses such as adding Vision Transformers or dataset-size variations. We will add a dedicated paragraph in the Discussion section acknowledging this limitation, discussing potential impacts on ranking stability based on the observed consistency across the existing 1,640 scenarios, and outlining why the current selection supports the reported insights. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with direct experimental outcomes

full rationale

The paper performs an empirical study by selecting 15 metrics, running them on 13 models across 3 objectives, 5 OOD types, and 3 modalities (1640 scenarios total), then reporting statistical rankings. No equations, derivations, fitted parameters, or predictions appear; results are measured outcomes rather than re-statements of inputs. Prior-work citations only motivate the gap and do not load-bear any uniqueness theorem or ansatz. The representativeness concern is a validity issue, not a circular reduction of the claimed findings to the chosen setups by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the representativeness of the selected metrics and scenarios rather than on any fitted parameters or new invented entities.

axioms (2)
  • domain assumption The 15 chosen metrics adequately represent the space of existing test selection approaches in the literature.
    Abstract states that 15 existing metrics were evaluated but does not justify the selection criteria.
  • domain assumption The five listed OOD scenarios and three modalities capture the distribution shifts and data types relevant to safety-critical DL deployment.
    Abstract enumerates the scenarios without providing evidence that they are exhaustive or representative of real-world shifts.

pith-pipeline@v0.9.0 · 5575 in / 1352 out tokens · 48679 ms · 2026-05-08T08:00:17.471297+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

83 extracted references · 67 canonical work pages · 5 internal anchors

  1. [1]

    Zohreh Aghababaeyan, Manel Abdellatif, Lionel Briand, S Ramesh, and Mojtaba Bagherzadeh. 2023. Black-box testing of deep neural networks through test case diversity.IEEE Transactions on Software Engineering49, 5 (2023), 3182–3204. doi:10.1109/TSE.2023.3243522

  2. [2]

    Zohreh Aghababaeyan, Manel Abdellatif, Mahboubeh Dadkhah, and Lionel Briand. 2024. Deepgd: A multi-objective black-box test selection approach for deep neural networks.ACM Transactions on Software Engineering and Methodology 33, 6 (2024), 1–29. doi:10.1145/3644388

  3. [3]

    Kevin Allix, Tegawendé F Bissyandé, Jacques Klein, and Yves Le Traon. 2016. Androzoo: Collecting millions of Android apps for the research community. InProceedings of the 13th international conference on mining software repositories. 468–471. doi:10.1145/2901739.2903508

  4. [4]

    Kylie Anglin, Qing Liu, and Vivian C Wong. 2024. A primer on the validity typology and threats to validity in education research.Asia Pacific Education Review25, 3 (2024), 557–574. doi:10.1007/s12564-024-09955-4

  5. [5]

    Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. 2014. Drebin: Effective and explainable detection of android malware in your pocket.. InNdss, Vol. 14. 23–26. doi:10.14722/ndss.2014. 23247

  6. [6]

    Mohammed Attaoui, Hazem Fahmy, Fabrizio Pastore, and Lionel Briand. 2023. Black-box safety analysis and retraining of DNNs based on feature extraction and clustering.ACM Transactions on Software Engineering and Methodology32, 3 (2023), 1–40. doi:10.1145/3550271

  7. [7]

    Mohammed Oualid Attaoui, Hazem Fahmy, Fabrizio Pastore, and Lionel Briand. 2024. Supporting safety analysis of image-processing dnns through clustering-based approaches.ACM Transactions on Software Engineering and Methodology33, 5 (2024), 1–48. doi:10.1145/3643671

  8. [8]

    David Berend, Xiaofei Xie, Lei Ma, Lingjun Zhou, Yang Liu, Chi Xu, and Jianjun Zhao. 2020. Cats are not fish: Deep learning testing calls for out-of-distribution awareness. InProceedings of the 35th IEEE/ACM international conference on automated software engineering. 1041–1052. doi:10.1145/3324884.3416609

  9. [9]

    Ella Bingham and Heikki Mannila. 2001. Random projection in dimensionality reduction: applications to image and text data. InProceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. 245–250. doi:10.1145/502512.502546

  10. [10]

    Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to end learning for self-driving cars.arXiv preprint arXiv:1604.07316(2016). https://arxiv.org/abs/1604.07316

  11. [11]

    Taejoon Byun, Vaibhav Sharma, Abhishek Vijayakumar, Sanjai Rayadurgam, and Darren Cofer. 2019. Input prioritization for testing neural networks. In2019 IEEE International Conference On Artificial Intelligence Testing (AITest). IEEE, 63–70. doi:10.1109/AITest.2019.000-6

  12. [12]

    Junjie Chen, Zhuo Wu, Zan Wang, Hanmo You, Lingming Zhang, and Ming Yan. 2020. Practical accuracy estimation for efficient deep neural network testing.ACM Transactions on Software Engineering and Methodology (TOSEM)29, 4 (2020), 1–35. doi:10.1145/3394112

  13. [13]

    Lingjiao Chen, Matei Zaharia, and James Y Zou. 2022. Estimating and explaining model performance when both covariates and labels shift.Advances in Neural Information Processing Systems35 (2022), 11467–11479. https: //dl.acm.org/doi/10.5555/3600270.3601103

  14. [14]

    Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators.arXiv preprint arXiv:2003.10555(2020). https://arxiv.org/abs/2003.10555

  15. [15]

    Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. 2017. EMNIST: Extending MNIST to hand- written letters.2017 International Joint Conference on Neural Networks (IJCNN)(2017). doi:10.1109/ijcnn.2017.7966217

  16. [16]

    Demet Demir, Aysu Betin Can, and Elif Surer. 2024. Test selection for deep neural networks using meta-models with uncertainty metrics. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 678–690. doi:10.1145/3650212.3680312

  17. [17]

    K. P. Devakumar. 2020. IMDB Review Classification LSTM, GRU, CNN, GloVe. https://www.kaggle.com/code/imdevskp/imdb-review-classification-lstm-gru-cnn-glove/

  18. [18]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186. doi:10.18653...

  19. [19]

    Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Inkdd, Vol. 96. 226–231. https://dl.acm.org/doi/10.5555/3001460.3001507

  20. [20]

    Yang Feng, Qingkai Shi, Xinyu Gao, Jun Wan, Chunrong Fang, and Zhenyu Chen. 2020. Deepgini: prioritizing massive tests to enhance the robustness of deep neural networks. InProceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis. 177–188. doi:10.1145/3395363.3397357 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE058...

  21. [21]

    Xinyu Gao, Yang Feng, Yining Yin, Zixi Liu, Zhenyu Chen, and Baowen Xu. 2022. Adaptive test selection for deep neural networks. InProceedings of the 44th international conference on software engineering. 73–85. doi:10.1145/3510003.3510232

  22. [22]

    Jacob Gildenblat. 2016. Visualizations for understanding the regressed wheel steering angle for self-driving cars. https://github.com/jacobgil/keras-steering-angle-visualizations

  23. [23]

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572(2014). https://doi.org/10.48550/arXiv.1412.6572

  24. [24]

    Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. 2017. Adversarial examples for malware detection. InComputer Security–ESORICS 2017: 22nd European Symposium on Research in Computer Security, Oslo, Norway, September 11-15, 2017, Proceedings, Part II 22. Springer, 62–79. doi:10.1007/978-3-319- 66399-9_4

  25. [26]

    Antonio Guerriero, Roberto Pietrantuono, and Stefano Russo. 2024. DeepSample: DNN sampling-based testing for operational accuracy assessment. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12. doi:10.1109/ICSE43902.2021.00042

  26. [27]

    Chris Gundling. 2016. Steering angle model: Cg32.https://github.com/udacity/self-driving-car/tree/master/steering- models/community-models/cg23

  27. [28]

    Yao Hao, Zhiqiu Huang, Hongjing Guo, and Guohua Shen. 2023. Test input selection for deep neural network enhancement based on multiple-objective optimization. In2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 534–545. doi:10.1109/SANER56733.2023.00056

  28. [29]

    Fabrice Harel-Canada, Lingxiao Wang, Muhammad Ali Gulzar, Quanquan Gu, and Miryung Kim. 2020. Is neuron coverage a meaningful measure for testing deep neural networks?. InProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 851–862. doi:10.1145/3368089.3409754

  29. [30]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778. doi:10.1109/CVPR.2016.90

  30. [31]

    Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks.arXiv preprint arXiv:1610.02136(2016). doi:10.48550/arXiv.1610.02136

  31. [32]

    Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. InProceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 168–177. doi:10.1145/1014052.1014073

  32. [33]

    Qiang Hu, Yuejun Guo, Maxime Cordy, Xiaofei Xie, Lei Ma, Mike Papadakis, and Yves Le Traon. 2022. An empirical study on data distribution-aware test selection for deep learning enhancement.ACM Transactions on Software Engineering and Methodology (TOSEM)31, 4 (2022), 1–30. doi:10.1145/3511598

  33. [34]

    Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Lei Ma, Mike Papadakis, and Yves Le Traon. 2024. Test optimization in dnn testing: A survey.ACM Transactions on Software Engineering and Methodology33, 4 (2024), 1–42. doi:10.1145/ 3643678

  34. [35]

    Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Wei Ma, Mike Papadakis, Lei Ma, and Yves Le Traon. 2025. Assessing the Robustness of Test Selection Methods for Deep Neural Networks.ACM Transactions on Software Engineering and Methodology(2025). doi:10.1145/3715693

  35. [36]

    Shachar Kaufman, Saharon Rosset, Claudia Perlich, and Ori Stitelman. 2012. Leakage in data mining: Formulation, detection, and avoidance.ACM Transactions on Knowledge Discovery from Data (TKDD)6, 4 (2012), 1–21. doi:10.1145/ 2382577.2382579

  36. [37]

    Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding deep learning system testing using surprise adequacy. In2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 1039–1049. doi:10.1109/ICSE.2019.00108

  37. [38]

    Jinhan Kim, Jeongil Ju, Robert Feldt, and Shin Yoo. 2020. Reducing dnn labelling cost using surprise adequacy: An industrial case study for autonomous driving. InProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1466–1476. doi:10.1145/3368089. 3417065

  38. [39]

    Ronald S King. 2013. Cluster analysis and data mining: An introduction. (2013). https://dl.acm.org/doi/abs/10.5555/ 2823861

  39. [40]

    Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. 2018. Adversarial examples in the physical world. InArtificial intelligence safety and security. Chapman and Hall/CRC, 99–112. doi:10.1201/9781351251389-8

  40. [41]

    Yann LeCun. 1998. The MNIST database of handwritten digits.http://yann. lecun. com/exdb/mnist/(1998). doi:10. 24432/C53K8Q

  41. [42]

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition.Proc. IEEE86, 11 (1998), 2278–2324. doi:10.1109/5.726791 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE058. Publication date: July 2026. Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribut...

  42. [43]

    Deqiang Li and Qianmu Li. 2020. Adversarial deep ensemble: Evasion attacks and defenses for malware detection. IEEE Transactions on Information Forensics and Security15 (2020), 3886–3900. doi:10.1109/TIFS.2020.3003571

  43. [44]

    Deqiang Li, Tian Qiu, Shuo Chen, Qianmu Li, and Shouhuai Xu. 2021. Can we leverage predictive uncertainty to detect dataset shift and adversarial examples in android malware detection?. InProceedings of the 37th Annual Computer Security Applications Conference. 596–608. doi:10.1145/3485832.3485916

  44. [45]

    Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, and Tegawendé F Bissyandé. 2024. Prioritizing test cases for deep learning-based video classifiers.Empirical Software Engineering29, 5 (2024), 111. doi:10.1007/s10664-024-10520-1

  45. [46]

    Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, Yves Le Traon, and Tegawende F Bissyande. 2024. Test input prioritization for 3d point clouds.ACM Transactions on Software Engineering and Methodology33, 5 (2024), 1–44. doi:10.1145/3643676

  46. [47]

    Zenan Li, Xiaoxing Ma, Chang Xu, Chun Cao, Jingwei Xu, and Jian Lü. 2019. Boosting operational dnn testing efficiency through conditioning. InProceedings of the 2019 27th ACM Joint meeting on European software engineering conference and symposium on the foundations of software engineering. 499–509. doi:10.1145/3338906.3338930

  47. [48]

    Jingjing Liang, Sebastian Elbaum, and Gregg Rothermel. 2018. Redefining prioritization: continuous prioritization for continuous integration. InProceedings of the 40th International Conference on Software Engineering. 688–698. doi:10.1145/3180155.3180213

  48. [49]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692 (2019). https://doi.org/10.48550/arXiv.1907.11692

  49. [50]

    Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, et al

  50. [51]

    InProceedings of the 33rd ACM/IEEE international conference on automated software engineering

    Deepgauge: Multi-granularity testing criteria for deep learning systems. InProceedings of the 33rd ACM/IEEE international conference on automated software engineering. 120–131. doi:10.1145/3238147.3238202

  51. [52]

    Wei Ma, Mike Papadakis, Anestis Tsakmalis, Maxime Cordy, and Yves Le Traon. 2021. Test selection for deep learning systems.ACM Transactions on Software Engineering and Methodology (TOSEM)30, 2 (2021), 1–22. doi:10.1145/3417330

  52. [53]

    Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. InProceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 142–150. https://dl.acm.org/doi/10.5555/2002472.2002491

  53. [54]

    James MacQueen. 1967. Some methods for classification and analysis of multivariate observations. InProceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, Vol. 5. University of California press, 281–298. https://api.semanticscholar.org/CorpusID:6278891

  54. [55]

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083(2017). https://doi.org/10.48550/arXiv. 1706.06083

  55. [56]

    Leland McInnes, John Healy, Steve Astels, et al. 2017. hdbscan: Hierarchical density based clustering.J. Open Source Softw.2, 11 (2017), 205. doi:10.21105/joss.00205

  56. [57]

    Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426(2018). doi:10.21105/joss.00861

  57. [58]

    Niall McLaughlin, Jesus Martinez del Rincon, BooJoong Kang, Suleiman Yerima, Paul Miller, Sakir Sezer, Yeganeh Safaei, Erik Trickel, Ziming Zhao, Adam Doupé, et al. 2017. Deep android malware detection. InProceedings of the seventh ACM on conference on data and application security and privacy. 301–308. doi:10.1145/3029806.3029823

  58. [59]

    Ecker, Matthias Bethge, and Wieland Brendel

    Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S. Ecker, Matthias Bethge, and Wieland Brendel. 2019. Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming.arXiv preprint arXiv:1907.07484(2019). https://doi.org/10.48550/arXiv.1907.07484

  59. [60]

    Jose G Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and Francisco Herrera. 2012. A unifying view on dataset shift in classification.Pattern recognition45, 1 (2012), 521–530. doi:10.1016/j.patcog.2011.06.019

  60. [61]

    Vasilii Mosin, Miroslaw Staron, Darko Durisic, Francisco Gomes de Oliveira Neto, Sushant Kumar Pandey, and Ashok Chaitanya Koppisetty. 2022. Comparing input prioritization techniques for testing deep learning algorithms. In 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 76–83. doi:10.1109/ SEAA56994.2022.00020

  61. [62]

    Davoud Moulavi, Pablo A Jaskowiak, Ricardo JGB Campello, Arthur Zimek, and Jörg Sander. 2014. Density-based clustering validation. InProceedings of the 2014 SIAM international conference on data mining. SIAM, 839–847. doi:10. 1137/1.9781611973440.96

  62. [63]

    Norman Mu and Justin Gilmer. 2019. Mnist-c: A robustness benchmark for computer vision.arXiv preprint arXiv:1906.02337(2019). https://doi.org/10.48550/arXiv.1906.02337

  63. [64]

    Apoorv Nandan. 2020. Text classification with Transformer.https://keras.io/examples/nlp/text_classification_with_ transformer/. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE058. Publication date: July 2026. FSE058:24 Jingyu Zhang, Fan Wang, Jacky Keung, Yihan Liao, Yan Xiao, and Lei Ma

  64. [65]

    Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space.The London, Edinburgh, and Dublin philosophical magazine and journal of science2, 11 (1901), 559–572. doi:10.1080/14786440109462720

  65. [66]

    Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. Deepxplore: Automated whitebox testing of deep learning systems. Inproceedings of the 26th Symposium on Operating Systems Principles. 1–18. doi:10.1145/3361566

  66. [67]

    Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. InProceedings of the 57th annual meeting of the association for computational linguistics. 1085–1097. https://api.semanticscholar.org/CorpusID:196202909

  67. [68]

    Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.Journal of computational and applied mathematics20 (1987), 53–65. doi:10.1016/0377-0427(87)90125-7

  68. [69]

    Alexandru Constantin Serban, Erik Poll, and Joost Visser. 2018. Adversarial examples-a complete characterisation of the phenomenon.arXiv preprint arXiv:1810.01185(2018). doi:10.48550/arXiv.1810.01185

  69. [70]

    Weijun Shen, Yanhui Li, Lin Chen, Yuanlei Han, Yuming Zhou, and Baowen Xu. 2020. Multiple-boundary clustering and prioritization to promote neural network retraining. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 410–422. doi:10.1145/3324884.3416621

  70. [71]

    Ying Shi, Beibei Yin, Zheng Zheng, and Tiancheng Li. 2021. An empirical study on test case prioritization metrics for deep neural networks. In2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS). IEEE, 157–166. doi:10.1109/QRS54544.2021.00027

  71. [72]

    Alex Staravoitau. 2016. Behavioral cloning: end-to-end learning for self-driving cars. https://github.com/navoshta/behavioral-cloning

  72. [73]

    Sully-Chen. 2016. Autopilot-Tensorflow.https://github.com/SullyChen/Autopilot-TensorFlow

  73. [74]

    Weifeng Sun, Meng Yan, Zhongxin Liu, and David Lo. 2023. Robust test selection for deep neural networks.IEEE Transactions on Software Engineering49, 12 (2023), 5250–5278. doi:10.1109/TSE.2023.3330982

  74. [75]

    Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E Hassan, and Kenichi Matsumoto. 2016. An empirical comparison of model validation techniques for defect prediction models.IEEE Transactions on Software Engineering43, 1 (2016), 1–18. doi:10.1109/TSE.2016.2584050

  75. [76]

    TextAttack. [n. d.]. Transforms an input by replacing any word with ’banana’.https://textattack.readthedocs.io/en/latest/

  76. [77]

    Udacity. 2016. Using Deep Learning to Predict Steering Angles. https://medium.com/udacity/challenge-2-using-deep- learning-to-predict-steering-angles-f42004a36ff3

  77. [78]

    Dan Wang and Yi Shang. 2014. A new active labeling method for deep learning. In2014 International joint conference on neural networks (IJCNN). IEEE, 112–119. doi:10.1109/IJCNN.2014.6889457

  78. [79]

    Zhiyu Wang, Sihan Xu, Lingling Fan, Xiangrui Cai, Linyu Li, and Zheli Liu. 2024. Can coverage criteria guide failure discovery for image classifiers? an empirical study.ACM Transactions on Software Engineering and Methodology33, 7 (2024), 1–28. doi:10.1145/3672446

  79. [80]

    Michael Weiss and Paolo Tonella. 2022. Simple techniques work surprisingly well for neural network test prioritization and active learning (replicability study). InProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 139–150. doi:10.1145/3533767.3534375

  80. [81]

    Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? Advances in neural information processing systems27 (2014). https://api.semanticscholar.org/CorpusID:362467

Showing first 80 references.