arxiv: 2604.23342 · v1 · submitted 2026-04-25 · 💻 cs.SE

Recognition: unknown

Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts

Jingyu Zhang , Fan Wang , Jacky Keung , Yihan Liao , Yan Xiao , Lei Ma

Authors on Pith no claims yet

Pith reviewed 2026-05-08 08:00 UTC · model grok-4.3

classification 💻 cs.SE

keywords test selection metricsdeep learning testingout-of-distribution scenariosempirical studyfault detectionperformance estimationsoftware quality

0 comments

The pith

Test selection metrics for deep learning show inconsistent performance depending on objectives, OOD shifts, and data modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior evaluations of test selection metrics for deep learning systems have been limited to narrow objectives such as fault detection, mostly image data, and restricted out-of-distribution scenarios. This paper addresses the gap by performing a broad empirical comparison of 15 metrics across three testing objectives, five OOD scenario types, three data modalities, and 13 models in a total of 1,640 scenarios, accompanied by statistical analysis. A sympathetic reader cares because poor metric choice in safety-critical applications like autonomous driving or malware detection can leave unexpected system behaviors untested. The resulting insights help practitioners match metrics to their specific contexts rather than relying on incomplete prior guidance.

Core claim

We conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.

What carries the argument

The multi-objective multi-scenario empirical benchmark that varies testing goals, distribution shifts, modalities, and models to compare the 15 metrics.

If this is right

Metric effectiveness for fault detection does not necessarily carry over to performance estimation or retraining guidance.
Results obtained on image data may not transfer to text or Android package modalities.
Metrics suited to adversarial or corrupted shifts may behave differently under natural or label shifts.
Practitioners gain evidence to select metrics matched to their particular objective and expected shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended with newer metrics or larger-scale models to test whether current patterns persist.
Safety-critical testing pipelines might benefit from objective-specific selection logic rather than a single default metric.
The study suggests opportunities to design hybrid metrics that adapt when multiple objectives or shift types are present simultaneously.

Load-bearing premise

The chosen 15 metrics, three objectives, five OOD types, three modalities, and 13 models are sufficiently representative that the resulting rankings and statistical findings generalize to other DL systems and real-world deployment contexts.

What would settle it

A replication study that uses a fresh set of models or additional real-world OOD cases and finds substantially different performance orderings among the metrics for the same objectives.

Figures

Figures reproduced from arXiv: 2604.23342 by Fan Wang, Jacky Keung, Jingyu Zhang, Lei Ma, Yan Xiao, Yihan Liao.

**Figure 1.** Figure 1: Overview of our study: 1. Test Input Preparation, which prepares a testing set that contains either view at source ↗

**Figure 2.** Figure 2: Feature visualization of samples in a high-quality cluster (‘best’, Udacity: ResNet-50, UMAP, DBSCAN; view at source ↗

**Figure 3.** Figure 3: The rankings of each test selection metric across different budgets for each studied criterion (#Mis., view at source ↗

read the original abstract

Deep learning (DL)-based systems can exhibit unexpected behavior when exposed to out-of-distribution (OOD) scenarios, posing serious risks in safety-critical domains such as malware detection and autonomous driving. This underscores the importance of thoroughly testing such systems before deployment. To this end, researchers have proposed a wide range of test selection metrics designed to effectively select inputs. However, prior evaluations of metrics reveal three key limitations: (1) narrow testing objectives, for example, many studies assess metrics only for fault detection, leaving their effectiveness for performance estimation unclear; (2) limited coverage of OOD scenarios, with natural and label shifts are rarely considered; (3) Biased dataset selection, where most work focuses on image data while other modalities remain underexplored. Consequently, a unified benchmark that examines how these metrics perform under multiple testing objectives, diverse OOD scenarios, and different data modalities is still lacking. This leaves practitioners uncertain about which test selection metrics are most suitable for their specific objectives and contexts. To address this gap, we conduct an extensive empirical study of 15 existing metrics, evaluating them under three testing objectives (fault detection, performance estimation, and retraining guidance), five types of OOD scenarios (corrupted, adversarial, temporal, natural, and label shifts), three data modalities (image, text, and Android packages), and 13 DL models. In total, our study encompasses 1,640 experimental scenarios, offering a comprehensive evaluation and statistical analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a larger-than-usual empirical comparison of test selection metrics that adds coverage across objectives and modalities, but its rankings rest on a fixed set of models and shifts without clear checks for stability.

read the letter

The paper runs an empirical study of 15 test selection metrics under three objectives (fault detection, performance estimation, retraining guidance), five OOD shift types (including natural and label), three modalities (image, text, Android), and 13 models for a total of 1640 scenarios. That scale and the joint treatment of objectives plus non-image data are the concrete steps beyond the narrower prior work they cite. The statistical analysis they mention is also a step up from many smaller studies in this corner of software engineering for deep learning.

Referee Report

2 major / 1 minor

Summary. The paper claims that prior evaluations of test selection metrics for deep learning systems suffer from narrow testing objectives, limited OOD scenario coverage, and biased dataset selection. To address this gap, the authors conduct a large-scale empirical study evaluating 15 existing metrics under three testing objectives (fault detection, performance estimation, retraining guidance), five OOD scenarios (corrupted, adversarial, temporal, natural, label shifts), three data modalities (image, text, Android packages), and 13 DL models, for a total of 1,640 experimental scenarios, accompanied by statistical analysis to guide metric selection.

Significance. If the empirical findings hold and generalize, this work would provide substantial practical value by delivering a unified benchmark that clarifies which test selection metrics are effective under varying objectives and distribution shifts in safety-critical DL applications. The scale of 1,640 scenarios across multiple modalities is a notable strength that could support more informed practitioner decisions than narrower prior studies.

major comments (2)

[Abstract] Abstract: The description of the experimental design provides no details on the statistical methods used for analysis, multiple-testing corrections applied, or the precise implementation of the 15 metrics. Without this information, the reliability of the claimed 'comprehensive evaluation and statistical analysis' across 1,640 scenarios cannot be assessed.
[Study design] Study design (as described in the abstract and implied experimental setup): The central claim of offering generalizable insights from a unified benchmark rests on the assumption that the chosen 15 metrics, 13 models, five OOD types, and three modalities are representative. The manuscript does not appear to include sensitivity analyses (e.g., swapping in additional architectures such as Vision Transformers or varying dataset sizes) to verify stability of the metric rankings, which is load-bearing for the generalization of the findings.

minor comments (1)

[Abstract] Abstract: The sentence fragment 'with natural and label shifts are rarely considered' contains a grammatical error and should be rephrased for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the scale and potential practical value of our 1,640-scenario benchmark. We address each major comment below with honest revisions where the manuscript can be strengthened without misrepresenting our work.

read point-by-point responses

Referee: [Abstract] Abstract: The description of the experimental design provides no details on the statistical methods used for analysis, multiple-testing corrections applied, or the precise implementation of the 15 metrics. Without this information, the reliability of the claimed 'comprehensive evaluation and statistical analysis' across 1,640 scenarios cannot be assessed.

Authors: We agree the abstract's brevity omits these specifics. The full manuscript details the statistical approach in Section 3 (non-parametric tests including Wilcoxon signed-rank with Bonferroni correction for multiple comparisons across objectives and shifts) and metric implementations in Section 4 plus Appendix A (following original papers with our noted adaptations for each modality). In revision we will expand the abstract with a concise clause referencing the statistical framework and directing readers to those sections. revision: yes
Referee: [Study design] Study design (as described in the abstract and implied experimental setup): The central claim of offering generalizable insights from a unified benchmark rests on the assumption that the chosen 15 metrics, 13 models, five OOD types, and three modalities are representative. The manuscript does not appear to include sensitivity analyses (e.g., swapping in additional architectures such as Vision Transformers or varying dataset sizes) to verify stability of the metric rankings, which is load-bearing for the generalization of the findings.

Authors: Our 13 models were selected for diversity across modalities and architectures (CNNs, RNNs, and transformers where available in each domain) to reflect common practice; the five OOD types and 15 metrics likewise follow prevalence in prior work. The manuscript does not contain explicit sensitivity analyses such as adding Vision Transformers or dataset-size variations. We will add a dedicated paragraph in the Discussion section acknowledging this limitation, discussing potential impacts on ranking stability based on the observed consistency across the existing 1,640 scenarios, and outlining why the current selection supports the reported insights. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with direct experimental outcomes

full rationale

The paper performs an empirical study by selecting 15 metrics, running them on 13 models across 3 objectives, 5 OOD types, and 3 modalities (1640 scenarios total), then reporting statistical rankings. No equations, derivations, fitted parameters, or predictions appear; results are measured outcomes rather than re-statements of inputs. Prior-work citations only motivate the gap and do not load-bear any uniqueness theorem or ansatz. The representativeness concern is a validity issue, not a circular reduction of the claimed findings to the chosen setups by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the representativeness of the selected metrics and scenarios rather than on any fitted parameters or new invented entities.

axioms (2)

domain assumption The 15 chosen metrics adequately represent the space of existing test selection approaches in the literature.
Abstract states that 15 existing metrics were evaluated but does not justify the selection criteria.
domain assumption The five listed OOD scenarios and three modalities capture the distribution shifts and data types relevant to safety-critical DL deployment.
Abstract enumerates the scenarios without providing evidence that they are exhaustive or representative of real-world shifts.

pith-pipeline@v0.9.0 · 5575 in / 1352 out tokens · 48679 ms · 2026-05-08T08:00:17.471297+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

83 extracted references · 67 canonical work pages · 5 internal anchors

[1]

Zohreh Aghababaeyan, Manel Abdellatif, Lionel Briand, S Ramesh, and Mojtaba Bagherzadeh. 2023. Black-box testing of deep neural networks through test case diversity.IEEE Transactions on Software Engineering49, 5 (2023), 3182–3204. doi:10.1109/TSE.2023.3243522

work page doi:10.1109/tse.2023.3243522 2023
[2]

Zohreh Aghababaeyan, Manel Abdellatif, Mahboubeh Dadkhah, and Lionel Briand. 2024. Deepgd: A multi-objective black-box test selection approach for deep neural networks.ACM Transactions on Software Engineering and Methodology 33, 6 (2024), 1–29. doi:10.1145/3644388

work page doi:10.1145/3644388 2024
[3]

Kevin Allix, Tegawendé F Bissyandé, Jacques Klein, and Yves Le Traon. 2016. Androzoo: Collecting millions of Android apps for the research community. InProceedings of the 13th international conference on mining software repositories. 468–471. doi:10.1145/2901739.2903508

work page doi:10.1145/2901739.2903508 2016
[4]

Kylie Anglin, Qing Liu, and Vivian C Wong. 2024. A primer on the validity typology and threats to validity in education research.Asia Pacific Education Review25, 3 (2024), 557–574. doi:10.1007/s12564-024-09955-4

work page doi:10.1007/s12564-024-09955-4 2024
[5]

Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. 2014. Drebin: Effective and explainable detection of android malware in your pocket.. InNdss, Vol. 14. 23–26. doi:10.14722/ndss.2014. 23247

work page doi:10.14722/ndss.2014 2014
[6]

Mohammed Attaoui, Hazem Fahmy, Fabrizio Pastore, and Lionel Briand. 2023. Black-box safety analysis and retraining of DNNs based on feature extraction and clustering.ACM Transactions on Software Engineering and Methodology32, 3 (2023), 1–40. doi:10.1145/3550271

work page doi:10.1145/3550271 2023
[7]

Mohammed Oualid Attaoui, Hazem Fahmy, Fabrizio Pastore, and Lionel Briand. 2024. Supporting safety analysis of image-processing dnns through clustering-based approaches.ACM Transactions on Software Engineering and Methodology33, 5 (2024), 1–48. doi:10.1145/3643671

work page doi:10.1145/3643671 2024
[8]

David Berend, Xiaofei Xie, Lei Ma, Lingjun Zhou, Yang Liu, Chi Xu, and Jianjun Zhao. 2020. Cats are not fish: Deep learning testing calls for out-of-distribution awareness. InProceedings of the 35th IEEE/ACM international conference on automated software engineering. 1041–1052. doi:10.1145/3324884.3416609

work page doi:10.1145/3324884.3416609 2020
[9]

Ella Bingham and Heikki Mannila. 2001. Random projection in dimensionality reduction: applications to image and text data. InProceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. 245–250. doi:10.1145/502512.502546

work page doi:10.1145/502512.502546 2001
[10]

Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to end learning for self-driving cars.arXiv preprint arXiv:1604.07316(2016). https://arxiv.org/abs/1604.07316

work page internal anchor Pith review arXiv 2016
[11]

Taejoon Byun, Vaibhav Sharma, Abhishek Vijayakumar, Sanjai Rayadurgam, and Darren Cofer. 2019. Input prioritization for testing neural networks. In2019 IEEE International Conference On Artificial Intelligence Testing (AITest). IEEE, 63–70. doi:10.1109/AITest.2019.000-6

work page doi:10.1109/aitest.2019.000-6 2019
[12]

Junjie Chen, Zhuo Wu, Zan Wang, Hanmo You, Lingming Zhang, and Ming Yan. 2020. Practical accuracy estimation for efficient deep neural network testing.ACM Transactions on Software Engineering and Methodology (TOSEM)29, 4 (2020), 1–35. doi:10.1145/3394112

work page doi:10.1145/3394112 2020
[13]

Lingjiao Chen, Matei Zaharia, and James Y Zou. 2022. Estimating and explaining model performance when both covariates and labels shift.Advances in Neural Information Processing Systems35 (2022), 11467–11479. https: //dl.acm.org/doi/10.5555/3600270.3601103

work page doi:10.5555/3600270.3601103 2022
[14]

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators.arXiv preprint arXiv:2003.10555(2020). https://arxiv.org/abs/2003.10555

work page arXiv 2020
[15]

Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. 2017. EMNIST: Extending MNIST to hand- written letters.2017 International Joint Conference on Neural Networks (IJCNN)(2017). doi:10.1109/ijcnn.2017.7966217

work page doi:10.1109/ijcnn.2017.7966217 2017
[16]

Demet Demir, Aysu Betin Can, and Elif Surer. 2024. Test selection for deep neural networks using meta-models with uncertainty metrics. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 678–690. doi:10.1145/3650212.3680312

work page doi:10.1145/3650212.3680312 2024
[17]

K. P. Devakumar. 2020. IMDB Review Classification LSTM, GRU, CNN, GloVe. https://www.kaggle.com/code/imdevskp/imdb-review-classification-lstm-gru-cnn-glove/

2020
[18]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186. doi:10.18653...

work page doi:10.18653/v1/n19-1423 2019
[19]

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Inkdd, Vol. 96. 226–231. https://dl.acm.org/doi/10.5555/3001460.3001507

work page doi:10.5555/3001460.3001507 1996
[20]

Yang Feng, Qingkai Shi, Xinyu Gao, Jun Wan, Chunrong Fang, and Zhenyu Chen. 2020. Deepgini: prioritizing massive tests to enhance the robustness of deep neural networks. InProceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis. 177–188. doi:10.1145/3395363.3397357 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE058...

work page doi:10.1145/3395363.3397357 2020
[21]

Xinyu Gao, Yang Feng, Yining Yin, Zixi Liu, Zhenyu Chen, and Baowen Xu. 2022. Adaptive test selection for deep neural networks. InProceedings of the 44th international conference on software engineering. 73–85. doi:10.1145/3510003.3510232

work page doi:10.1145/3510003.3510232 2022
[22]

Jacob Gildenblat. 2016. Visualizations for understanding the regressed wheel steering angle for self-driving cars. https://github.com/jacobgil/keras-steering-angle-visualizations

2016
[23]

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572(2014). https://doi.org/10.48550/arXiv.1412.6572

work page internal anchor Pith review doi:10.48550/arxiv.1412.6572 2014
[24]

Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. 2017. Adversarial examples for malware detection. InComputer Security–ESORICS 2017: 22nd European Symposium on Research in Computer Security, Oslo, Norway, September 11-15, 2017, Proceedings, Part II 22. Springer, 62–79. doi:10.1007/978-3-319- 66399-9_4

work page doi:10.1007/978-3-319- 2017
[26]

Antonio Guerriero, Roberto Pietrantuono, and Stefano Russo. 2024. DeepSample: DNN sampling-based testing for operational accuracy assessment. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12. doi:10.1109/ICSE43902.2021.00042

work page doi:10.1109/icse43902.2021.00042 2024
[27]

Chris Gundling. 2016. Steering angle model: Cg32.https://github.com/udacity/self-driving-car/tree/master/steering- models/community-models/cg23

2016
[28]

Yao Hao, Zhiqiu Huang, Hongjing Guo, and Guohua Shen. 2023. Test input selection for deep neural network enhancement based on multiple-objective optimization. In2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 534–545. doi:10.1109/SANER56733.2023.00056

work page doi:10.1109/saner56733.2023.00056 2023
[29]

Fabrice Harel-Canada, Lingxiao Wang, Muhammad Ali Gulzar, Quanquan Gu, and Miryung Kim. 2020. Is neuron coverage a meaningful measure for testing deep neural networks?. InProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 851–862. doi:10.1145/3368089.3409754

work page doi:10.1145/3368089.3409754 2020
[30]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778. doi:10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016
[31]

Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks.arXiv preprint arXiv:1610.02136(2016). doi:10.48550/arXiv.1610.02136

work page internal anchor Pith review doi:10.48550/arxiv.1610.02136 2016
[32]

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. InProceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 168–177. doi:10.1145/1014052.1014073

work page doi:10.1145/1014052.1014073 2004
[33]

Qiang Hu, Yuejun Guo, Maxime Cordy, Xiaofei Xie, Lei Ma, Mike Papadakis, and Yves Le Traon. 2022. An empirical study on data distribution-aware test selection for deep learning enhancement.ACM Transactions on Software Engineering and Methodology (TOSEM)31, 4 (2022), 1–30. doi:10.1145/3511598

work page doi:10.1145/3511598 2022
[34]

Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Lei Ma, Mike Papadakis, and Yves Le Traon. 2024. Test optimization in dnn testing: A survey.ACM Transactions on Software Engineering and Methodology33, 4 (2024), 1–42. doi:10.1145/ 3643678

2024
[35]

Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Wei Ma, Mike Papadakis, Lei Ma, and Yves Le Traon. 2025. Assessing the Robustness of Test Selection Methods for Deep Neural Networks.ACM Transactions on Software Engineering and Methodology(2025). doi:10.1145/3715693

work page doi:10.1145/3715693 2025
[36]

Shachar Kaufman, Saharon Rosset, Claudia Perlich, and Ori Stitelman. 2012. Leakage in data mining: Formulation, detection, and avoidance.ACM Transactions on Knowledge Discovery from Data (TKDD)6, 4 (2012), 1–21. doi:10.1145/ 2382577.2382579

work page arXiv 2012
[37]

Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding deep learning system testing using surprise adequacy. In2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 1039–1049. doi:10.1109/ICSE.2019.00108

work page doi:10.1109/icse.2019.00108 2019
[38]

Jinhan Kim, Jeongil Ju, Robert Feldt, and Shin Yoo. 2020. Reducing dnn labelling cost using surprise adequacy: An industrial case study for autonomous driving. InProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1466–1476. doi:10.1145/3368089. 3417065

work page doi:10.1145/3368089 2020
[39]

Ronald S King. 2013. Cluster analysis and data mining: An introduction. (2013). https://dl.acm.org/doi/abs/10.5555/ 2823861

2013
[40]

Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. 2018. Adversarial examples in the physical world. InArtificial intelligence safety and security. Chapman and Hall/CRC, 99–112. doi:10.1201/9781351251389-8

work page doi:10.1201/9781351251389-8 2018
[41]

Yann LeCun. 1998. The MNIST database of handwritten digits.http://yann. lecun. com/exdb/mnist/(1998). doi:10. 24432/C53K8Q

1998
[42]

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition.Proc. IEEE86, 11 (1998), 2278–2324. doi:10.1109/5.726791 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE058. Publication date: July 2026. Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribut...

work page doi:10.1109/5.726791 1998
[43]

Deqiang Li and Qianmu Li. 2020. Adversarial deep ensemble: Evasion attacks and defenses for malware detection. IEEE Transactions on Information Forensics and Security15 (2020), 3886–3900. doi:10.1109/TIFS.2020.3003571

work page doi:10.1109/tifs.2020.3003571 2020
[44]

Deqiang Li, Tian Qiu, Shuo Chen, Qianmu Li, and Shouhuai Xu. 2021. Can we leverage predictive uncertainty to detect dataset shift and adversarial examples in android malware detection?. InProceedings of the 37th Annual Computer Security Applications Conference. 596–608. doi:10.1145/3485832.3485916

work page doi:10.1145/3485832.3485916 2021
[45]

Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, and Tegawendé F Bissyandé. 2024. Prioritizing test cases for deep learning-based video classifiers.Empirical Software Engineering29, 5 (2024), 111. doi:10.1007/s10664-024-10520-1

work page doi:10.1007/s10664-024-10520-1 2024
[46]

Yinghua Li, Xueqi Dang, Lei Ma, Jacques Klein, Yves Le Traon, and Tegawende F Bissyande. 2024. Test input prioritization for 3d point clouds.ACM Transactions on Software Engineering and Methodology33, 5 (2024), 1–44. doi:10.1145/3643676

work page doi:10.1145/3643676 2024
[47]

Zenan Li, Xiaoxing Ma, Chang Xu, Chun Cao, Jingwei Xu, and Jian Lü. 2019. Boosting operational dnn testing efficiency through conditioning. InProceedings of the 2019 27th ACM Joint meeting on European software engineering conference and symposium on the foundations of software engineering. 499–509. doi:10.1145/3338906.3338930

work page doi:10.1145/3338906.3338930 2019
[48]

Jingjing Liang, Sebastian Elbaum, and Gregg Rothermel. 2018. Redefining prioritization: continuous prioritization for continuous integration. InProceedings of the 40th International Conference on Software Engineering. 688–698. doi:10.1145/3180155.3180213

work page doi:10.1145/3180155.3180213 2018
[49]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692 (2019). https://doi.org/10.48550/arXiv.1907.11692

work page internal anchor Pith review doi:10.48550/arxiv.1907.11692 2019
[50]

Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, et al
[51]

InProceedings of the 33rd ACM/IEEE international conference on automated software engineering

Deepgauge: Multi-granularity testing criteria for deep learning systems. InProceedings of the 33rd ACM/IEEE international conference on automated software engineering. 120–131. doi:10.1145/3238147.3238202

work page doi:10.1145/3238147.3238202
[52]

Wei Ma, Mike Papadakis, Anestis Tsakmalis, Maxime Cordy, and Yves Le Traon. 2021. Test selection for deep learning systems.ACM Transactions on Software Engineering and Methodology (TOSEM)30, 2 (2021), 1–22. doi:10.1145/3417330

work page doi:10.1145/3417330 2021
[53]

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. InProceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 142–150. https://dl.acm.org/doi/10.5555/2002472.2002491

work page doi:10.5555/2002472.2002491 2011
[54]

James MacQueen. 1967. Some methods for classification and analysis of multivariate observations. InProceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, Vol. 5. University of California press, 281–298. https://api.semanticscholar.org/CorpusID:6278891

1967
[55]

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083(2017). https://doi.org/10.48550/arXiv. 1706.06083

work page internal anchor Pith review doi:10.48550/arxiv 2017
[56]

Leland McInnes, John Healy, Steve Astels, et al. 2017. hdbscan: Hierarchical density based clustering.J. Open Source Softw.2, 11 (2017), 205. doi:10.21105/joss.00205

work page doi:10.21105/joss.00205 2017
[57]

Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426(2018). doi:10.21105/joss.00861

work page doi:10.21105/joss.00861 2018
[58]

Niall McLaughlin, Jesus Martinez del Rincon, BooJoong Kang, Suleiman Yerima, Paul Miller, Sakir Sezer, Yeganeh Safaei, Erik Trickel, Ziming Zhao, Adam Doupé, et al. 2017. Deep android malware detection. InProceedings of the seventh ACM on conference on data and application security and privacy. 301–308. doi:10.1145/3029806.3029823

work page doi:10.1145/3029806.3029823 2017
[59]

Ecker, Matthias Bethge, and Wieland Brendel

Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S. Ecker, Matthias Bethge, and Wieland Brendel. 2019. Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming.arXiv preprint arXiv:1907.07484(2019). https://doi.org/10.48550/arXiv.1907.07484

work page doi:10.48550/arxiv.1907.07484 2019
[60]

Jose G Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and Francisco Herrera. 2012. A unifying view on dataset shift in classification.Pattern recognition45, 1 (2012), 521–530. doi:10.1016/j.patcog.2011.06.019

work page doi:10.1016/j.patcog.2011.06.019 2012
[61]

Vasilii Mosin, Miroslaw Staron, Darko Durisic, Francisco Gomes de Oliveira Neto, Sushant Kumar Pandey, and Ashok Chaitanya Koppisetty. 2022. Comparing input prioritization techniques for testing deep learning algorithms. In 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 76–83. doi:10.1109/ SEAA56994.2022.00020

work page arXiv 2022
[62]

Davoud Moulavi, Pablo A Jaskowiak, Ricardo JGB Campello, Arthur Zimek, and Jörg Sander. 2014. Density-based clustering validation. InProceedings of the 2014 SIAM international conference on data mining. SIAM, 839–847. doi:10. 1137/1.9781611973440.96

2014
[63]

Norman Mu and Justin Gilmer. 2019. Mnist-c: A robustness benchmark for computer vision.arXiv preprint arXiv:1906.02337(2019). https://doi.org/10.48550/arXiv.1906.02337

work page doi:10.48550/arxiv.1906.02337 2019
[64]

Apoorv Nandan. 2020. Text classification with Transformer.https://keras.io/examples/nlp/text_classification_with_ transformer/. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE058. Publication date: July 2026. FSE058:24 Jingyu Zhang, Fan Wang, Jacky Keung, Yihan Liao, Yan Xiao, and Lei Ma

2020
[65]

Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space.The London, Edinburgh, and Dublin philosophical magazine and journal of science2, 11 (1901), 559–572. doi:10.1080/14786440109462720

work page doi:10.1080/14786440109462720 1901
[66]

Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. Deepxplore: Automated whitebox testing of deep learning systems. Inproceedings of the 26th Symposium on Operating Systems Principles. 1–18. doi:10.1145/3361566

work page doi:10.1145/3361566 2017
[67]

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. InProceedings of the 57th annual meeting of the association for computational linguistics. 1085–1097. https://api.semanticscholar.org/CorpusID:196202909

2019
[68]

Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.Journal of computational and applied mathematics20 (1987), 53–65. doi:10.1016/0377-0427(87)90125-7

work page doi:10.1016/0377-0427(87)90125-7 1987
[69]

Alexandru Constantin Serban, Erik Poll, and Joost Visser. 2018. Adversarial examples-a complete characterisation of the phenomenon.arXiv preprint arXiv:1810.01185(2018). doi:10.48550/arXiv.1810.01185

work page doi:10.48550/arxiv.1810.01185 2018
[70]

Weijun Shen, Yanhui Li, Lin Chen, Yuanlei Han, Yuming Zhou, and Baowen Xu. 2020. Multiple-boundary clustering and prioritization to promote neural network retraining. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 410–422. doi:10.1145/3324884.3416621

work page doi:10.1145/3324884.3416621 2020
[71]

Ying Shi, Beibei Yin, Zheng Zheng, and Tiancheng Li. 2021. An empirical study on test case prioritization metrics for deep neural networks. In2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS). IEEE, 157–166. doi:10.1109/QRS54544.2021.00027

work page doi:10.1109/qrs54544.2021.00027 2021
[72]

Alex Staravoitau. 2016. Behavioral cloning: end-to-end learning for self-driving cars. https://github.com/navoshta/behavioral-cloning

2016
[73]

Sully-Chen. 2016. Autopilot-Tensorflow.https://github.com/SullyChen/Autopilot-TensorFlow

2016
[74]

Weifeng Sun, Meng Yan, Zhongxin Liu, and David Lo. 2023. Robust test selection for deep neural networks.IEEE Transactions on Software Engineering49, 12 (2023), 5250–5278. doi:10.1109/TSE.2023.3330982

work page doi:10.1109/tse.2023.3330982 2023
[75]

Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E Hassan, and Kenichi Matsumoto. 2016. An empirical comparison of model validation techniques for defect prediction models.IEEE Transactions on Software Engineering43, 1 (2016), 1–18. doi:10.1109/TSE.2016.2584050

work page doi:10.1109/tse.2016.2584050 2016
[76]

TextAttack. [n. d.]. Transforms an input by replacing any word with ’banana’.https://textattack.readthedocs.io/en/latest/
[77]

Udacity. 2016. Using Deep Learning to Predict Steering Angles. https://medium.com/udacity/challenge-2-using-deep- learning-to-predict-steering-angles-f42004a36ff3

2016
[78]

Dan Wang and Yi Shang. 2014. A new active labeling method for deep learning. In2014 International joint conference on neural networks (IJCNN). IEEE, 112–119. doi:10.1109/IJCNN.2014.6889457

work page doi:10.1109/ijcnn.2014.6889457 2014
[79]

Zhiyu Wang, Sihan Xu, Lingling Fan, Xiangrui Cai, Linyu Li, and Zheli Liu. 2024. Can coverage criteria guide failure discovery for image classifiers? an empirical study.ACM Transactions on Software Engineering and Methodology33, 7 (2024), 1–28. doi:10.1145/3672446

work page doi:10.1145/3672446 2024
[80]

Michael Weiss and Paolo Tonella. 2022. Simple techniques work surprisingly well for neural network test prioritization and active learning (replicability study). InProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 139–150. doi:10.1145/3533767.3534375

work page doi:10.1145/3533767.3534375 2022
[81]

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? Advances in neural information processing systems27 (2014). https://api.semanticscholar.org/CorpusID:362467

2014

Showing first 80 references.