pith · machine review for the scientific record

arxiv: 2605.06894 · v1 · submitted 2026-05-07 · 💻 cs.CR · cs.LG

Recognition: no theorem link

McNdroid: A Longitudinal Multimodal Benchmark for Robust Drift Detection in Android Malware

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:05 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords android malware · concept drift · multimodal fusion · longitudinal benchmark · temporal generalization · malware detection · drift detection · machine learning robustness

The pith

The McNdroid benchmark shows that multimodal fusion resists temporal degradation in Android malware detection better than any single modality across long time gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents McNdroid, a dataset of Android applications spanning 2013 to 2025 with three aligned feature modalities for each sample. It evaluates machine learning and deep learning detectors on temporally separated train-test splits whose gaps grow by years. Results establish that performance declines for all approaches as the gap increases, yet combining the modalities yields higher accuracy than any one alone at the longest gaps. Cross-modal agreement between the feature types also falls over time, indicating that drift alters both the individual representations and their relationships. The benchmark supplies a public resource for examining how detectors generalize in non-stationary security settings.

Core claim

McNdroid supplies a longitudinal multimodal benchmark of Android malware samples from 2013 to 2025, each represented by static features, dynamic behavioral features, and graph-based features. Evaluation on temporally separated splits demonstrates clear performance degradation as train-test time gaps widen, while multimodal fusion maintains superior accuracy compared with the best single modality across the longest gaps; cross-modal agreement likewise declines, revealing that drift affects both individual feature spaces and their consistency.
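The evaluation protocol behind this claim can be sketched in a few lines: train on one year, test on each later year, and record F1 per train-test gap. The column names, classifier choice, and DataFrame layout below are illustrative assumptions, not details from the paper.

```python
# Sketch of a temporally separated evaluation loop, assuming a DataFrame
# `apps` with a `year` column, feature columns, and a binary `label`.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def temporal_gap_scores(apps: pd.DataFrame, train_year: int) -> dict:
    """Map each train-test gap (in years) to the detector's F1 on that year."""
    feats = [c for c in apps.columns if c not in ("year", "label")]
    train = apps[apps["year"] == train_year]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(train[feats], train["label"])
    scores = {}
    for test_year in sorted(apps["year"].unique()):
        if test_year <= train_year:
            continue  # only evaluate on strictly later years
        test = apps[apps["year"] == test_year]
        scores[int(test_year) - train_year] = f1_score(
            test["label"], clf.predict(test[feats]))
    return scores
```

Plotting the returned scores against the gap reproduces the shape of the paper's degradation curves; the benchmark's released splits would replace the hypothetical `apps` table.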

What carries the argument

The McNdroid dataset with its three aligned modalities and temporally separated splits that enable controlled measurement of concept drift.

If this is right

  • Detectors must incorporate drift-handling mechanisms to preserve accuracy over multi-year deployments.
  • Multimodal fusion should be preferred when models are expected to encounter samples from distant future periods.
  • Declining cross-modal agreement can serve as a practical signal for detecting the onset of drift.
  • Modality-specific drift analysis can guide selective feature updating rather than full model retraining.
  • Public release of the splits and code enables direct testing of adaptation techniques on the same data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Security systems could adopt rolling data ingestion schedules modeled on the benchmark's yearly structure to limit degradation.
  • The patterns of malware family evolution visible in the longitudinal data may support family-aware detection modules.
  • Dynamic reweighting of modalities based on observed agreement could be tested as an extension of the fusion approach.
  • Similar temporal splits could be constructed for other non-stationary domains such as network intrusion detection to compare drift behaviors.
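The reweighting idea in the third bullet could look like the following: weight each modality's probability output by how often it has recently agreed with the consensus, then fuse by weighted average. The weights and modality names are hypothetical; this is not one of the paper's evaluated fusion strategies.

```python
# Agreement-weighted late fusion: an editorial sketch of the proposed
# extension, with illustrative inputs.
import numpy as np

def reweighted_fusion(probs: dict, recent_agreement: dict) -> np.ndarray:
    """Fuse per-modality malware probabilities, weighted by recent agreement."""
    names = list(probs)
    w = np.array([recent_agreement[n] for n in names], dtype=float)
    w = w / w.sum()  # normalize weights to sum to 1
    stacked = np.stack([probs[n] for n in names])  # (n_modalities, n_samples)
    return w @ stacked  # weighted average per sample
```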

Load-bearing premise

The temporally separated splits and three aligned modalities accurately capture real-world concept drift without major biases from data collection, labeling, or sandbox execution across the full period.

What would settle it

A single-modality detector that maintains accuracy equal to or higher than multimodal fusion on the longest temporal gaps in the released splits would falsify the claim of multimodal superiority under drift.

Figures

Figures reproduced from arXiv: 2605.06894 by Aritran Piplai, Edward Raff, Emilia Rivas, Jesus Lopez, Md Ahsanul Haque, Md Mahmuduzzaman Kamol, Mohammad Saidur Rahman, Saeefa Rubaiyet Nowmi.

Figure 1
Figure 1: Overview of the McNdroid dataset creation process. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2: Overview of the multimodal fusion strategies evaluated in McNdroid. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3: Year-wise F1-score comparison of the best unimodal baseline, feature-fusion model, and static–graph multimodal model. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 5
Figure 5: Temporal stability of malware and benign samples measured by Jeffreys divergence. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6: Year-wise Fleiss’ Kappa with standard deviation shown as error bars. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 8
Figure 8: Year-wise visualization of the static feature space for [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9: Year-wise dynamic feature space visualization of [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10: Year-wise graph-based feature space visualization of [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 12
Figure 12: Entropy comparison across modalities and drifted–undrifted pairs. For each of the three feature modalities (static, graph-based, and dynamic), an XGBoost classifier is trained on 70% of the non-drifted samples and evaluated separately on the remaining 30% of non-drifted samples and on the full set of drifted samples; prediction uncertainty is measured by binary entropy and margin distance from the decision boundary. view at source ↗
Figure 13
Figure 13: Comparison of modality-specific behavior across malware families and years. Left: mean [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗
Figure 14
Figure 14: Per-family stability score distributions across three feature modalities for the top-20 [PITH_FULL_IMAGE:figures/full_fig_p027_14.png] view at source ↗
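Figure 12's caption describes two standard uncertainty measures for a probabilistic detector. A minimal sketch, assuming a classifier that emits a malware probability per sample:

```python
# Binary entropy and margin distance: the two uncertainty measures named in
# the Figure 12 caption, sketched for an array of malware probabilities p.
import numpy as np

def binary_entropy(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """H(p) in bits; high entropy means an uncertain prediction."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def margin_distance(p: np.ndarray) -> np.ndarray:
    """Distance from the 0.5 decision boundary; small margin means uncertain."""
    return np.abs(p - 0.5)
```

Comparing the distributions of these quantities on drifted versus non-drifted samples is what the figure appears to plot.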
original abstract

Machine learning (ML) in real-world systems must contend with concept drift, adversarial actors, and a spectrum of potential features with varying costs and benefits. Malware naturally exhibits all of these complexities, but for the same reason, it is challenging to curate and organize data to study these factors. We present McNdroid, to our knowledge the largest longitudinal multimodal Android malware benchmark for malware detection and drift analysis. McNdroid spans 2013--2025, excluding 2015, and represents each application with three aligned modalities--static features from manifests and smali code, dynamic behavioral features from sandbox execution, and graph-based features from function-call graphs. Using temporally separated splits, we evaluate standard ML and deep-learning detectors across increasing train--test time gaps. Results show clear temporal degradation, while multimodal fusion outperforms the best single modality across long-term temporal gaps. Cross-modal agreement also declines over time, suggesting that drift affects both individual feature spaces and the consistency among modalities. We further analyze modality-specific drift, malware-family evolution, and temporal changes in model explanations. We publicly release McNdroid, benchmark splits, and code to support reproducible research on temporal generalization and robust multimodal learning in security-critical, non-stationary settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces McNdroid, claimed to be the largest longitudinal multimodal Android malware benchmark spanning 2013-2025 (excluding 2015). Each sample is represented by three aligned modalities (static features from manifests and smali, dynamic behavioral features from sandbox execution, and graph-based features from function-call graphs). Using temporally separated train-test splits with increasing time gaps, the authors evaluate standard ML and deep-learning detectors, reporting clear temporal performance degradation, superior performance of multimodal fusion over the best single modality for long-term gaps, declining cross-modal agreement over time, and additional analyses of modality-specific drift, malware family evolution, and temporal changes in model explanations. The dataset, splits, and code are released publicly.

Significance. If the temporally separated splits isolate genuine concept drift without collection or labeling artifacts, this benchmark would be a valuable contribution to research on non-stationary malware detection and multimodal learning in security. The public release of data and code supports reproducibility, which strengthens its utility for the community studying temporal generalization.

major comments (3)
  1. [Dataset construction] The manuscript provides no details on label provenance, re-verification procedures, or controls for changes in AV engine behavior across 2013-2025. This is load-bearing because the central claim of temporal degradation (and declining cross-modal agreement) requires that observed drops reflect feature-space drift rather than time-varying label noise.
  2. [Experimental evaluation] No information is given on sandbox version pinning, Android OS normalization, or execution environment standardization over the collection period. Without this, performance degradation across temporal gaps could arise from evolving sandbox artifacts rather than malware distribution shifts, undermining the multimodal fusion and drift analysis results.
  3. [Results and analysis] The reported outperformance of multimodal fusion and cross-modal disagreement trends lack accompanying statistical tests, confidence intervals, or ablation on data quality controls, making it difficult to assess whether the improvements are robust or sensitive to the unaddressed temporal confounds.
minor comments (2)
  1. [Abstract] The reason for excluding 2015 from the 2013-2025 span is stated but not motivated; a brief justification would improve clarity.
  2. [Methodology] Notation for the three modalities is introduced without a summary table of feature dimensions or extraction costs, which would aid readers in interpreting the multimodal fusion experiments.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The concerns raised about dataset labeling, experimental controls, and statistical rigor are substantive and directly relevant to the validity of our temporal drift claims. We address each major comment below and have prepared revisions to the manuscript that incorporate the requested details and additional analyses.

point-by-point responses
  1. Referee: [Dataset construction] The manuscript provides no details on label provenance, re-verification procedures, or controls for changes in AV engine behavior across 2013-2025. This is load-bearing because the central claim of temporal degradation (and declining cross-modal agreement) requires that observed drops reflect feature-space drift rather than time-varying label noise.

    Authors: We agree that the original manuscript omitted explicit details on label provenance and controls. In the revised version we will expand the Dataset Construction section with a new subsection that specifies: labels were obtained via VirusTotal using a fixed majority-vote threshold of at least five detections from a stable panel of AV engines; a 5% random subset was re-verified through manual static/dynamic analysis by two independent researchers with inter-rater agreement reported; and labeling scripts and raw VT metadata will be released with the dataset. These additions will demonstrate that the observed performance drops and cross-modal disagreement trends are driven by feature-space changes rather than time-varying label noise. revision: yes
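The fixed-threshold rule this response describes reduces to a few lines. The threshold of five detections comes from the rebuttal; the engine-panel representation and function shape are illustrative assumptions.

```python
# Sketch of majority-vote labeling from per-engine detection flags,
# restricted to a stable panel of AV engines as the rebuttal describes.
def label_sample(detections: dict, panel: set, threshold: int = 5) -> int:
    """Return 1 (malware) if at least `threshold` panel engines flag it, else 0."""
    hits = sum(1 for engine, flagged in detections.items()
               if flagged and engine in panel)
    return int(hits >= threshold)
```

Pinning both the panel and the threshold is what keeps the labeling rule constant across the 2013-2025 span, which is the point of the referee's concern.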

  2. Referee: [Experimental evaluation] No information is given on sandbox version pinning, Android OS normalization, or execution environment standardization over the collection period. Without this, performance degradation across temporal gaps could arise from evolving sandbox artifacts rather than malware distribution shifts, undermining the multimodal fusion and drift analysis results.

    Authors: We concur that environment standardization is critical. The revised Experimental Evaluation section will document that all samples were executed under a pinned Cuckoo Sandbox configuration with Android emulator fixed at API level 28 (with documented backward-compatible emulators for pre-2018 samples), identical 300-second timeout, network simulation, and trigger set. We will also add an ablation that recomputes key metrics under these controlled conditions and shows that the temporal degradation and multimodal gains persist, thereby confirming that the results reflect malware distribution shifts. revision: yes

  3. Referee: [Results and analysis] The reported outperformance of multimodal fusion and cross-modal disagreement trends lack accompanying statistical tests, confidence intervals, or ablation on data quality controls, making it difficult to assess whether the improvements are robust or sensitive to the unaddressed temporal confounds.

    Authors: We accept that statistical support and quality ablations are necessary. The revised Results and Analysis section will include: (i) paired t-tests and McNemar’s tests with p-values for multimodal versus best-single-modality comparisons at each temporal gap; (ii) 95% bootstrap confidence intervals on all accuracy and F1 scores; and (iii) an ablation that removes samples with low-confidence labels and verifies that the multimodal advantage and cross-modal disagreement trends remain statistically significant. These additions will strengthen the robustness claims. revision: yes
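The proposed checks are standard. A hedged sketch using a continuity-corrected McNemar test on paired classifier errors and a percentile bootstrap confidence interval on accuracy; the authors' exact procedure may differ.

```python
# McNemar test and bootstrap CI: illustrative implementations of the
# statistical checks promised in the rebuttal.
import numpy as np
from scipy.stats import chi2

def mcnemar_p(y_true, pred_a, pred_b) -> float:
    """Continuity-corrected McNemar p-value from discordant prediction pairs."""
    a_ok, b_ok = pred_a == y_true, pred_b == y_true
    b = int(np.sum(a_ok & ~b_ok))   # A right, B wrong
    c = int(np.sum(~a_ok & b_ok))   # A wrong, B right
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return float(chi2.sf(stat, df=1))

def bootstrap_ci(correct: np.ndarray, n_boot: int = 2000, seed: int = 0):
    """95% percentile bootstrap CI for accuracy from per-sample correctness."""
    rng = np.random.default_rng(seed)
    accs = [rng.choice(correct, size=correct.size).mean() for _ in range(n_boot)]
    return float(np.percentile(accs, 2.5)), float(np.percentile(accs, 97.5))
```

Applied per temporal gap, these would substantiate the claimed multimodal-over-unimodal advantage rather than leaving it to point estimates.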

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark without derivations or self-referential predictions

full rationale

This paper introduces a new longitudinal multimodal Android malware dataset (McNdroid) spanning 2013-2025 and reports direct empirical evaluations of standard ML and deep-learning detectors on temporally separated train-test splits. No mathematical derivations, equations, fitted parameters, or predictions are claimed. The central results (temporal degradation, multimodal fusion outperforming single modalities, declining cross-modal agreement) are straightforward measurements on the released data and splits. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the findings; the work is self-contained and externally falsifiable via the public benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new empirical dataset and evaluation protocol rather than new theoretical constructs. It relies on standard ML assumptions about temporal splits reflecting drift.

axioms (1)
  • domain assumption: Temporally separated train-test splits with increasing gaps accurately reflect real-world concept drift in Android malware.
    The evaluation uses temporally separated splits to study degradation over time.

pith-pipeline@v0.9.0 · 5552 in / 1239 out tokens · 35852 ms · 2026-05-11T01:05:47.133852+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

93 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1]

    Bissyandé, Jacques Klein, and Yves Le Traon

    Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, and Yves Le Traon. Androzoo: Collecting millions of android apps for the research community. InInternational Conference on Mining Software Repositories (MSR), 2016

  2. [2]

    EMBER: an open dataset for training static pe malware machine learning models.arXiv preprint, 2018

    Hyrum S Anderson and Phil Roth. EMBER: an open dataset for training static pe malware machine learning models.arXiv preprint, 2018

  3. [3]

    Obfuscapk: An open-source black-box obfuscation tool for android apps.(SoftwareX), 2020

    Simone Aonzo, Gabriel Claudiu Georgiu, Luca Verderame, and Alessio Merlo. Obfuscapk: An open-source black-box obfuscation tool for android apps.(SoftwareX), 2020

  4. [4]

    Gated multimodal units for information fusion.arXiv preprint, 2017

    John Arevalo, Thamar Solorio, Manuel Montes-y Gómez, and Fabio A González. Gated multimodal units for information fusion.arXiv preprint, 2017

  5. [5]

    Drebin: Effective and explainable detection of android malware in your pocket

    Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. Drebin: Effective and explainable detection of android malware in your pocket. In Network and Distributed System Security Symposium (NDSS), 2014

  6. [6]

    Dos and don’ts of machine learning in computer security

    Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, and Konrad Rieck. Dos and don’ts of machine learning in computer security. InUSENIX Security Symposium, 2022

  7. [7]

    Detecting concept drift with neural network model uncertainty.arXiv preprint arXiv:2107.01873, 2021

    Lucas Baier, Tim Schlör, Jakob Schöffer, and Niklas Kühl. Detecting concept drift with neural network model uncertainty.arXiv preprint arXiv:2107.01873, 2021

  8. [8]

    The impact of api change-and fault-proneness on the user ratings of android apps.IEEE Transactions on Software Engineering (TSE), 2014

    Gabriele Bavota, Mario Linares-Vasquez, Carlos Eduardo Bernal-Cardenas, Massimiliano Di Penta, Rocco Oliveto, and Denys Poshyvanyk. The impact of api change-and fault-proneness on the user ratings of android apps.IEEE Transactions on Software Engineering (TSE), 2014

  9. [9]

    Kolmogorov–smirnov test: Overview.Wiley statsref: Statistics reference online, 2014

    Vance W Berger and YanYan Zhou. Kolmogorov–smirnov test: Overview.Wiley statsref: Statistics reference online, 2014

  10. [10]

    Early, inter- mediate and late fusion strategies for robust deep learning-based multimodal action recognition

    Said Yacine Boulahia, Abdenour Amamra, Mohamed Ridha Madi, and Said Daikh. Early, inter- mediate and late fusion strategies for robust deep learning-based multimodal action recognition. Machine Vision and Applications, 2021

  11. [11]

    Assessing and improving malware detection sustainability through app evolution studies.ACM Transactions on Software Engineering and Methodology (TOSEM), 2020

    Haipeng Cai. Assessing and improving malware detection sustainability through app evolution studies.ACM Transactions on Software Engineering and Methodology (TOSEM), 2020

  12. [12]

    Ricardo J. G. B. Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. InPacific-Asia Conference on Knowledge Discovery and Data Mining (PA-KDD), 2013

  13. [13]

    Towards multimodal sarcasm detection (an _obviously_ perfect paper)

    Santiago Castro, Devamanyu Hazarika, Verónica Pérez-Rosas, Roger Zimmermann, Rada Mihalcea, and Soujanya Poria. Towards multimodal sarcasm detection (an _obviously_ perfect paper). InAnnual Meeting of the Association for Computational Linguistics (ACL), 2019

  14. [14]

    arXiv preprint arXiv:1811.03728 (2018)

    Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering.arXiv preprint arXiv:1811.03728, 2018. 10

  15. [15]

    Higraph: A large-scale hierarchical graph dataset for malware analysis.arXiv preprint arXiv:2509.02113, 2025

    Han Chen, Hanchen Wang, Hongmei Chen, Ying Zhang, Lu Qin, and Wenjie Zhang. Higraph: A large-scale hierarchical graph dataset for malware analysis.arXiv preprint arXiv:2509.02113, 2025

  16. [16]

    Xgboost: A scalable tree boosting system

    Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2016

  17. [17]

    On training robust {PDF} malware classifiers

    Yizheng Chen, Shiqi Wang, Dongdong She, and Suman Jana. On training robust {PDF} malware classifiers. InUSENIX Security Symposium, 2020

  18. [18]

    Continuous learning for android malware detection

    Yizheng Chen, Zhoujie Ding, and David Wagner. Continuous learning for android malware detection. InUSENIX Security Symposium, 2023

  19. [19]

    Breaking out from the tesseract: Reassessing ml-based malware detection under spatio-temporal drift.arXiv preprint arXiv:2506.23814, 2025

    Theo Chow, Mario D’Onghia, Lorenz Linhardt, Zeliang Kan, Daniel Arp, Lorenzo Cavallaro, and Fabio Pierazzi. Breaking out from the tesseract: Reassessing ml-based malware detection under spatio-temporal drift.arXiv preprint arXiv:2506.23814, 2025

  20. [20]

    Androguard: Reverse engineering, malware and goodware analysis of android applications.https://github.com/androguard/androguard, 2011

    Anthony Desnos. Androguard: Reverse engineering, malware and goodware analysis of android applications.https://github.com/androguard/androguard, 2011

  21. [21]

    Anoshift: A distribution shift benchmark for unsupervised anomaly detection.Advances in Neural Information Processing Systems (NeurIPS), 2022

    Marius Dragoi, Elena Burceanu, Emanuela Haller, Andrei Manolache, and Florin Brad. Anoshift: A distribution shift benchmark for unsupervised anomaly detection.Advances in Neural Information Processing Systems (NeurIPS), 2022

  22. [22]

    Taintdroid: an information- flow tracking system for realtime privacy monitoring on smartphones.ACM Transactions on Computer Systems (TOCS), 2014

    William Enck, Peter Gilbert, Seungyeop Han, Vasant Tendulkar, Byung-Gon Chun, Landon P Cox, Jaeyeon Jung, Patrick McDaniel, and Anmol N Sheth. Taintdroid: an information- flow tracking system for realtime privacy monitoring on smartphones.ACM Transactions on Computer Systems (TOCS), 2014

  23. [23]

    Threat landscape

    European Union Agency for Cybersecurity (ENISA). Threat landscape. https://www. enisa.europa.eu/topics/cyber-threats/threat-landscape, 2025. URL https: //www.enisa.europa.eu/topics/cyber-threats/threat-landscape. Accessed 2025- 11-27

  24. [24]

    Automated api-usage update for android apps

    Mattia Fazzini, Qi Xin, and Alessandro Orso. Automated api-usage update for android apps. InProceedings of the 28th ACM SIGSOFT international symposium on software testing and analysis (SIGSOFT), 2019

  25. [25]

    Fedmultimodal: A benchmark for multimodal federated learning

    Tiantian Feng, Digbalay Bose, Tuo Zhang, Rajat Hebbar, Anil Ramakrishna, Rahul Gupta, Mi Zhang, Salman Avestimehr, and Shrikanth Narayanan. Fedmultimodal: A benchmark for multimodal federated learning. InACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2023

  26. [26]

    Measuring nominal scale agreement among many raters.American Psycholog- ical Association Psychological bulletin, 1971

    Joseph L Fleiss. Measuring nominal scale agreement among many raters.American Psycholog- ical Association Psychological bulletin, 1971

  27. [27]

    A multimodal approach for human activity recognition based on skeleton and rgb data.Pattern Recognition Letters, 2020

    Annalisa Franco, Antonio Magnani, and Dario Maio. A multimodal approach for human activity recognition based on skeleton and rgb data.Pattern Recognition Letters, 2020

  28. [28]

    A large-scale database for graph representation learning.Advances in Neural Information Processing Systems (NeurIPS), 2021

    Scott Freitas, Yuxiao Dong, Joshua Neil, and Duen Horng Chau. A large-scale database for graph representation learning.Advances in Neural Information Processing Systems (NeurIPS), 2021

  29. [29]

    Malnet: A large-scale image database of malicious software

    Scott Freitas, Rahul Duggal, and Duen Horng Chau. Malnet: A large-scale image database of malicious software. InACM International Conference on Information & Knowledge Manage- ment (CIKM), 2022

  30. [30]

    A compre- hensive study of learning-based android malware detectors under challenging environments

    Cuiying Gao, Gaozhun Huang, Heng Li, Bang Wu, Yueming Wu, and Wei Yuan. A compre- hensive study of learning-based android malware detectors under challenging environments. In IEEE/ACM International Conference on Software Engineering (ICSE), 2024

  31. [31]

    Inductive representation learning on large graphs.Advances in Neural Information Processing Systems (NeurIPS), 2017

    Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs.Advances in Neural Information Processing Systems (NeurIPS), 2017. 11

  32. [32]

    LAMDA: A longitudinal android malware benchmark for concept drift analysis

    Md Ahsanul Haque, Ismail Hossain, Md Mahmuduzzaman Kamol, Md Jahangir Alam, Suresh Kumar Amalapuram, Sajedul Talukder, and Mohammad Saidur Rahman. LAMDA: A longitudinal android malware benchmark for concept drift analysis. InInternational Conference on Learning Representations (ICLR), 2026

  33. [33]

    Ur-funny: A multimodal language dataset for understanding humor

    Md Kamrul Hasan, Wasifur Rahman, AmirAli Bagher Zadeh, Jianyuan Zhong, Md Iftekhar Tanveer, Louis-Philippe Morency, and Mohammed Ehsan Hoque. Ur-funny: A multimodal language dataset for understanding humor. InConference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019

  34. [34]

    Efficient query-based attack against ml- based android malware detection under zero knowledge setting

    Ping He, Yifan Xia, Xuhong Zhang, and Shouling Ji. Efficient query-based attack against ml- based android malware detection under zero knowledge setting. InACM SIGSAC Conference on Computer and Communications Security (CCS), 2023

  35. [35]

    Msdroid: Identifying malicious snippets for android malware detection.IEEE Transactions on Dependable and Secure Computing (TDSC), 2023

    Yiling He, Yiping Liu, Lei Wu, Ziqi Yang, Kui Ren, and Zhan Qin. Msdroid: Identifying malicious snippets for android malware detection.IEEE Transactions on Dependable and Secure Computing (TDSC), 2023

  36. [36]

    Hindroid: An intelligent android malware detection system based on structured heterogeneous information network

    Shifu Hou, Yanfang Ye, Yangqiu Song, and Melih Abdulhayoglu. Hindroid: An intelligent android malware detection system based on structured heterogeneous information network. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017

  37. [37]

    An invariant form for the prior probability in estimation problems.Proceedings of the Royal Society of London

    Harold Jeffreys. An invariant form for the prior probability in estimation problems.Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 1946

  38. [38]

    Mimic-iii, a freely accessible critical care database.Scientific data, 2016

    Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database.Scientific data, 2016

  39. [39]

    EMBER2024–A Benchmark Dataset for Holistic Evaluation of Malware Classifiers.arXiv preprint arXiv:2506.05074, 2025

    Robert J Joyce, Gideon Miller, Phil Roth, Richard Zak, Elliott Zaresky-Williams, Hyrum Anderson, Edward Raff, and James Holt. EMBER2024–A Benchmark Dataset for Holistic Evaluation of Malware Classifiers.arXiv preprint arXiv:2506.05074, 2025

  40. [40]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya- narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950, 2017

  41. [41]

    Lightgbm: A highly efficient gradient boosting decision tree.Advances in Neural Information Processing Systems (NeurIPS), 2017

    Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree.Advances in Neural Information Processing Systems (NeurIPS), 2017

  42. [42]

    A multimodal deep learning method for android malware detection using various features.IEEE Transactions on Information Forensics and Security (TIFS), 2018

    TaeGuen Kim, BooJoong Kang, Mina Rho, Sakir Sezer, and Eul Gyu Im. A multimodal deep learning method for android malware detection using various features.IEEE Transactions on Information Forensics and Security (TIFS), 2018

  43. [43]

    The droid is in the details: Environment-aware evasion of android sandboxes

    Brian Kondracki, Babak Amin Azad, Najmeh Miramirkhani, and Nick Nikiforakis. The droid is in the details: Environment-aware evasion of android sandboxes. InNetwork and Distributed System Security Symposium (NDSS), 2022

  44. [44]

    Feature shift detection: Localizing which features have shifted via conditional distribution tests.Advances in Neural Information Processing Systems (NeurIPS), 2020

    Sean Kulinski, Saurabh Bagchi, and David I Inouye. Feature shift detection: Localizing which features have shifted via conditional distribution tests.Advances in Neural Information Processing Systems (NeurIPS), 2020

  45. [45]

    Darkgate malware exploits samba file shares in short-lived cam- paign.The Hacker News, 2024

    Ravie Lakshmanan. Darkgate malware exploits samba file shares in short-lived cam- paign.The Hacker News, 2024. URL https://thehackernews.com/2024/07/ darkgate-malware-exploits-samba-file.html

  46. [46]

    Toward devel- oping a systematic approach to generate benchmark android malware datasets and classification

    Arash Habibi Lashkari, Andi Fitriah A Kadir, Laya Taheri, and Ali A Ghorbani. Toward devel- oping a systematic approach to generate benchmark android malware datasets and classification. In2018 International Carnahan conference on security technology (ICCST), 2018. 12

  47. [47]

    Multi- modal sensor fusion with differentiable filters

    Michelle A Lee, Brent Yi, Roberto Martín-Martín, Silvio Savarese, and Jeannette Bohg. Multi- modal sensor fusion with differentiable filters. In2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020

  48. [48]

    Making sense of vision and touch: Learning multimodal representations for contact-rich tasks.IEEE Transactions on Robotics (T-RO), 2020

    Michelle A Lee, Yuke Zhu, Peter Zachares, Matthew Tan, Krishnan Srinivasan, Silvio Savarese, Li Fei-Fei, Animesh Garg, and Jeannette Bohg. Making sense of vision and touch: Learning multimodal representations for contact-rich tasks.IEEE Transactions on Robotics (T-RO), 2020

  49. [49]

    Tao Lei, Zhan Qin, Zhibo Wang, Qi Li, and Dengpan Ye. Evedroid: Event-aware android malware detection against model degrading for iot devices. IEEE Internet of Things Journal (IoT-J), 2019

  50. [50]

    Luis A Leiva, Asutosh Hota, and Antti Oulasvirta. Enrico: A dataset for topic modeling of mobile ui designs. In International Conference on Human-Computer Interaction with Mobile Devices and Services, 2020

  51. [51]

    Adrian Shuai Li, Arun Iyengar, Ashish Kundu, and Elisa Bertino. Revisiting concept drift in windows malware detection: Adaptation to real drifted malware with minimal samples. In Network and Distributed System Security Symposium (NDSS), 2025

  52. [52]

    Heng Li, Shiyao Zhou, Wei Yuan, Xiapu Luo, Cuiying Gao, and Shuiyan Chen. Robust android malware detection against adversarial example attacks. In Web Conference (WWW), 2021

  53. [53]

    Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A Lee, Yuke Zhu, et al. Multibench: Multiscale benchmarks for multimodal representation learning. Advances in Neural Information Processing Systems (NeurIPS), 2021

  54. [54]

    Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng, Louis-Philippe Morency, and Ruslan Salakhutdinov. Multizoo and multibench: A standardized toolkit for multimodal deep learning. Journal of Machine Learning Research, 2023

  55. [55]

    Jiahao Liu, Jun Zeng, Fabio Pierazzi, Ziqi Yang, Lorenzo Cavallaro, and Zhenkai Liang. Unraveling the key of machine learning-based android malware detection. ACM Transactions on Software Engineering and Methodology (TOSEM), 2026

  56. [56]

    Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  57. [57]

    Samaneh Mahdavifar, Andi Fitriah Abdul Kadir, Rasool Fatemi, Dima Alhadidi, and Ali A Ghorbani. Dynamic android malware category classification using semi-supervised deep learning. In 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on ...

  58. [58]

    Enrico Mariconti, Lucky Onwuzurike, Panagiotis Andriotis, Emiliano De Cristofaro, Gordon Ross, and Gianluca Stringhini. Mamadroid: Detecting android malware by building markov chains of behavioral models. In Network and Distributed System Security Symposium (NDSS), 2017

  59. [59]

    Dung Thuy Nguyen, Ngoc N Tran, Taylor T Johnson, and Kevin Leach. Pbp: Post-training backdoor purification for malware classifiers. arXiv preprint arXiv:2412.03441, 2024

  60. [60]

    Fernando Ortega and Vishnu Pratapagiri. Your mobile app, their playground: The dark side of virtualization, 2025. URL https://zimperium.com/blog/your-mobile-app-their-playground-the-dark-side-of-the-virtualization

  61. [61]

    Jimin Park, AHyun Ji, Minji Park, Mohammad Saidur Rahman, and Se Eun Oh. MalCL: Leveraging gan-based generative replay to combat catastrophic forgetting in malware classification. In AAAI Conference on Artificial Intelligence, 2025

  62. [62]

    Feargus Pendlebury, Fabio Pierazzi, Roberto Jordaney, Johannes Kinder, and Lorenzo Cavallaro. TESSERACT: Eliminating experimental bias in malware classification across space and time. In USENIX Security Symposium, 2019

  63. [63]

    Juan-Manuel Pérez-Rúa, Valentin Vielzeuf, Stéphane Pateux, Moez Baccouche, and Frédéric Jurie. Mfas: Multimodal fusion architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  64. [64]

    Mohammad Saidur Rahman, Scott E. Coull, and Matthew Wright. On the Limitations of Continual Learning for Malware Classification. In Conference on Lifelong Learning Agents (CoLLAs), 2022

  65. [65]

    Mohammad Saidur Rahman, Scott Coull, Qi Yu, and Matthew Wright. MADAR: Efficient continual learning for malware analysis with distribution-aware replay. In Conference on Applied Machine Learning in Information Security (CAMLIS), 2025

  66. [66]

    Vaibhav Rastogi, Yan Chen, and Xuxian Jiang. Droidchameleon: Evaluating android anti-malware against transformation attacks. In ACM SIGSAC Conference on Computer and Communications Security (CCS), 2013

  67. [67]

    David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Neural Information Processing Systems (NeurIPS), 2019

  68. [68]

    Silvia Sebastián and Juan Caballero. Avclass2: Massive malware tag extraction from av labels. In Annual Computer Security Applications Conference (ACSAC), 2020

  69. [69]

    PV Shijo and A Salim. Integrated static and dynamic analysis for malware detection. Procedia Computer Science, 2015

  70. [70]

    Hossein Shokouhinejad, Roozbeh Razavi-Far, Hesamodin Mohammadian, Mahdi Rabbani, Samuel Ansong, Griffin Higgins, and Ali A Ghorbani. Recent advances in malware detection: Graph learning and explainability. arXiv preprint arXiv:2502.10556, 2025

  71. [71]

    Nedim Šrndić and Pavel Laskov. Detection of malicious pdf files based on hierarchical document structure. In Network and Distributed System Security Symposium (NDSS), 2013

  72. [72]

    Kimberly Tam, Salahuddin J Khan, Aristide Fattori, and Lorenzo Cavallaro. Copperdroid: automatic reconstruction of android malware behaviors. In Network and Distributed System Security Symposium (NDSS), 2015

  73. [73]

    Gido M van de Ven, Nicholas Soures, and Dhireesha Kudithipudi. Continual learning and catastrophic forgetting. arXiv preprint arXiv:2403.05175, 2024

  74. [74]

    Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, and Frédéric Jurie. Centralnet: a multi-layer approach for multimodal fusion. In European Conference on Computer Vision (ECCV) workshops, 2018

  75. [75]

    VirusTotal. VirusTotal — virustotal.com. https://www.virustotal.com/gui/intelligence-overview. [Accessed 21-10-2025]

  76. [76]

    VirusTotal. VirusTotal – Stats, 2025. https://www.virustotal.com/gui/stats

  77. [77]

    Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In International Conference on Machine Learning (ICML), 2009

  78. [78]

    Bozhi Wu, Sen Chen, Cuiyun Gao, Lingling Fan, Yang Liu, Weiping Wen, and Michael R. Lyu. Why an android app is classified as malware: Toward malware classification interpretation. ACM Transactions on Software Engineering and Methodology (TOSEM), 2021

  79. [79]

    Yueming Wu, Xiaodi Li, Deqing Zou, Wei Yang, Xin Zhang, and Hai Jin. Malscan: Fast market-wide mobile malware scanning by social-network centrality analysis. In IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2019

  80. [80]

    Yueming Wu, Deqing Zou, Wei Yang, Xiang Li, and Hai Jin. Homdroid: detecting android covert malware by social-network homophily analysis. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2021

Showing first 80 references.