Image-Based Malware Type Classification on MalNet-Image Tiny: Effects of Multi-Scale Fusion, Transfer Learning, Data Augmentation, and Schedule-Free Optimization
Pith reviewed 2026-05-09 23:36 UTC · model grok-4.3
The pith
ImageNet pretraining and data augmentation lift macro-F1 to 0.6927 on 43-class malware image classification with ResNet18 and an FPN.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that the configuration combining ImageNet pretraining, Mixup, TrivialAugment, and a feature pyramid network on ResNet18 achieves the best test metrics on MalNet-Image Tiny, with pretraining and augmentation accounting for most of the macro-F1 gain over the reproduced baseline of 0.6510.
What carries the argument
Feature pyramid network (FPN) for multi-scale fusion attached to ResNet18 to address scale variation from resizing binaries of unequal lengths.
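The paper does not spell out its exact FPN wiring, but a standard FPN top-down pass projects each backbone stage to a common channel width with 1x1 "lateral" convolutions and adds in the upsampled coarser map. A minimal NumPy sketch with ResNet18-like stage shapes; the channel counts, helper names, and random weights are illustrative assumptions, not the authors' code:

```python
import numpy as np

def lateral(x, w):
    # 1x1 convolution as a channel-mixing matmul: (C_in, H, W) -> (C_out, H, W)
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def upsample2x(x):
    # nearest-neighbour 2x spatial upsampling
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(c3, c4, c5, weights):
    """Minimal FPN top-down pass over three backbone stages.

    c3, c4, c5: feature maps at strides 8/16/32 (channels grow with depth).
    weights: 1x1 lateral projection matrices to a shared channel width.
    """
    p5 = lateral(c5, weights["w5"])
    p4 = lateral(c4, weights["w4"]) + upsample2x(p5)
    p3 = lateral(c3, weights["w3"]) + upsample2x(p4)
    return p3, p4, p5

rng = np.random.default_rng(0)
# ResNet18-like stage outputs for a small input crop
c3 = rng.standard_normal((128, 4, 4))
c4 = rng.standard_normal((256, 2, 2))
c5 = rng.standard_normal((512, 1, 1))
weights = {f"w{i}": rng.standard_normal((64, c)) * 0.01
           for i, c in [(3, 128), (4, 256), (5, 512)]}
p3, p4, p5 = fpn_topdown(c3, c4, c5, weights)
print(p3.shape, p4.shape, p5.shape)  # all three maps share 64 channels
```

The point of the fusion is that each output level mixes its own resolution with coarser context, which is the mechanism invoked for handling scale variation from resizing binaries of unequal lengths.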
If this is right
- ImageNet pretraining supplies the largest single lift in macro-F1 for this 43-class task.
- Mixup and TrivialAugment improve robustness with low overhead on binary-derived images.
- Schedule-free AdamW reaches near-baseline performance in 10 epochs instead of 96.
- Adding the FPN mainly raises macro-precision and macro-AUC and lowers test loss once pretraining and augmentation are present.
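Schedule-free optimization (Defazio et al., reference [17]) replaces the learning-rate schedule with an interpolation-and-averaging scheme; the paper uses the AdamW variant, but the mechanism is easiest to see in the SGD form. A minimal sketch under that simplification (function names and the toy objective are illustrative):

```python
import numpy as np

def schedule_free_sgd(grad, z0, lr=0.2, beta=0.9, steps=1000):
    """Schedule-free SGD (Defazio et al., simplified): gradients are
    evaluated at an interpolation y between the base iterate z and its
    running average x; the returned model x is a uniform average of the
    z iterates, so no decay schedule or stopping horizon is needed."""
    z = np.asarray(z0, dtype=float)
    x = z.copy()
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x   # gradient evaluation point
        z = z - lr * grad(y)            # plain SGD step on z
        x = x + (z - x) / t             # uniform average of z_1..z_t
    return x

# Check on a toy quadratic f(w) = 0.5 * (w - 3)^2, gradient w - 3
w = schedule_free_sgd(lambda w: w - 3.0, z0=np.array([0.0]))
print(w)  # approaches the minimizer at 3
```

Because the averaged iterate is usable at any step, a run can be stopped at 10 epochs without the tail-end decay a cosine or step schedule would require, which is the efficiency argument the bullet above summarizes.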
Where Pith is reading between the lines
- The same four components could be tested on larger binary-image malware collections to check whether the observed ranking persists.
- Faster convergence from schedule-free optimization may cut compute costs when scanning new Android APK collections.
- Prioritizing pretraining and simple augmentations first could serve as a practical recipe for adapting image classifiers to new malware families.
Load-bearing premise
The performance gains from pretraining, Mixup, TrivialAugment and FPN will hold on other malware image datasets or under real-world distribution shifts beyond the fixed MalNet-Image Tiny split.
What would settle it
Re-run the exact same four configurations on a different public malware image dataset; if the reported best setup no longer exceeds the plain baseline there, the gains do not generalize.
Original abstract
This paper studies 43-class malware type classification on MalNet-Image Tiny, a public benchmark derived from Android APK files. The goal is to assess whether a compact image classifier benefits from four components evaluated in a controlled ablation: a feature pyramid network (FPN) for scale variation induced by resizing binaries of different lengths, ImageNet pretraining, lightweight augmentation through Mixup and TrivialAugment, and schedule-free AdamW optimization. All experiments use a ResNet18 backbone and the provided train/validation/test split. Reproducing the benchmark-style configuration yields macro-F1 (F1_macro) of 0.6510, consistent with the reported baseline of approximately 0.65. Replacing the optimizer with schedule-free AdamW and using unweighted cross-entropy increases F1_macro to 0.6535 in 10 epochs, compared with 96 epochs for the reproduced baseline. The best configuration combines pretraining, Mixup, TrivialAugment, and FPN, reaching F1_macro=0.6927, P_macro=0.7707, AUC_macro=0.9556, and L_test=0.8536. The ablation indicates that the largest gains in F1_macro arise from pretraining and augmentation, whereas FPN mainly improves P_macro, AUC_macro, and L_test in the strongest configuration.
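Since every headline number here is macro-averaged, it is worth being explicit about what that means: per-class scores are averaged with equal weight, so rare malware types in the 43-class split count as much as common ones. A minimal sketch of F1_macro on a toy example (in practice scikit-learn's `f1_score(..., average="macro")` computes the same quantity):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Macro-averaged F1: per-class F1 scores averaged with equal
    weight, so rare classes count as much as common ones."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

# Toy 3-class example (the paper's task has 43 classes)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(round(macro_f1(y_true, y_pred, 3), 4))  # → 0.6556
```

P_macro and AUC_macro follow the same per-class-then-average pattern with precision and one-vs-rest ROC AUC in place of F1.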
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper conducts a controlled ablation study on 43-class malware type classification on the MalNet-Image Tiny benchmark using a ResNet18 backbone. It evaluates four components: feature pyramid network (FPN) for multi-scale fusion to handle variable binary lengths, ImageNet pretraining, lightweight data augmentation (Mixup + TrivialAugment), and schedule-free AdamW optimization. The reproduced baseline reaches F1_macro=0.6510; the best configuration (pretraining + augmentations + FPN) reaches F1_macro=0.6927, P_macro=0.7707, AUC_macro=0.9556, with pretraining and augmentation identified as the primary sources of F1 gains.
Significance. If the results prove robust, the work supplies a useful, reproducible empirical reference for applying standard computer-vision techniques to malware image classification. The precise baseline reproduction, the focus on schedule-free optimization for faster convergence, and the explicit attribution of gains to individual components are positive contributions that could guide efficient training on similar fixed-split benchmarks.
major comments (1)
- The ablation results and headline claims rest on single training runs on the fixed train/val/test split, with no reported standard deviations across random seeds, no error bars, and no statistical significance tests on the observed deltas (e.g., F1_macro improvement of 0.0417). Given that ResNet18 training with cross-entropy, Mixup, and TrivialAugment exhibits run-to-run variance that can easily exceed 0.03–0.05 in F1_macro, the attribution of largest gains to pretraining and augmentation cannot be considered reliably demonstrated.
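The multi-seed reporting the referee asks for amounts to the following sketch; every per-seed value below is hypothetical (only 0.6510 and 0.6927 are numbers the paper actually reports, as single runs):

```python
import numpy as np

def summarize(runs):
    """Mean and sample standard deviation of a metric over seeds."""
    a = np.asarray(runs, dtype=float)
    return a.mean(), a.std(ddof=1)

def welch_t(a, b):
    """Welch's t statistic for the difference between two run sets."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return (b.mean() - a.mean()) / np.sqrt(
        a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)

# Hypothetical per-seed F1_macro values; only the first entry of each
# list matches a value reported in the paper.
baseline = [0.6510, 0.6463, 0.6581, 0.6498, 0.6544]
best = [0.6927, 0.6871, 0.6990, 0.6905, 0.6948]
for name, runs in (("baseline", baseline), ("best", best)):
    m, s = summarize(runs)
    print(f"{name}: F1_macro = {m:.4f} +/- {s:.4f}")
print(f"Welch t = {welch_t(baseline, best):.2f}")
```

Reporting mean, standard deviation, and a significance statistic like this would show whether the 0.0417 delta clears the run-to-run noise floor the referee describes.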
minor comments (1)
- The abstract states that the schedule-free configuration reaches its result in 10 epochs versus 96 for the baseline; the manuscript should explicitly confirm that total wall-clock compute or convergence criteria are comparable before claiming efficiency gains.
Simulated Author's Rebuttal
We thank the referee for highlighting an important aspect of our experimental design. We respond to the major comment as follows.
Point-by-point responses
Referee: The ablation results and headline claims rest on single training runs on the fixed train/val/test split, with no reported standard deviations across random seeds, no error bars, and no statistical significance tests on the observed deltas (e.g., F1_macro improvement of 0.0417). Given that ResNet18 training with cross-entropy, Mixup, and TrivialAugment exhibits run-to-run variance that can easily exceed 0.03–0.05 in F1_macro, the attribution of largest gains to pretraining and augmentation cannot be considered reliably demonstrated.
Authors: We agree that this is a valid criticism and that reporting results from single runs limits the strength of our conclusions regarding the source of performance gains. Although the use of a fixed train/val/test split is standard for this benchmark to ensure direct comparability, the potential for run-to-run variance means that the observed improvements, particularly the 0.0417 increase in F1_macro, should be interpreted with caution. In the revised manuscript, we will rerun the key experiments (baseline, best configuration, and main ablations) using at least five different random seeds, report the mean and standard deviation for all metrics, and include error bars in the relevant tables and figures. This will enable a more robust assessment of the contributions from pretraining and data augmentation. revision: yes
Circularity Check
No circularity in empirical ablation study on fixed dataset split
Full rationale
The paper performs controlled ablation experiments training ResNet18 classifiers on the MalNet-Image Tiny benchmark and reports direct empirical metrics (F1_macro, P_macro, AUC_macro, L_test) measured on the provided held-out test set. All headline numbers (e.g., baseline 0.6510 vs. best configuration 0.6927) are obtained by running the models and computing standard classification scores; no equations, derivations, or fitted parameters are presented as predictions that reduce to their own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the central claims. The work is therefore self-contained as a set of reproducible empirical measurements rather than a deductive chain.
Axiom & Free-Parameter Ledger
free parameters (2)
- initial learning rate and other optimizer hyperparameters
- Mixup alpha and TrivialAugment magnitude
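The Mixup alpha listed here directly controls how aggressively pairs are blended: the mixing weight is drawn as lam ~ Beta(alpha, alpha), so small alpha keeps most samples nearly unmixed while alpha near 1 mixes uniformly. An illustrative sketch (the helper name, class indices, and image sizes are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Small alpha concentrates lam near 0 or 1 (mild mixing); alpha = 1
# gives a uniform lam (aggressive mixing).
fracs = {}
for alpha in (0.1, 0.2, 1.0):
    lam = rng.beta(alpha, alpha, size=100_000)
    fracs[alpha] = float(np.mean((lam > 0.2) & (lam < 0.8)))
    print(f"alpha={alpha}: P(0.2 < lam < 0.8) = {fracs[alpha]:.2f}")

def mixup_pair(x1, y1, x2, y2, alpha, rng):
    """Blend two (image, one-hot label) pairs; training then targets
    the soft label lam*y1 + (1 - lam)*y2."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x1, x2 = rng.random((2, 32, 32))   # two grayscale "binary images"
y1, y2 = np.eye(43)[[3, 17]]       # one-hot labels over 43 malware types
xm, ym = mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=rng)
```

This is why alpha belongs in the free-parameter ledger: it trades label noise against regularization strength and is not learned from data.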
axioms (2)
- domain assumption: The provided train/validation/test split is representative and fixed for all experiments.
- domain assumption: Macro-averaged metrics are the appropriate summary for a 43-class imbalanced problem.
Reference graph
Works this paper leans on
- [1] L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath, "Malware images: visualization and automatic classification," in Proceedings of the 8th International Symposium on Visualization for Cyber Security (VizSec '11). New York, NY, USA: Association for Computing Machinery, Jul. 2011, pp. 1–7. [Online]. Available: https://doi.org/10.1145/2016904.201...
- [2] S. Freitas, R. Duggal, and D. H. Chau, "MalNet: A Large-Scale Image Database of Malicious Software," Sep. 2022, arXiv:2102.01072 [cs]. [Online]. Available: http://arxiv.org/abs/2102.01072
- [3] F. Demirkıran, A. Çayır, U. Ünal, and H. Dağ, "An ensemble of pre-trained transformer models for imbalanced multiclass malware classification," Computers & Security, vol. 121, p. 102846, Oct. 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404822002401
- [4] S. Yesir and I. Sogukpinar, "Malware Detection and Classification Using fastText and BERT," in 2021 9th International Symposium on Digital Forensics and Security (ISDFS), Jun. 2021, pp. 1–6. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9486377
- [5] N. Zhang, J. Xue, Y. Ma, R. Zhang, T. Liang, and Y.-a. Tan, "Hybrid sequence-based Android malware detection using natural language processing," International Journal of Intelligent Systems, vol. 36, no. 10, pp. 5770–5784, 2021. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/int.22529
- [6] H. Jiang, T. Turki, and J. T. L. Wang, "DLGraph: Malware Detection Using Deep Learning and Graph Embedding," in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Dec. 2018, pp. 1029–1033. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8614193
- [7] X. Pei, L. Yu, and S. Tian, "AMalNet: A deep learning framework based on graph convolutional networks for malware detection," Computers & Security, vol. 93, p. 101792, Jun. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404820300778
- [8] H. Gao, S. Cheng, and W. Zhang, "GDroid: Android malware detection and classification with graph convolutional network," Computers & Security, vol. 106, p. 102264, Jul. 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404821000882
- [9] R. Alguliyev, R. Aliguliyev, and L. Sukhostat, "Radon transform based malware classification in cyber-physical system using deep learning," Results in Control and Optimization, vol. 14, p. 100382, Mar. 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666720724000122
- [10] L. Chen, R. Sahita, J. Parikh, and M. Marino, "STAMINA: Scalable deep learning approach for malware classification," Intel and Microsoft, Tech. Rep., 2020. [Online]. Available: https://www.microsoft.com/en-us/research/uploads/prod/2020/05/stamina.pdf
- [11] S. Kumar and K. Panda, "SDIF-CNN: Stacking deep image features using fine-tuned convolution neural network models for real-world malware detection and classification," Applied Soft Computing, vol. 146, p. 110676, Oct. 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1568494623006944
- [12] D. Noever and S. E. M. Noever, "Virus-MNIST: A benchmark malware dataset," arXiv preprint arXiv:2103.00602, 2021. [Online]. Available: https://arxiv.org/abs/2103.00602
- [13] A. Khan, M. Usama, B. B. Kamal, A. Ahmad, H. Malik, and S. Lee, "AndroDex: Android DEX images of obfuscated malware," Scientific Data, 2024. [Online]. Available: https://www.nature.com/articles/s41597-024-03027-3
- [14] McAfee Labs, "McAfee dataset for malware detection," 2020. [Online]. Available: https://www.mcafee.com/enterprise/en-us/assets/white-papers/wp-machine-learning-malware-detection.pdf
- [15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- [16] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" Advances in Neural Information Processing Systems, vol. 27, pp. 3320–3328, 2014.
- [17] A. Defazio, X. Yang, H. Mehta, K. Mishchenko, A. Khaled, and A. Cutkosky, "The Road Less Scheduled," May 2024, arXiv:2405.15682 [cs, math, stat]. [Online]. Available: http://arxiv.org/abs/2405.15682
- [18] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond Empirical Risk Minimization," Apr. 2018, arXiv:1710.09412 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1710.09412
- [19] S. G. Müller and F. Hutter, "TrivialAugment: Tuning-free yet state-of-the-art data augmentation," arXiv preprint arXiv:2103.10158, 2021.