ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection
Pith reviewed 2026-06-27 06:30 UTC · model grok-4.3
The pith
ViPER conditions malware predictions on inferred packing state via a dual-head vision model to handle packed executables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViPER builds on a LoRA-adapted ViT-B/14 backbone with a dual-head architecture that jointly learns malware classification and packing detection. A packing-aware gating mechanism conditions malware predictions on the inferred packing state, enabling distinct decision boundaries for packed and unpacked inputs. To address packing label skew during training, it employs frequency-weighted losses with stratified sampling over joint class-packing strata.
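One possible reading of "frequency-weighted losses with stratified sampling over joint class-packing strata" is sketched below. This is an illustration under stated assumptions, not the authors' implementation: the inverse-frequency formula, the function names `stratum_weights` and `stratified_batch`, and the toy label counts are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratum_weights(labels):
    """Inverse-frequency weight per (is_malware, is_packed) stratum.

    Assumed formula: w_s = N / (K * n_s), with N samples, K non-empty
    strata, and n_s samples in stratum s.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {s: n / (k * c) for s, c in counts.items()}

def stratified_batch(labels, batch_size, rng):
    """Draw indices so every joint stratum contributes equally (with replacement)."""
    by_stratum = defaultdict(list)
    for i, s in enumerate(labels):
        by_stratum[s].append(i)
    strata = sorted(by_stratum)
    per_stratum = batch_size // len(strata)
    batch = []
    for s in strata:
        batch.extend(rng.choices(by_stratum[s], k=per_stratum))
    rng.shuffle(batch)
    return batch

# Toy labels (hypothetical): (is_malware, is_packed), skewed toward unpacked benign.
labels = [(0, 0)] * 60 + [(0, 1)] * 10 + [(1, 0)] * 20 + [(1, 1)] * 10
weights = stratum_weights(labels)
batch = stratified_batch(labels, batch_size=16, rng=random.Random(0))
```

With this weighting, each stratum carries the same total weight in the loss, and each sampled batch contains the same number of items from every stratum, so rare combinations such as packed benign files are not drowned out.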
What carries the argument
The packing-aware gating mechanism that routes malware classification through the output of a parallel packing-detection head.
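The summary does not specify the functional form of the gate. As a minimal sketch, assuming a soft gate that mixes two malware branches by the inferred packing probability, the routing could look like the following. The function name `gated_malware_prob` and the two-branch layout are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_malware_prob(packing_logit, logit_if_packed, logit_if_unpacked):
    """Mix two malware decision branches by the inferred packing probability.

    Assumed form: p(mal) = p(packed) * p(mal | packed branch)
                         + (1 - p(packed)) * p(mal | unpacked branch)
    """
    p_packed = sigmoid(packing_logit)
    p_mal_packed = sigmoid(logit_if_packed)
    p_mal_unpacked = sigmoid(logit_if_unpacked)
    return p_packed * p_mal_packed + (1.0 - p_packed) * p_mal_unpacked

# Same branch logits, three different packing beliefs.
p_hi = gated_malware_prob(8.0, 2.0, -2.0)    # confidently packed: follows packed branch
p_lo = gated_malware_prob(-8.0, 2.0, -2.0)   # confidently unpacked: follows unpacked branch
p_mid = gated_malware_prob(0.0, 2.0, -2.0)   # undecided: average of both branches
```

The sketch also makes the failure mode concrete: if the packing logit is wrong, the output follows the wrong branch even when both malware branches are individually well calibrated.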
If this is right
- The model reaches a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279 on 200,000 Windows PE byteplot images.
- It outperforms representative state-of-the-art baselines on all primary malware-detection metrics.
- Packing detection reaches an AUC of 0.9949.
- Frequency-weighted losses combined with stratified sampling over joint class-packing strata mitigate training skew.
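For readers less familiar with the metrics listed above, the sketch below computes balanced accuracy and ROC-AUC from their standard definitions. The labels and scores are toy values chosen for illustration, not data from the paper, and the 0.5 decision threshold is an assumption.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 3 malware, 5 benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

ba = balanced_accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Balanced accuracy weights both classes equally regardless of their frequency, which is why it is a natural headline metric when class or packing labels are skewed.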
Where Pith is reading between the lines
- The same dual-head plus gating pattern could be tested on other binary obfuscation techniques that produce high-entropy images.
- Extending the approach to additional file formats would require only retraining the heads on new byteplot distributions.
- Combining the packing signal with lightweight static features might further stabilize performance on edge cases.
Load-bearing premise
Packing state can be accurately inferred from byteplot images at inference time and used to condition malware predictions without introducing systematic errors on real-world packed samples whose packing labels were not seen during training.
What would settle it
ViPER's malware-detection metrics falling below those of non-packing-aware baselines on a test set of packed malware and packed benign samples whose packers are absent from the training distribution.
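Such a test requires a split in which certain packers never appear in training. A minimal sketch of a leave-packers-out split follows; the sample schema, the `leave_packers_out` helper, and the placeholder packer names are hypothetical and not taken from the paper.

```python
def leave_packers_out(samples, held_out):
    """Split samples so that no held-out packer appears in the training set.

    Each sample is a dict with a 'packer' key (None for unpacked files).
    Unpacked files stay in training in this simplified version.
    """
    held_out = set(held_out)
    train = [s for s in samples if s["packer"] not in held_out]
    test = [s for s in samples if s["packer"] in held_out]
    return train, test

# Hypothetical samples; packer names are placeholders.
samples = [
    {"id": 0, "packer": None, "malware": 0},
    {"id": 1, "packer": "packer_A", "malware": 1},
    {"id": 2, "packer": "packer_A", "malware": 0},
    {"id": 3, "packer": "packer_B", "malware": 1},
    {"id": 4, "packer": "packer_B", "malware": 0},
    {"id": 5, "packer": None, "malware": 1},
]
train, test = leave_packers_out(samples, held_out=["packer_B"])
```

The held-out set should contain both benign and malicious files for each unseen packer, as in the toy data, so that the packing signal alone cannot separate the classes.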
Original abstract
Visualization-based malware detection maps raw binary bytes to grayscale images and applies learned visual classifiers, providing an evasion-resistant and disassembly-free alternative to conventional analysis pipelines. However, executable packing remains a critical failure mode: packed binaries produce high-entropy images that obscure the structural patterns these models rely on. Because packing is also prevalent in benign software (e.g., for compression or copy protection), packing state alone is not a reliable indicator of maliciousness, and existing approaches do not address this challenge within a unified supervised framework. We present ViPER, a Vision-based Packing-Aware Encoder for Robust malware detection. ViPER builds on a LoRA-adapted ViT-B/14 backbone with a dual-head architecture that jointly learns malware classification and packing detection. A packing-aware gating mechanism conditions malware predictions on the inferred packing state, enabling distinct decision boundaries for packed and unpacked inputs. To address packing label skew during training, we employ frequency-weighted losses with stratified sampling over joint class-packing strata. Evaluated on 200,000 Windows PE byteplot images, ViPER achieves a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279, outperforming representative state-of-the-art baselines across all primary metrics, while attaining a packing detection AUC of 0.9949.
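The byteplot mapping and the high-entropy effect of packing described in the abstract can be illustrated with a short sketch. The row width of 256, the helper names, and the two synthetic byte strings are assumptions; the abstract does not state the image layout the authors use.

```python
import math
import os
from collections import Counter

def byteplot(data, width=256):
    """Reshape bytes into rows of `width` grayscale pixels (0-255), zero-padding the last row."""
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

def shannon_entropy(data):
    """Entropy in bits per byte; 8.0 is the maximum for byte data."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Synthetic stand-ins, not real executables.
structured = b"\x00" * 2048 + b"MZ" * 1024   # repetitive, low-entropy content
random_like = os.urandom(4096)                # proxy for packed or compressed content

image = byteplot(structured)
h_structured = shannon_entropy(structured)
h_random = shannon_entropy(random_like)
```

The structured buffer has an entropy of 1.5 bits per byte, while the random buffer sits close to the 8-bit ceiling. A byteplot of the latter looks like uniform noise, which is the structure-obscuring effect the abstract attributes to packing.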
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents ViPER, a LoRA-adapted ViT-B/14 model with a dual-head architecture for joint malware classification and packing detection on Windows PE byteplot images. It introduces a packing-aware gating mechanism that conditions the malware head on the inferred packing state, employs frequency-weighted losses and stratified sampling over joint malware/packing strata to handle skew, and reports balanced accuracy of 0.8521, ROC-AUC of 0.9260, AUPR of 0.9279, and packing detection AUC of 0.9949 on a 200,000-image dataset, outperforming representative baselines.
Significance. If the packing head generalizes and the gating mechanism operates without systematic bias on real-world inputs, the approach would address a key limitation of visualization-based malware detectors by explicitly modeling packing state rather than treating it as noise. The dual-head design with explicit conditioning is a targeted contribution to handling a prevalent failure mode in the domain.
major comments (2)
- [Abstract] Abstract and evaluation description: the central robustness claim rests on the packing-aware gating mechanism, yet no results are reported for the packing head or the gated malware metrics on packers absent from the training strata. Stratified sampling over joint class-packing strata presupposes that the test-time packing distribution matches training, but there is no ablation comparing gated predictions against oracle packing state or evaluating held-out packers. This leaves open the risk that packing mispredictions route the malware head to an incorrect decision boundary.
- [Abstract] Abstract: the reported outperformance (balanced accuracy 0.8521, ROC-AUC 0.9260) is presented without any information on baseline implementations, hyperparameter matching, or statistical significance testing, which is load-bearing for the claim that the dual-head and gating design drives the gains rather than implementation differences.
minor comments (2)
- The abstract refers to 'representative state-of-the-art baselines' without naming the methods or citing their sources; the main text should explicitly list and reference them.
- Details on the exact LoRA rank, scaling factor, and loss weighting coefficients are mentioned as free parameters but not reported numerically; these should be included for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on the evaluation setup and indicate planned revisions where appropriate.
-
Referee: [Abstract] Abstract and evaluation description: the central robustness claim rests on the packing-aware gating mechanism, yet no results are reported for the packing head or the gated malware metrics on packers absent from the training strata. Stratified sampling over joint class-packing strata presupposes that the test-time packing distribution matches training, but there is no ablation comparing gated predictions against oracle packing state or evaluating held-out packers. This leaves open the risk that packing mispredictions route the malware head to an incorrect decision boundary.
Authors: We agree that the reported results rely on test data drawn, via stratified sampling, from the same joint malware-packing distribution as training, and we do not provide separate metrics for the packing head or gated malware classification on packers entirely absent from the training strata. The packing head reaches 0.9949 AUC on the packers observed in the dataset, and the gating is trained end-to-end to condition the malware head. No oracle-gating or held-out-packer ablations appear in the manuscript because the primary focus is the joint supervised framework under realistic skew rather than explicit out-of-distribution (OOD) packer evaluation. This is a genuine limitation for claims of robustness to novel packers. We will add an explicit discussion of this scope limitation in the revised manuscript and, resources permitting, include a small held-out packer experiment.
Revision: partial
-
Referee: [Abstract] Abstract: the reported outperformance (balanced accuracy 0.8521, ROC-AUC 0.9260) is presented without any information on baseline implementations, hyperparameter matching, or statistical significance testing, which is load-bearing for the claim that the dual-head and gating design drives the gains rather than implementation differences.
Authors: The abstract is a concise summary; the full manuscript (Section 4 and Appendix) specifies that baselines were reimplemented from official repositories or papers, hyperparameters were matched to the originals where possible, and all metrics are reported as means over five random seeds with standard deviations. Statistical significance between ViPER and baselines was evaluated with paired t-tests on the per-seed scores. We will revise the abstract to include a brief parenthetical reference to these evaluation details or move the key implementation notes earlier in the evaluation section for clarity.
Revision: yes
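The paired t-test over per-seed scores mentioned in this response can be sketched as follows. The per-seed values are made-up placeholders, not numbers from the manuscript, and `paired_t_statistic` is a hypothetical helper that returns only the test statistic.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic for paired samples; degrees of freedom = len(a) - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Placeholder per-seed balanced accuracies for five seeds (NOT from the paper).
model_scores = [0.80, 0.82, 0.79, 0.81, 0.83]
baseline_scores = [0.78, 0.81, 0.77, 0.80, 0.80]

t = paired_t_statistic(model_scores, baseline_scores)
```

With five seeds there are four degrees of freedom, so the two-sided 5% critical value is about 2.776. A library routine such as `scipy.stats.ttest_rel` would report the p-value directly.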
Circularity Check
No circularity: empirical ML results with no self-referential derivations
The paper describes a standard dual-head ViT model with gating and weighted losses, then reports held-out test metrics (balanced accuracy 0.8521, ROC-AUC 0.9260, AUPR 0.9279, packing-detection AUC 0.9949) on 200k byteplot images. No equations, predictions, or first-principles claims reduce by construction to parameters fitted on the same data; the evaluation is a conventional train/test split with no load-bearing self-citation or smuggled ansatz. The skeptic's concern about unseen packers is an external generalization issue, not a circularity in the reported derivation.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA rank and scaling
- loss weighting coefficients
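To show what the first free parameter controls, the sketch below applies the standard LoRA update W = W0 + (alpha / r) * B A to a tiny matrix. The matrix values, `alpha`, and the rank are illustrative only; the paper's actual settings are not reported in the material reviewed here.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w0, a, b, alpha):
    """Effective weight W = W0 + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

# 3x3 frozen weight with a rank-1 adapter (all values illustrative).
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a = [[1.0, 2.0, 3.0]]          # shape r x d_in
b = [[0.5], [0.0], [-0.5]]     # shape d_out x r
w = lora_weight(w0, a, b, alpha=2.0)
```

Only `a` and `b` are trained while `w0` stays frozen, so the rank and the scaling factor together determine how far the adapted backbone can move from its pretrained weights.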
axioms (2)
- domain assumption: Byteplot visualization preserves discriminative structural patterns for both malware classification and packing detection
- domain assumption: Packing state labels are available or accurately obtainable for the training distribution