arxiv: 2606.12949 · v1 · pith:XNXCNZRLnew · submitted 2026-06-11 · 💻 cs.CR · cs.CV

ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection

Pith reviewed 2026-06-27 06:30 UTC · model grok-4.3

classification 💻 cs.CR cs.CV
keywords malware detection · byteplot images · packing detection · vision transformer · Windows PE · dual-head architecture · gating mechanism

The pith

ViPER conditions malware predictions on inferred packing state via a dual-head vision model to handle packed executables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ViPER as a way to make vision-based malware detection work when executables are packed, which creates high-entropy byteplot images that hide the patterns these models normally use. Because packing appears in both benign and malicious software, simply flagging packed files does not solve the problem. ViPER adds a second head to detect packing and then routes the malware decision through a gating step that applies different boundaries depending on the detected packing state. It also uses weighted losses and stratified sampling to manage the uneven distribution of packing labels. A reader would care because this offers a single supervised pipeline that keeps visual detection viable without disassembly or separate packing filters.

Core claim

ViPER builds on a LoRA-adapted ViT-B/14 backbone with a dual-head architecture that jointly learns malware classification and packing detection. A packing-aware gating mechanism conditions malware predictions on the inferred packing state, enabling distinct decision boundaries for packed and unpacked inputs. To address packing label skew during training, it employs frequency-weighted losses with stratified sampling over joint class-packing strata.
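
The skew handling described above can be illustrated with a minimal sketch. It assumes inverse-frequency class weights of the form N / (K * count) and batches drawn evenly from the four (malware, packed) strata; the paper's exact weighting formula and sampling scheme are not given here, so both choices are assumptions. The function names and toy counts are hypothetical.

```python
import random
from collections import Counter, defaultdict

def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def stratified_batch(samples, batch_size, rng):
    """Draw a batch with near-equal counts per joint stratum.

    Each sample is a (sample_id, is_malware, is_packed) tuple; the stratum
    is the (is_malware, is_packed) pair.
    """
    strata = defaultdict(list)
    for s in samples:
        strata[(s[1], s[2])].append(s)
    keys = sorted(strata)
    per_stratum, extra = divmod(batch_size, len(keys))
    batch = []
    for i, key in enumerate(keys):
        take = per_stratum + (1 if i < extra else 0)
        # Sample with replacement so small strata can still fill their quota.
        batch.extend(rng.choices(strata[key], k=take))
    rng.shuffle(batch)
    return batch

# Toy data with hypothetical counts; this is not the paper's dataset.
samples = (
    [(f"m_p{i}", 1, 1) for i in range(76)]
    + [(f"m_u{i}", 1, 0) for i in range(24)]
    + [(f"b_p{i}", 0, 1) for i in range(73)]
    + [(f"b_u{i}", 0, 0) for i in range(27)]
)
packing_weights = inverse_frequency_weights([s[2] for s in samples])
batch = stratified_batch(samples, batch_size=16, rng=random.Random(0))
```

In this sketch the minority packing label (unpacked) receives the larger loss weight, and every batch contains all four strata regardless of how rare a stratum is in the raw data.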

What carries the argument

The packing-aware gating mechanism that routes malware classification through the output of a parallel packing-detection head.
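
One way to picture such a gate is the sketch below. It assumes a soft gate that blends two linear malware heads by the predicted packing probability; the summary does not specify the gate's functional form, so this blend, the linear heads, and all names and weights are assumptions for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def gated_malware_probability(features, w_pack, w_mal_packed, w_mal_unpacked):
    """Blend two malware decision boundaries by the inferred packing state.

    features: shared backbone embedding (a plain list of floats here).
    Returns (p_malware, p_packed).
    """
    # Packing head: probability that the input is packed.
    p_packed = sigmoid(dot(w_pack, features))
    # Two malware boundaries, one per packing state.
    logit_packed = dot(w_mal_packed, features)
    logit_unpacked = dot(w_mal_unpacked, features)
    # Soft gate (assumed form): convex combination of the two logits.
    gated_logit = p_packed * logit_packed + (1.0 - p_packed) * logit_unpacked
    return sigmoid(gated_logit), p_packed

# Hypothetical 2-d embedding and weights, chosen only to exercise the gate.
p_mal, p_pack = gated_malware_probability(
    [1.0, 0.0], w_pack=[50.0, 0.0], w_mal_packed=[2.0, 0.0], w_mal_unpacked=[-2.0, 0.0]
)
```

With a confident "packed" prediction the output follows the packed-state boundary almost exactly, which also shows the failure mode raised later in the referee report: a wrong packing prediction sends the input to the wrong boundary.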

If this is right

  • The model reaches a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279 on 200,000 Windows PE byteplot images.
  • It outperforms representative state-of-the-art baselines on all primary malware-detection metrics.
  • Packing detection reaches an AUC of 0.9949.
  • Frequency-weighted losses combined with stratified sampling over joint class-packing strata mitigate training skew.
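
For scale, balanced accuracy is the mean of the true-positive and true-negative rates. The sketch below computes it for a trivial rule, "packed means malware", using the packing prevalences reported in the Figure 2 caption (76.20% of malware and 73.08% of benign samples packed). The per-10,000 counts are an illustrative rescaling of those percentages, not dataset counts.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of true-positive rate and true-negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return 0.5 * (tpr + tnr)

# Rule under test: predict "malware" whenever the file is packed.
# Illustrative counts per 10,000 samples of each class:
tp, fn = 7620, 2380   # malware: packed (caught), unpacked (missed)
tn, fp = 2692, 7308   # benign: unpacked (cleared), packed (false alarm)

packed_only_baseline = balanced_accuracy(tp, fn, tn, fp)
print(round(packed_only_baseline, 4))  # 0.5156
```

A packing-only rule therefore sits close to chance (0.5), which is the baseline against which the reported 0.8521 should be read.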

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-head plus gating pattern could be tested on other binary obfuscation techniques that produce high-entropy images.
  • Extending the approach to additional file formats would require only retraining the heads on new byteplot distributions.
  • Combining the packing signal with lightweight static features might further stabilize performance on edge cases.

Load-bearing premise

Packing state can be accurately inferred from byteplot images at inference time and used to condition malware predictions without introducing systematic errors on real-world packed samples whose packing labels were not seen during training.

What would settle it

A test set containing packed malware and benign samples that use packers absent from the training distribution shows malware-detection metrics falling below those of non-packing-aware baselines.
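
Such a test set could be built with a packer-level split, sketched below. It assumes each sample carries a packer-family label (for example from a packer identification tool); the field names and packer names are hypothetical placeholders.

```python
def split_by_packer(samples, held_out_packers):
    """Send every sample whose packer is held out to the OOD test set.

    samples: list of dicts with 'packer' (str, or None if unpacked)
    and 'is_malware' (bool) keys.
    """
    held = set(held_out_packers)
    train, test_ood = [], []
    for s in samples:
        (test_ood if s["packer"] in held else train).append(s)
    return train, test_ood

def has_both_classes(samples):
    """The OOD set needs packed malware and packed benign files."""
    return {s["is_malware"] for s in samples} == {True, False}

# Hypothetical toy records.
samples = [
    {"id": 1, "packer": "packer_A", "is_malware": True},
    {"id": 2, "packer": "packer_A", "is_malware": False},
    {"id": 3, "packer": None, "is_malware": False},
    {"id": 4, "packer": "packer_C", "is_malware": True},
    {"id": 5, "packer": "packer_C", "is_malware": False},
]
train, test_ood = split_by_packer(samples, {"packer_C"})
```

Training and scoring both ViPER and a non-packing-aware baseline on this split would give the comparison the settling test asks for.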

Figures

Figures reproduced from arXiv: 2606.12949 by Bisma Tahir, Fatima Qaiser, Muhammad Abid Mughal, Nauman Shamim.

Figure 1: Overall architecture of ViPER. A LoRA-adapted DINOv2 ViT-B/14 backbone produces a shared …

Figure 2: Packing label distribution in the dataset. Left: Malware samples (76.20% packed). Right: Benign samples (73.08% packed). The near-symmetric packing prevalence across both classes confirms that packing state alone is an insufficient discriminator.

Figure 3: ROC curves for all four ablation configurations on the test set. The area under each curve corresponds to the AUC values reported in …

Figure 4: Training curves for ViPER (configuration 4). Loss decreases smoothly for both heads across 20 epochs. Validation AUC and balanced accuracy improve consistently through epoch 18 (best checkpoint) before plateauing.

Qaiser et al.: Preprint submitted to Elsevier.

Original abstract

Visualization-based malware detection maps raw binary bytes to grayscale images and applies learned visual classifiers, providing an evasion-resistant and disassembly-free alternative to conventional analysis pipelines. However, executable packing remains a critical failure mode: packed binaries produce high-entropy images that obscure the structural patterns these models rely on. Because packing is also prevalent in benign software (e.g., for compression or copy protection), packing state alone is not a reliable indicator of maliciousness, and existing approaches do not address this challenge within a unified supervised framework. We present ViPER, a Vision-based Packing-Aware Encoder for Robust malware detection. ViPER builds on a LoRA-adapted ViT-B/14 backbone with a dual-head architecture that jointly learns malware classification and packing detection. A packing-aware gating mechanism conditions malware predictions on the inferred packing state, enabling distinct decision boundaries for packed and unpacked inputs. To address packing label skew during training, we employ frequency-weighted losses with stratified sampling over joint class-packing strata. Evaluated on 200,000 Windows PE byteplot images, ViPER achieves a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279, outperforming representative state-of-the-art baselines across all primary metrics, while attaining a packing detection AUC of 0.9949.
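
The byteplot idea in the abstract can be sketched in a few lines. The example assumes a fixed row width and zero padding, neither of which is specified here, and adds a Shannon-entropy helper to show why packed (compressed or encrypted) content looks like noise. Function names are hypothetical.

```python
import math
from collections import Counter

def byteplot(data: bytes, width: int = 256, pad: int = 0):
    """Reshape raw bytes into rows of grayscale pixel values (0-255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Pad the final row so every row has the same length.
    padded = data + bytes([pad]) * (height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

image = byteplot(bytes(range(6)), width=4)
# image == [[0, 1, 2, 3], [4, 5, 0, 0]]
```

A region of uniformly distributed bytes reaches the 8 bits-per-byte ceiling, while a zero-filled region scores 0; packed sections tend toward the former, which is what removes the visual structure a classifier would otherwise use.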

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents ViPER, a LoRA-adapted ViT-B/14 model with a dual-head architecture for joint malware classification and packing detection on Windows PE byteplot images. It introduces a packing-aware gating mechanism that conditions the malware head on the inferred packing state, employs frequency-weighted losses and stratified sampling over joint malware/packing strata to handle skew, and reports balanced accuracy of 0.8521, ROC-AUC of 0.9260, AUPR of 0.9279, and packing detection AUC of 0.9949 on a 200,000-image dataset, outperforming representative baselines.

Significance. If the packing head generalizes and the gating mechanism operates without systematic bias on real-world inputs, the approach would address a key limitation of visualization-based malware detectors by explicitly modeling packing state rather than treating it as noise. The dual-head design with explicit conditioning is a targeted contribution to handling a prevalent failure mode in the domain.

major comments (2)
  1. [Abstract] Abstract and evaluation description: the central robustness claim rests on the packing-aware gating mechanism, yet no results are provided for the packing head or gated malware metrics on packers absent from the training strata; the use of stratified sampling over joint class-packing strata presupposes that test-time packing label distributions match training, but no ablation comparing gated predictions against oracle packing state or on held-out packers is reported, leaving open the risk that packing mispredictions route the malware head to an incorrect boundary.
  2. [Abstract] Abstract: the reported outperformance (balanced accuracy 0.8521, ROC-AUC 0.9260) is presented without any information on baseline implementations, hyperparameter matching, or statistical significance testing, which is load-bearing for the claim that the dual-head and gating design drives the gains rather than implementation differences.
minor comments (2)
  1. The abstract refers to 'representative state-of-the-art baselines' without naming the methods or citing their sources; the main text should explicitly list and reference them.
  2. Details on the exact LoRA rank, scaling factor, and loss weighting coefficients are mentioned as free parameters but not reported numerically; these should be included for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on the evaluation setup and indicate planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation description: the central robustness claim rests on the packing-aware gating mechanism, yet no results are provided for the packing head or gated malware metrics on packers absent from the training strata; the use of stratified sampling over joint class-packing strata presupposes that test-time packing label distributions match training, but no ablation comparing gated predictions against oracle packing state or on held-out packers is reported, leaving open the risk that packing mispredictions route the malware head to an incorrect boundary.

    Authors: We agree that the reported results rely on test data drawn from the same joint malware-packing distribution as training via stratified sampling, and we do not provide separate metrics for the packing head or gated malware classification on packers entirely absent from the training strata. The packing head reaches 0.9949 AUC within the observed packers, and the gating is trained end-to-end to condition the malware head. No oracle-gating or held-out-packer ablations appear in the manuscript because the primary focus is the joint supervised framework under realistic skew rather than explicit OOD packer evaluation. This is a genuine limitation for claims of robustness to novel packers. We will add an explicit discussion of this scope limitation in the revised manuscript and, resources permitting, include a small held-out packer experiment. revision: partial

  2. Referee: [Abstract] Abstract: the reported outperformance (balanced accuracy 0.8521, ROC-AUC 0.9260) is presented without any information on baseline implementations, hyperparameter matching, or statistical significance testing, which is load-bearing for the claim that the dual-head and gating design drives the gains rather than implementation differences.

    Authors: The abstract is a concise summary; the full manuscript (Section 4 and Appendix) specifies that baselines were reimplemented from official repositories or papers, hyperparameters were matched to the originals where possible, and all metrics are reported as means over five random seeds with standard deviations. Statistical significance between ViPER and baselines was evaluated with paired t-tests on the per-seed scores. We will revise the abstract to include a brief parenthetical reference to these evaluation details or move the key implementation notes earlier in the evaluation section for clarity. revision: yes
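
The paired t-test mentioned in this response can be sketched as follows. The per-seed scores are synthetic placeholders chosen for arithmetic clarity; they are not results from the paper, and the function name is hypothetical.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-seed scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("zero variance in differences; t is undefined")
    return mean / math.sqrt(var / n), n - 1

# Synthetic placeholder scores for five seeds (NOT values from the paper).
model_scores = [0.5, 0.6, 0.7, 0.8, 0.9]
baseline_scores = [0.4, 0.4, 0.5, 0.5, 0.6]
t_stat, dof = paired_t_statistic(model_scores, baseline_scores)
```

With five seeds there are 4 degrees of freedom, so the two-sided 5% critical value is about 2.776; a |t| above that would count as significant at that level.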

Circularity Check

0 steps flagged

No circularity: empirical ML results with no self-referential derivations

full rationale

The paper describes a standard dual-head ViT model with gating and weighted losses, then reports held-out test metrics (balanced accuracy 0.8521, ROC-AUC 0.9260, AUPR 0.9279, packing-detection AUC 0.9949) on 200,000 byteplot images. No equations, predictions, or first-principles claims reduce by construction to fitted parameters on the same data; the evaluation is a conventional train/test split with no load-bearing self-citation or ansatz smuggling. The skeptic's concern about unseen packers is an external generalization issue, not a circularity in the reported derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that byteplot images retain sufficient structural signal for both malware and packing classification, that packing labels exist or can be obtained for training, and that standard supervised learning assumptions hold for the joint task.

free parameters (2)
  • LoRA rank and scaling
    Hyperparameters controlling the low-rank adaptation of the ViT backbone; chosen to enable efficient fine-tuning.
  • loss weighting coefficients
    Scalars in the frequency-weighted losses that balance the joint malware and packing objectives.
axioms (2)
  • domain assumption: Byteplot visualization preserves discriminative structural patterns for both malware classification and packing detection
    Invoked by the entire visualization-based pipeline and required for the dual-head approach to succeed.
  • domain assumption: Packing state labels are available or accurately obtainable for the training distribution
    Required for the stratified sampling and frequency-weighted losses described in the abstract.
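
To make the "LoRA rank and scaling" free parameter in the ledger above concrete, the sketch below applies the standard LoRA update W' = W + (alpha / r) * B A to a small matrix and counts trainable parameters. The rank of 8 used in the count is a hypothetical value, since the ledger notes the paper's rank is not reported; 768 is the ViT-B hidden width.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w, b, a, alpha):
    """W' = W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)

# Frozen 2x3 weight, rank-1 adapter, alpha = 2 (toy values).
w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b = [[1.0], [0.0]]
a = [[1.0, 2.0, 3.0]]
w_eff = lora_effective_weight(w, b, a, alpha=2.0)
# w_eff == [[2.0, 4.0, 6.0], [0.0, 0.0, 0.0]]

full = 768 * 768                          # 589,824 weights in one projection
adapter = lora_param_count(768, 768, 8)   # 12,288 with hypothetical rank 8
```

Both the rank r and the scale alpha change the capacity and effective step size of the adapter, which is why the referee asks for their numeric values.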

pith-pipeline@v0.9.1-grok · 5778 in / 1504 out tokens · 29259 ms · 2026-06-27T06:30:07.740230+00:00 · methodology

