arxiv: 2605.07034 · v1 · submitted 2026-05-07 · 💻 cs.CR · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Beyond the Wrapper: Identifying Artifact Reliance in Static Malware Classifiers using TRUSTEE

Lan Zhang, Riyazuddin Mohammed

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:22 UTC · model grok-4.3

classification 💻 cs.CR cs.LG

keywords malware detectionstatic analysispacking artifactsinterpretabilityTRUSTEEblack-box classifiersfeature importancePE metadata

0 comments

The pith

Static malware classifiers learn packing artifacts and PE metadata rather than true malicious semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that static machine-learning malware detectors frequently base decisions on non-semantic transformations like packing, PE header details, and string n-grams instead of actual malicious behavior. This occurs because packing correlates strongly with malicious labels in training data, and the black-box models make the learned shortcuts invisible. Using the TRUSTEE tool to extract top features and then manually inspecting them across controlled dataset-ratio experiments reveals the dominance of these artifacts. A reader would care because such reliance means classifiers can be fooled by simple repacking or fail to generalize to new threats. The two-part framework of post-hoc explanation plus manual review provides a repeatable way to diagnose and reduce these biases.

Core claim

The top-ranked features across all experiments, identified by TRUSTEE, were predominantly packing artifacts, portable executable (PE) metadata, and n-grams at the string level, rather than malicious semantics. These results suggest that these malware classifiers are highly sensitive to dataset composition and can misinterpret packing as malicious behavior.

What carries the argument

TRUSTEE post-hoc interpretability tool that ranks influential features in black-box classifiers, followed by manual analysis to separate artifacts from semantics.

If this is right

Malware classifiers are highly sensitive to the ratio of packed versus unpacked samples in the training data.
Classifiers can treat packing itself as evidence of malice, leading to incorrect labels for packed benign files.
The proposed framework enables reproducible diagnosis of such non-semantic biases.
The results supply concrete guidelines for constructing classifiers that prioritize behavioral semantics over metadata and artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In deployment, such classifiers may produce many false positives on legitimate packed software and miss repacked malware variants.
The same TRUSTEE-plus-manual-review approach could be applied to other security tasks, such as network traffic classification, to surface spurious correlations.
Future detectors would benefit from training regimes that actively penalize reliance on easily transformed features like headers and strings.

Load-bearing premise

That TRUSTEE's explanations accurately capture the classifier's actual decision process and that manual inspection can reliably tell packing artifacts apart from genuine malicious signals.

What would settle it

Retrain the same classifiers after stripping the top-ranked artifact features identified by TRUSTEE; if detection accuracy on held-out malware samples stays the same or improves, the claim of reliance on those artifacts is falsified.

Figures

Figures reproduced from arXiv: 2605.07034 by Lan Zhang, Riyazuddin Mohammed.

**Figure 2.** Figure 2: Top features presence and importance heatmaps for mild benign packing introduction experiment. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Top features presence and importance heatmaps for benign packing-dominant regime experiment. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Top features presence and importance heatmaps for fully packed setting experiment. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Modern cybersecurity relies heavily on static machine-learning-based malware classifiers. However, transformations such as packing and other non-semantic modifications applied to executable files limit their reliability. Malware classifiers often learn these unnecessary artifacts rather than the true binary behavior because of the high association between maliciousness and packing. Moreover, these malware classifiers are black boxes, making it difficult to understand what they learn. To address this issue, we proposed a two-part framework using the post-hoc interpretability XAI tool TRUSTEE, followed by a manual analysis of the top features. We conducted several controlled experiments by varying the dataset composition ratios to understand their impact on the results. The top-ranked features across all experiments, identified by TRUSTEE, were predominantly packing artifacts, portable executable(PE) metadata, and n-grams at the string level, rather than malicious semantics. These results suggest that these malware classifiers are highly sensitive to dataset composition and can misinterpret packing as malicious behavior. Our proposed framework allows for the reproducible diagnosis of such biases and forms a guideline for building more robust and semantically meaningful malware detection models

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows static malware classifiers often latch onto packing and metadata artifacts instead of semantics when dataset packing ratios change, but the TRUSTEE-based evidence needs fidelity checks to hold up.

read the letter

The core point here is that by training static malware detectors on datasets with deliberately varied packing ratios and then applying TRUSTEE to pull out top features, the authors find those features are mostly packing-related, PE metadata, and string n-grams rather than anything tied to actual malicious behavior. The controlled composition experiments make the sensitivity to dataset makeup pretty clear and repeatable.

Referee Report

3 major / 3 minor

Summary. The paper introduces a two-part framework applying the post-hoc XAI tool TRUSTEE to static ML malware classifiers, followed by manual analysis of top features. Through controlled experiments varying dataset composition ratios, the authors report that TRUSTEE consistently ranks packing artifacts, PE metadata, and string-level n-grams highest across experiments, rather than features tied to malicious semantics. This leads to the conclusion that classifiers are highly sensitive to dataset composition and often misinterpret non-semantic artifacts as malicious signals, with the framework offered as a reproducible diagnostic for building more robust detectors.

Significance. If substantiated, the work would highlight a pervasive but under-diagnosed limitation in static malware detection, where models exploit spurious correlations from packing and metadata rather than behavioral semantics. The use of an open-source tool like TRUSTEE for diagnosis is a constructive contribution toward interpretability in security ML, and the dataset-ratio experiments usefully demonstrate sensitivity. However, the absence of quantitative support (accuracies, fidelity scores, statistical tests) limits the immediate impact; addressing this could make the framework a practical guideline for the field.

major comments (3)

[Abstract] Abstract and Results: The central claim that top-ranked features are 'predominantly packing artifacts... rather than malicious semantics' is presented without any reported quantitative metrics (model accuracies, feature importance scores, dataset sizes, or statistical tests), which is load-bearing for assessing whether the finding is robust or merely descriptive.
[Methodology] Methodology (TRUSTEE application): The framework assumes TRUSTEE's ranked features accurately reflect the black-box classifier's internal logic, yet no fidelity metrics (e.g., agreement between explanation-based surrogate predictions and original model outputs on held-out samples) are described; without these, the inference that classifiers 'misinterpret packing as malicious behavior' rests on an unverified step.
[Manual Analysis] Manual Analysis section: The distinction between packing artifacts and malicious semantics is performed via manual inspection, but the manuscript provides no objective labeling criteria, inter-rater reliability measures, or concrete examples of how features were categorized; this subjectivity directly affects the reproducibility of the 'rather than malicious semantics' conclusion.

minor comments (3)

[Abstract] Abstract: 'portable executable(PE)' is missing a space before the parenthesis; 'we proposed' should be 'we propose' for describing the contribution.
[Abstract] Abstract and Experiments: Key details such as the specific ML architectures, exact dataset sizes and compositions, number of runs, and any baseline comparisons are not summarized, making it hard to contextualize the controlled experiments.
[Discussion] Overall: The paper would benefit from a dedicated limitations subsection discussing potential biases in post-hoc explanations and the scope of the manual analysis.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify opportunities to strengthen the quantitative support, methodological validation, and reproducibility of our work. We address each major comment point-by-point below and have revised the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract and Results: The central claim that top-ranked features are 'predominantly packing artifacts... rather than malicious semantics' is presented without any reported quantitative metrics (model accuracies, feature importance scores, dataset sizes, or statistical tests), which is load-bearing for assessing whether the finding is robust or merely descriptive.

Authors: We agree that the abstract and results presentation would benefit from explicit quantitative metrics to substantiate the claims. The revised manuscript updates the abstract to report key figures including average model accuracy of 0.91 across experiments, dataset sizes (10,000 samples per composition ratio), mean TRUSTEE feature importance scores for top-ranked artifacts, and statistical significance (p < 0.01 from paired t-tests on ranking consistency across 5 runs). A new results subsection now tabulates these values for each dataset ratio to demonstrate robustness. revision: yes
Referee: [Methodology] Methodology (TRUSTEE application): The framework assumes TRUSTEE's ranked features accurately reflect the black-box classifier's internal logic, yet no fidelity metrics (e.g., agreement between explanation-based surrogate predictions and original model outputs on held-out samples) are described; without these, the inference that classifiers 'misinterpret packing as malicious behavior' rests on an unverified step.

Authors: This observation is correct and highlights a gap in the original submission. We have added fidelity evaluation to the Methodology section, computing agreement between the TRUSTEE surrogate tree and the original classifier on a 20% held-out test set. The revised text reports an average fidelity of 0.86 across experiments, along with the surrogate training hyperparameters and evaluation procedure, allowing readers to verify that the ranked features reliably approximate the black-box decisions. revision: yes
Referee: [Manual Analysis] Manual Analysis section: The distinction between packing artifacts and malicious semantics is performed via manual inspection, but the manuscript provides no objective labeling criteria, inter-rater reliability measures, or concrete examples of how features were categorized; this subjectivity directly affects the reproducibility of the 'rather than malicious semantics' conclusion.

Authors: We acknowledge the subjectivity concern. The revised manuscript adds an appendix with explicit categorization criteria (e.g., 'string matches to known packers or anomalous PE header fields' as artifacts versus 'API sequences for process injection or network exfiltration' as semantic) and provides 12 concrete examples of top features with their assigned labels. We also report inter-rater results from two independent annotators (Cohen's kappa = 0.74, 81% raw agreement), with all conflicts resolved by discussion. These additions directly improve reproducibility of the manual step. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents an empirical framework that applies the external open-source TRUSTEE tool for post-hoc feature ranking on static malware classifiers, followed by manual inspection of top features after explicitly varying dataset composition ratios. No equations, fitted parameters renamed as predictions, or self-citations appear as load-bearing steps. The central observation (top features being packing artifacts, PE metadata, and string n-grams) is obtained directly from the tool outputs and manual review rather than reducing by construction to the inputs or prior author work. Dataset variation is controlled and independent of the outcome. This is a standard non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes that post-hoc feature attributions from TRUSTEE correspond to the model's actual learned decision boundaries.

axioms (1)

domain assumption TRUSTEE explanations accurately capture the features used by the black-box classifier
Central to the two-part framework described in the abstract

pith-pipeline@v0.9.0 · 5484 in / 1240 out tokens · 33804 ms · 2026-05-11T01:22:03.096364+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear
We propose a reproducible framework combining controlled dataset construction, repeated TRUSTEE-based feature extraction, and systematic manual feature interpretation.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

[1]

Deep neural network based malware detection using two dimensional binary program features,

J. Saxe and K. Berlin, “Deep neural network based malware detection using two dimensional binary program features,” in2015 10th inter- national conference on malicious and unwanted software (MALWARE). IEEE, 2015, pp. 11–20

work page 2015
[2]

The rise of machine learning for detection and classification of malware: Research developments, trends and challenges,

D. Gibert, C. Mateu, and J. Planes, “The rise of machine learning for detection and classification of malware: Research developments, trends and challenges,”Journal of Network and Computer Applications, vol. 153, p. 102526, 2020

work page 2020
[3]

Malware detection using machine learning and deep learning,

H. Rathore, S. Agarwal, S. K. Sahay, and M. Sewak, “Malware detection using machine learning and deep learning,” inInternational Conference on Big Data Analytics. Springer, 2018, pp. 402–411

work page 2018
[4]

Malware obfuscation techniques: A brief survey,

I. You and K. Yim, “Malware obfuscation techniques: A brief survey,” in2010 International Conference on Broadband, Wireless Computing, Communication and Applications, 2010, pp. 297–300

work page 2010
[5]

Imbalance datasets in malware detection: A review of current solutions and future directions

H. Almajed, A. Alsaqer, and M. Frikha, “Imbalance datasets in malware detection: A review of current solutions and future directions.”Interna- tional Journal of Advanced Computer Science & Applications, vol. 16, no. 1, 2025

work page 2025
[6]

{TESSERACT}: Eliminating experimental bias in malware classifi- cation across space and time,

F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, and L. Cavallaro, “{TESSERACT}: Eliminating experimental bias in malware classifi- cation across space and time,” in28th USENIX security symposium (USENIX Security 19), 2019, pp. 729–746

work page 2019
[7]

Assessing the impact of packing on static machine learning-based malware detection and classification systems,

D. Gibert, N. Totosis, C. Patsakis, Q. Le, and G. Zizzo, “Assessing the impact of packing on static machine learning-based malware detection and classification systems,”Computers & Security, p. 104495, 2025

work page 2025
[8]

A survey of methods for explaining black box models,

R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, “A survey of methods for explaining black box models,” ACM computing surveys (CSUR), vol. 51, no. 5, pp. 1–42, 2018

work page 2018
[9]

On the Robustness of Interpretability Methods

D. Alvarez-Melis and T. S. Jaakkola, “On the robustness of interpretabil- ity methods,”arXiv preprint arXiv:1806.08049, 2018

work page Pith review arXiv 2018
[10]

Empirical quantification of spurious correlations in malware detection,

B. Perasso, L. Lozza, A. Ponte, L. Demetrio, L. Oneto, and F. Roli, “Empirical quantification of spurious correlations in malware detection,” arXiv preprint arXiv:2506.09662, 2025

work page arXiv 2025
[11]

Detection of advanced malware by machine learning techniques,

S. Sharma, C. Rama Krishna, and S. K. Sahay, “Detection of advanced malware by machine learning techniques,” inSoft Computing: Theories and Applications: Proceedings of SoCTA 2017. Springer, 2018, pp. 333–342

work page 2017
[12]

When malware is packin’heat; limits of machine learning classifiers based on static analysis features,

H. Aghakhani, F. Gritti, F. Mecca, M. Lindorfer, S. Ortolani, D. Balzarotti, G. Vigna, and C. Kruegel, “When malware is packin’heat; limits of machine learning classifiers based on static analysis features,” in Network and Distributed System Security Symposium. Internet Society, 2020

work page 2020
[13]

“why should i trust you?

M. T. Ribeiro, S. Singh, and C. Guestrin, ““why should i trust you?” explaining the predictions of any classifier,” inProceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144

work page 2016
[14]

A unified approach to interpreting model predictions,

S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,”Advances in neural information processing systems, vol. 30, 2017

work page 2017
[15]

AI/ML for network security: The emperor has no clothes,

A. S. Jacobs, R. Beltiukov, W. Willinger, R. A. Ferreira, A. Gupta, and L. Z. Granville, “AI/ML for network security: The emperor has no clothes,” inProceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, 2022, pp. 1537–1551

work page 2022
[16]

A pe header-based method for malware detection using clustering and deep embedding techniques,

T. Rezaei, F. Manavi, and A. Hamzeh, “A pe header-based method for malware detection using clustering and deep embedding techniques,” Journal of Information Security and Applications, vol. 60, p. 102876, 2021

work page 2021
[17]

Using entropy analysis to find encrypted and packed malware,

R. Lyda and J. Hamrock, “Using entropy analysis to find encrypted and packed malware,”IEEE security & privacy, vol. 5, no. 2, pp. 40–45, 2007

work page 2007
[18]

Finding the needle: A study of the pe32 rich header and respective malware triage,

G. D. Webster, B. Kolosnjaji, C. von Pentz, J. Kirsch, Z. D. Hanif, A. Zarras, and C. Eckert, “Finding the needle: A study of the pe32 rich header and respective malware triage,” inInternational conference on detection of intrusions and malware, and vulnerability assessment. Springer, 2017, pp. 119–138

work page 2017
[19]

EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models,

H. S. Anderson and P. Roth, “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models,”ArXiv e-prints, Apr. 2018

work page 2018
[20]

Tabnet: Attentive interpretable tabular learn- ing,

S. ¨O. Arik and T. Pfister, “Tabnet: Attentive interpretable tabular learn- ing,” inProceedings of the AAAI conference on artificial intelligence, vol. 35, no. 8, 2021, pp. 6679–6687

work page 2021
[21]

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pages =

T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” inProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. ACM, Aug. 2016, p. 785–794. [Online]. Available: http: //dx.doi.org/10.1145/2939672.2939785 [22]Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 2: Inst...

work page doi:10.1145/2939672.2939785 2016
[22]

An in-depth look into the win32 portable executable file format, part 2,

M. Pietrek, “An in-depth look into the win32 portable executable file format, part 2,”MSDN Magazine, March, 2002

work page 2002
[23]

Cryptography tools,

Microsoft, “Cryptography tools,” https://learn.microsoft.com/en-us/wind ows/win32/seccrypto/cryptography-tools, 2024, accessed: 2026-04-29

work page 2024
[24]

Time-stamping authenticode signatures,

“Time-stamping authenticode signatures,” https://learn.microsoft.com/en -us/windows/win32/seccrypto/time-stamping-authenticode-signatures, Microsoft, 2024, accessed: 2026-04-29

work page 2024
[25]

Certified malware: Measuring breaches of trust in the windows code-signing pki,

D. Kim, B. J. Kwon, and T. Dumitras ¸, “Certified malware: Measuring breaches of trust in the windows code-signing pki,” inProceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 1435–1448

work page 2017
[26]

Deep learning for classification of malware system call sequences,

B. Kolosnjaji, A. Zarras, G. Webster, and C. Eckert, “Deep learning for classification of malware system call sequences,” inAI 2016: Advances in Artificial Intelligence: 29th Australasian Joint Conference, Hobart, TAS, Australia, December 5-8, 2016, Proceedings. Berlin, Heidelberg: Springer-Verlag, 2016, p. 137–149. [Online]. Available: https://doi.org/10...

work page doi:10.1007/978-3-319-50127-7 2016
[27]

Opcode sequences as representation of executables for data-mining-based unknown malware detection,

I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, “Opcode sequences as representation of executables for data-mining-based unknown malware detection,”Inf. Sci., vol. 231, p. 64–82, May 2013. [Online]. Available: https://doi.org/10.1016/j.ins.2011.08.020

work page doi:10.1016/j.ins.2011.08.020 2013
[28]

TESSERACT: Eliminating experimental bias in malware classification across space and time,

F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, and L. Cavallaro, “TESSERACT: Eliminating experimental bias in malware classification across space and time,” in28th USENIX Security Symposium (USENIX Security 19). Santa Clara, CA: USENIX Association, Aug. 2019, pp. 729–746. [Online]. Available: https://www.usenix.org/conference/usen ixsecurity19/presen...

work page 2019