pith. machine review for the scientific record.

arxiv: 2604.02490 · v1 · submitted 2026-04-02 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Automated Malware Family Classification using Weighted Hierarchical Ensembles of Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-13 21:20 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords malware classification · large language models · ensemble learning · zero-shot classification · hierarchical classification · cybersecurity · threat detection

The pith

A weighted hierarchical ensemble of large language models classifies malware families without any labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that combining outputs from multiple pretrained large language models in a weighted hierarchical structure allows accurate classification of malware families in a zero-label setting. This matters because it removes the need for the large labeled datasets and supervised training that current methods require, making the approach more scalable against rapidly evolving threats. The method uses decision-level aggregation: models are weighted by their macro-F1 performance and organized to first identify broad malicious behaviors before assigning exact families. A sympathetic reader would care because it suggests a way to leverage existing model knowledge for real-world cybersecurity without retraining.

Core claim

The central claim is that a zero-label malware family classification framework can be built by aggregating decision-level predictions from multiple pretrained large language models using empirically derived macro-F1 weights in a hierarchical structure that first resolves coarse-grained malicious behavior and then assigns fine-grained families, thereby enhancing robustness and reducing individual model instability.

What carries the argument

The weighted hierarchical ensemble of LLMs, which aggregates predictions hierarchically from coarse to fine-grained levels using performance-based weights.
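The coarse-to-fine weighted vote described above can be sketched in a few lines. This is an editorial illustration, not the authors' code: the taxonomy, model names, and weights below are hypothetical, and ties and abstentions are ignored.

```python
from collections import defaultdict

# Hypothetical two-level taxonomy: coarse behavior -> fine-grained families.
TAXONOMY = {
    "trojan":     ["emotet", "zeus"],
    "ransomware": ["lockbit", "wannacry"],
}

def ensemble_classify(predictions, weights):
    """Decision-level weighted hierarchical vote.

    predictions: {model_name: family_label} from each LLM
    weights:     {model_name: macro_f1} calibrated once on a gold set
    """
    family_to_coarse = {f: c for c, fams in TAXONOMY.items() for f in fams}

    # Stage 1: resolve the coarse malicious behavior by weighted vote.
    coarse_votes = defaultdict(float)
    for model, family in predictions.items():
        coarse_votes[family_to_coarse[family]] += weights[model]
    coarse = max(coarse_votes, key=coarse_votes.get)

    # Stage 2: weighted vote among families consistent with the coarse label.
    fine_votes = defaultdict(float)
    for model, family in predictions.items():
        if family_to_coarse[family] == coarse:
            fine_votes[family] += weights[model]
    return coarse, max(fine_votes, key=fine_votes.get)

preds = {"llm_a": "emotet", "llm_b": "zeus", "llm_c": "emotet"}
w = {"llm_a": 0.62, "llm_b": 0.71, "llm_c": 0.58}
print(ensemble_classify(preds, w))  # → ('trojan', 'emotet')
```

Note the design consequence the pith highlights: because the fine-grained vote is restricted to families under the winning coarse label, a single model's stray fine-grained prediction cannot flip the family unless its coarse behavior also wins.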

If this is right

  • Scalability improves in open-world scenarios since no labeled data or retraining is needed.
  • Robustness increases against obfuscation and packing by leveraging complementary model strengths.
  • Alignment with analyst-style reasoning makes the classifications more interpretable.
  • Individual model instability is mitigated through the ensemble weighting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might generalize to classifying other types of cyber threats like phishing or intrusion patterns.
  • Future work could explore dynamic weighting based on input characteristics rather than fixed macro-F1 scores.
  • Testing on larger and more diverse malware datasets could reveal limits in handling very novel variants.

Load-bearing premise

The assumption that pretrained large language models have enough complementary reasoning strengths for accurate malware classification and that fixed macro-F1 weights will consistently boost ensemble results without needing domain-specific adjustments.

What would settle it

A direct comparison on a dataset of newly emerging malware families showing that the ensemble's accuracy does not exceed that of the strongest individual LLM would disprove the claimed benefit of the hierarchical weighting.

Figures

Figures reproduced from arXiv: 2604.02490 by Ali A. Ghorbani, Hamed Jelodar, Mohammad Meymani, Parisa Hamedi, Roozbeh Razavi-Far, Samita Bai, Tochukwu Emmanuel Nwankwo.

Figure 1
Figure 1. A sample from the SBAN dataset.
Figure 2
Figure 2. Zero-shot malware family classification prompt used to elicit decision-level predictions from pretrained large language models.
Figure 3
Figure 3. Proposed zero-shot LLM ensemble pipeline for malware family classification. A shared classification prompt is applied to multiple LLMs, followed by normalization, weighted hierarchical ensembling, and gold-based evaluation.
Figure 4
Figure 4. Accuracy comparison of individual LLMs and ensemble strategies on the 200-sample gold standard. The final weighted hierarchical ensemble achieves the highest overall accuracy.
Figure 5
Figure 5. Prompt sensitivity (ensemble output): Accuracy and Macro-F1 of FinalLabel across prompts P1–P5.
Figure 6
Figure 6. Prompt sensitivity: Macro-F1 across prompts P1–P5 for each LLM and the aggregated output (FinalLabel).
Original abstract

Malware family classification remains a challenging task in automated malware analysis, particularly in real-world settings characterized by obfuscation, packing, and rapidly evolving threats. Existing machine learning and deep learning approaches typically depend on labeled datasets, handcrafted features, supervised training, or dynamic analysis, which limits their scalability and effectiveness in open-world scenarios. This paper presents a zero-label malware family classification framework based on a weighted hierarchical ensemble of pretrained large language models (LLMs). Rather than relying on feature-level learning or model retraining, the proposed approach aggregates decision-level predictions from multiple LLMs with complementary reasoning strengths. Model outputs are weighted using empirically derived macro-F1 scores and organized hierarchically, first resolving coarse-grained malicious behavior before assigning fine-grained malware families. This structure enhances robustness, reduces individual model instability, and aligns with analyst-style reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a zero-label malware family classification framework using a weighted hierarchical ensemble of pretrained large language models (LLMs). It aggregates decision-level predictions from multiple LLMs with complementary reasoning strengths, weights them using empirically derived macro-F1 scores, and applies a hierarchical coarse-to-fine structure to first resolve broad malicious behaviors before assigning specific families, with the goal of improving robustness and reducing model instability without supervised training or feature engineering.

Significance. If the approach can be shown to operate without any labeled data while delivering the claimed robustness gains, it would represent a notable advance in scalable, open-world malware analysis by leveraging off-the-shelf LLMs in a manner that mimics analyst reasoning and avoids the limitations of traditional supervised or dynamic-analysis methods.

major comments (1)
  1. Abstract: The central claim of a 'zero-label' framework is undermined by the statement that model outputs are 'weighted using empirically derived macro-F1 scores'. Macro-F1 computation inherently requires ground-truth family labels on a validation set, which constitutes supervised calibration and directly contradicts the zero-label premise; this load-bearing inconsistency must be resolved with an explicit description of how weights are obtained without access to labeled data.
minor comments (2)
  1. Abstract: No specific LLMs, datasets, evaluation metrics beyond the weighting reference, or experimental results are mentioned, which prevents assessment of whether the hierarchical ensemble actually improves upon individual models.
  2. The manuscript should clarify the exact definition of the hierarchical levels (coarse behaviors to fine families) and how decision-level aggregation is performed across LLMs.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying this important point of clarification regarding the zero-label claim. We address the comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Abstract: The central claim of a 'zero-label' framework is undermined by the statement that model outputs are 'weighted using empirically derived macro-F1 scores'. Macro-F1 computation inherently requires ground-truth family labels on a validation set, which constitutes supervised calibration and directly contradicts the zero-label premise; this load-bearing inconsistency must be resolved with an explicit description of how weights are obtained without access to labeled data.

    Authors: We agree that macro-F1 computation requires ground-truth labels and that this creates an apparent inconsistency with the zero-label framing. The weights are derived from a small, fixed validation set of labeled samples used exclusively for this one-time empirical calibration of LLM contributions; the pretrained LLMs themselves undergo no training, fine-tuning, or feature learning on any data. Once weights are fixed, inference on new samples requires no labels. We will revise the abstract, introduction, and methods sections to explicitly describe the weight derivation process and clarify that 'zero-label' refers to the absence of supervised model adaptation or labeled training data for classification, rather than prohibiting any labeled data for calibration. This revision will be made in the next version of the manuscript. revision: yes
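The one-time calibration the rebuttal describes, weighting each LLM by its macro-F1 on a small labeled gold set, can be sketched as follows. This is a hypothetical illustration: the model names and labels are invented, and the real pipeline's label normalization step is omitted.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the label set present in y_true."""
    f1s = []
    for lbl in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lbl and p == lbl)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lbl and p == lbl)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lbl and p != lbl)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def calibrate_weights(gold_labels, model_predictions):
    """One-time calibration: weight each LLM by its macro-F1 on the gold set.

    After this step, inference on new samples needs no labels; the weights
    are frozen, which is the sense of 'zero-label' the authors defend.
    """
    return {m: macro_f1(gold_labels, preds)
            for m, preds in model_predictions.items()}

gold = ["emotet", "zeus", "emotet", "lockbit"]
preds = {
    "llm_a": ["emotet", "zeus", "emotet", "lockbit"],    # perfect
    "llm_b": ["emotet", "emotet", "emotet", "lockbit"],  # misses zeus
}
w = calibrate_weights(gold, preds)
```

Because macro-F1 averages per-class scores, a model that collapses rare families into common ones is penalized even when its overall accuracy is high, which is presumably why the authors chose it over plain accuracy for the weights.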

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The provided abstract and description contain no equations, mathematical derivations, or self-citations that reduce any claimed prediction or result to its inputs by construction. The weighting step is described only as 'empirically derived macro-F1 scores' without any explicit fitting procedure, parameter estimation equations, or reduction showing that outputs equal inputs tautologically. The framework is presented as an empirical aggregation of LLM predictions organized hierarchically; this remains a self-contained methodological description without load-bearing circular steps of the enumerated kinds.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that pretrained LLMs can perform effective zero-shot malware reasoning and that macro-F1-derived weights will produce a superior ensemble. No invented entities are introduced. The only free parameter is the set of model weights fitted empirically from macro-F1 scores.

free parameters (1)
  • LLM weights = empirically derived
    Weights assigned to each LLM based on empirically derived macro-F1 scores to aggregate predictions.
axioms (1)
  • domain assumption Pretrained LLMs have complementary reasoning strengths applicable to malware analysis without fine-tuning
    Invoked to justify the ensemble aggregation and hierarchical structure in the abstract.

pith-pipeline@v0.9.0 · 5468 in / 1367 out tokens · 149082 ms · 2026-05-13T21:20:11.651460+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution

    cs.CR 2026-05 unverdicted novelty 5.0

    LCC-LLM creates a code-centric dataset and RAG-based LLM framework that reaches 0.634 average semantic similarity on 43 malware tasks and 10/10 pass rate in real-world case studies.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · cited by 1 Pith paper

  1. [1]

    Eroğlu Demirkan, E. and Aydos, M.

    Enhancing malware detection via RGB assembly visualization and hybrid deep learning models. Applied Sciences, 15(13):7163, 2025. doi: 10.3390/app15137163.

  2. [2]

    Zhang, J., Bu, H., Wen, H., Chen, Y., Li, L., and Zhu, H.

    When LLMs meet cybersecurity: a systematic literature review. Cybersecurity, 8(1):55, 2025. doi: 10.1186/s42400-025-00361-w.

  3. [3]

    Despite their efficiency, image-based methods offer limited interpretability and degrade under heavy packing and obfuscation

    …introduced grayscale malware visualization, while subsequent studies (Saxe & Berlin, 2015; Eroğlu Demirkan & Aydos, 2025; Zhao et al., 2023) demonstrated improved performance using deep CNNs and richer visual encodings. Despite their efficiency, image-based methods offer limited interpretability and degrade under heavy packing and obfuscation.