
arxiv: 2606.01452 · v1 · pith:73EWKHBSnew · submitted 2026-05-31 · 💻 cs.CR

NetVAD: Foundation-Model Representation Learning for Identifier-Free Unsupervised Intrusion Detection

Pith reviewed 2026-06-28 16:31 UTC · model grok-4.3

classification 💻 cs.CR
keywords unsupervised intrusion detection · network foundation models · variational autoencoder · benign traffic modeling · zero-day attacks · ToN-IoT · IoT-23 · flow-based detection

The pith

A frozen foundation model representation inside an identifier-free VAE detects network attacks at 98% micro F1 when trained only on benign traffic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NetVAD to project representations from a frozen network foundation model into a task-specific latent space inside a variational autoencoder trained exclusively on benign traffic. This setup produces anomaly scores via reconstruction loss without using any packet or flow identifiers. A sympathetic reader would care because the method targets zero-day detection in production networks where labeled attack data is unavailable and identifiers may be stripped. Evaluation on ToN-IoT yields 98% micro F1 and 96% macro F1 at operational false-positive rates, with transparent per-class results showing strong botnet detection but weaker single-packet reconnaissance performance. Ablations establish that large-scale pre-training is required to avoid degradation and that specialized decoders are needed to model the benign manifold tightly enough for reliable separation.

Core claim

NetVAD is a strictly identifier-free variational autoencoder that takes representations from a frozen Foundation Model, projects them into a task-specific latent space, and is trained solely on benign traffic. On ToN-IoT this produces a 98% Micro F1-score and 96% Macro F1-score at an operational false positive rate while reporting results for every attack class; the model reaches 99.6% F1 on Okiru botnet traffic yet shows limitations on single-packet reconnaissance. The architecture relies on the foundation-model representations to encode sufficient structure of the benign manifold so that attack traffic yields measurably higher reconstruction loss.
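
For readers unfamiliar with the setup, the textbook VAE objective and a reconstruction-based anomaly score can be written as below. This is a sketch of the standard formulation, not necessarily the exact loss used in the paper. The symbols are notation assumed here: $h$ is the frozen foundation-model representation of a flow, $z$ the latent variable, $\hat{h}$ the reconstruction, $\beta$ a KL weight, and $\tau$ a decision threshold.

```latex
% Training objective, minimized over benign traffic only (standard beta-VAE form)
\mathcal{L}(h) = \mathbb{E}_{q_\phi(z \mid h)}\big[-\log p_\theta(h \mid z)\big]
               + \beta \, D_{\mathrm{KL}}\big(q_\phi(z \mid h) \,\|\, p(z)\big)

% Anomaly score at inference: reconstruction error of the representation
s(h) = \lVert h - \hat{h} \rVert_2^2 ,
\qquad \text{flag as attack if } s(h) > \tau
```

Under a Gaussian decoder with fixed variance, the first term of the objective reduces to a scaled squared error, which is why squared reconstruction error is a natural choice of score in this sketch.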

What carries the argument

The identifier-free VAE that projects frozen foundation-model representations into a task-specific latent space and scores anomalies by reconstruction loss.

If this is right

  • Performance holds across multiple attack classes when foundation-model representations replace hand-crafted features.
  • Specialized decoder architectures are required to model the complex benign manifold precisely enough for attack separation.
  • Large-scale pre-training of the backbone is essential; removing it causes measurable performance drop.
  • Flow-based foundation models remain limited on single-packet reconnaissance events even with the VAE head.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frozen-representation approach could be tested on other security tasks that currently require task-specific labeled data.
  • Extending the decoder to incorporate packet-level features might close the gap on reconnaissance detection without retraining the backbone.
  • Deployment in networks that drop identifiers would benefit directly from the identifier-free design.
  • The per-class transparency requirement suggests future unsupervised detectors should report attack-type breakdowns rather than aggregate scores only.
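
The difference between aggregate and per-class reporting can be illustrated with a small sketch. The counts below are invented for illustration and are not taken from the paper, and the function names are hypothetical. The example assumes per-class true-positive, false-positive, and false-negative counts are available; the paper's exact averaging procedure is not described on this page.

```python
from typing import Dict, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw counts; returns 0.0 when the class has no support."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    counts: Dict[str, Tuple[int, int, int]]
) -> Tuple[float, float, Dict[str, float]]:
    """Return (micro F1, macro F1, per-class F1) from {class: (tp, fp, fn)}."""
    per_class = {name: f1(*c) for name, c in counts.items()}
    # Macro: unweighted mean of per-class scores, so rare classes count equally.
    macro = sum(per_class.values()) / len(per_class)
    # Micro: pool the counts first, so large classes dominate the result.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro, per_class


# Invented counts: a large, well-detected class and a small, poorly detected one.
counts = {"botnet": (990, 5, 10), "scan": (20, 5, 30)}
micro, macro, per_class = micro_macro_f1(counts)
```

With these invented counts the pooled micro score stays high while the macro score drops, because the weak class is small. An aggregate figure alone would hide that gap, which is the motivation for per-class breakdowns.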

Load-bearing premise

The representations produced by the frozen foundation model contain sufficient information about the benign traffic manifold that a task-specific VAE decoder can reliably assign higher reconstruction loss to attack traffic across all classes.

What would settle it

If reconstruction-loss distributions measured on a held-out set showed attack traffic to be indistinguishable from benign traffic, the separation claim would be falsified.
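
One way to quantify how distinguishable the two loss distributions are is the area under the ROC curve, computed directly from held-out scores. The sketch below is an illustration with invented loss values, not the paper's evaluation code, and the function name `auroc` is an assumption. A value near 0.5 would indicate the overlap that falsifies the claim; a value near 1.0 would indicate clean separation.

```python
from typing import Sequence


def auroc(benign: Sequence[float], attack: Sequence[float]) -> float:
    """Probability that a random attack score exceeds a random benign score.

    Ties count as 0.5. This pairwise form is quadratic in the number of
    scores, which is fine for a sketch but slow on large test sets.
    """
    if not benign or not attack:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for a in attack:
        for b in benign:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack) * len(benign))


# Invented reconstruction losses for illustration only.
benign_losses = [0.10, 0.12, 0.11, 0.15]
separable_attack = [0.40, 0.55, 0.38]     # every attack loss exceeds every benign loss
overlapping_attack = [0.10, 0.12, 0.15]   # attack losses drawn from the benign range

auc_separable = auroc(benign_losses, separable_attack)
auc_overlapping = auroc(benign_losses, overlapping_attack)
```

Computing this per attack class, rather than once over all attacks, would show whether the weaker reconnaissance results come from overlapping loss distributions.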

Figures

Figures reproduced from arXiv: 2606.01452 by Darren Fürst, Patrick Levi, Sebastian Steindl.

Figure 1: NetVAD Architecture. Adjacent text from Section III-A (NetVAD Architecture): "In this work, we treat the FM as a fixed representation backbone and learn a variational adaptation module that operates on the FM's input token embedding space. Specifically, we extract the output of the FM's embedding layer (i.e., the initial token-level representation before contextual encoding) [...]"
Original abstract

Detecting zero-day exploits in production networks requires robust Intrusion Detection Systems (IDS). However, current unsupervised models struggle to match the performance of supervised classifiers, which are trained for specific attacks only. To bridge this gap, we leverage the emerging capabilities of Network Foundation Models. We propose NetVAD, a strictly identifier-free Variational Autoencoder that projects representations from a frozen Foundation Model into a task-specific latent space, trained solely on benign traffic. Evaluated on ToN-IoT and IoT-23, NetVAD achieves highly competitive unsupervised performance. On ToN-IoT, it achieves a 98% Micro F1-score and a 96% Macro F1-score at an operational false positive rate. Unlike prior work, we show the model's performance transparently for all attack-classes of the datasets. While the architecture excels at discerning complex botnet behaviour (99.6% F1 on Okiru), our evaluation reveals limitations of flow-based Foundation Models in detecting single-packet reconnaissance events. Finally, a comprehensive ablation study confirms that while large-scale pre-training is essential to prevent performance degrading, specialised decoder architectures are necessary to precisely model the complex benign manifold, ensuring attacks are caught more reliably, due to a higher reconstruction loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes NetVAD, a strictly identifier-free unsupervised IDS that feeds frozen representations from a network foundation model into a task-specific VAE trained exclusively on benign traffic. It reports competitive performance on ToN-IoT (98% micro F1, 96% macro F1 at operational FPR) and IoT-23, provides per-class F1 scores (strong on botnets like Okiru at 99.6%, weaker on single-packet reconnaissance), and includes an ablation study claiming both large-scale pre-training and specialized decoder architectures are required.

Significance. If the central performance claims hold under rigorous evaluation, the work would meaningfully advance unsupervised network anomaly detection by showing how foundation-model representations can enable a VAE to reliably separate attacks via reconstruction error without identifiers or attack-specific supervision. The explicit per-class transparency and acknowledgment of limitations on reconnaissance flows are positive; the ablation on pre-training and decoder design adds useful evidence if properly controlled.

major comments (2)
  1. [Experiments / Evaluation] Evaluation protocol (Experiments section): the abstract and reported F1 scores (98% micro / 96% macro on ToN-IoT) are presented without any description of dataset splits, number of independent runs, error bars, or statistical tests. This absence directly undermines verification of the central performance claim and the ablation conclusions.
  2. [§4 (or equivalent evaluation subsection)] Threshold selection and operational FPR: the paper states results 'at an operational false positive rate' but provides no concrete procedure for choosing or validating this threshold on held-out benign data, which is load-bearing for the reported F1 numbers and for claims of practical utility.
minor comments (2)
  1. [Abstract / Introduction] The abstract claims the model is 'strictly identifier-free' yet does not explicitly contrast this with prior work that may have used flow identifiers; a short clarification in the introduction would help.
  2. [Method] Notation for the VAE latent space and reconstruction loss could be introduced earlier with a small diagram to aid readers unfamiliar with the foundation-model + VAE pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation protocol and threshold selection. We address each major comment below and will revise the manuscript accordingly to enhance reproducibility and clarity.

  1. Referee: [Experiments / Evaluation] Evaluation protocol (Experiments section): the abstract and reported F1 scores (98% micro / 96% macro on ToN-IoT) are presented without any description of dataset splits, number of independent runs, error bars, or statistical tests. This absence directly undermines verification of the central performance claim and the ablation conclusions.

    Authors: We agree that the Experiments section lacks sufficient detail on dataset splits, the number of independent runs, error bars, and statistical tests, which is necessary for full verification of the reported F1 scores and ablation results. In the revised manuscript, we will expand the evaluation protocol description to specify the exact train/test splits (benign-only training), the number of independent runs (e.g., 5 runs with reported means and standard deviations), inclusion of error bars on all metrics, and any statistical tests performed. These additions will directly address the concern. revision: yes

  2. Referee: [§4 (or equivalent evaluation subsection)] Threshold selection and operational FPR: the paper states results 'at an operational false positive rate' but provides no concrete procedure for choosing or validating this threshold on held-out benign data, which is load-bearing for the reported F1 numbers and for claims of practical utility.

    Authors: We acknowledge that the specific procedure for selecting and validating the operational threshold on held-out benign data was not detailed, which is important for interpreting the F1 scores and practical claims. We will revise the evaluation subsection to explicitly describe the threshold selection method, including how a target FPR is achieved and validated using held-out benign traffic (e.g., via a validation split to set the threshold corresponding to a specific FPR level). This will be incorporated in the next version. revision: yes
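
The threshold procedure discussed in this exchange could look like the following minimal sketch. It assumes anomaly scores on a benign validation split are available and sets the threshold as an empirical quantile of those scores. The function names and the example scores are hypothetical; the paper's actual procedure is not described on this page.

```python
import math
from typing import Sequence


def threshold_at_fpr(benign_val_scores: Sequence[float], target_fpr: float) -> float:
    """Pick a threshold so that at most target_fpr of the benign validation
    scores lie strictly above it."""
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(benign_val_scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one benign validation score")
    # Number of benign validation flows we allow to be flagged.
    allowed = math.floor(target_fpr * n)
    return ordered[n - allowed - 1]


def false_positive_rate(benign_scores: Sequence[float], threshold: float) -> float:
    """Fraction of benign scores flagged as anomalous (score > threshold)."""
    return sum(s > threshold for s in benign_scores) / len(benign_scores)


# Invented scores for illustration only.
val_benign = [i / 100 for i in range(100)]           # benign validation split
tau = threshold_at_fpr(val_benign, target_fpr=0.05)

test_benign = [0.2, 0.5, 0.97, 0.3]                  # separate benign test split
observed_fpr = false_positive_rate(test_benign, tau)
```

Checking `observed_fpr` on a benign split that was not used to set `tau` is the validation step the referee asks for: the threshold is fixed first, and the false positive rate is then measured rather than assumed.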

Circularity Check

0 steps flagged

No significant circularity detected


The paper presents a standard unsupervised VAE trained solely on benign traffic using frozen foundation-model representations, with performance evaluated on separate attack classes in ToN-IoT and IoT-23. No equations, fitted parameters, or self-citations are shown that reduce the reported F1 scores or reconstruction losses to definitions by construction. The central claim rests on external pre-trained models and a conventional reconstruction objective, with explicit ablations and per-class results that remain independently falsifiable; the derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the standard VAE assumption that higher reconstruction error indicates anomalies.


