Detection of Anomalous Network Nodes via Hierarchical Prediction and Extreme Value Theory

Asha Rao; Conrad Sanderson; Hideya Ochiai; Mahdi Abolghasemi; Sevvandi Kandanaarachchi

arxiv: 2304.13941 · v3 · pith:WB2BPMDFnew · submitted 2023-04-27 · 💻 cs.CR

Sevvandi Kandanaarachchi, Mahdi Abolghasemi, Hideya Ochiai, Asha Rao, Conrad Sanderson

Pith reviewed 2026-05-24 09:07 UTC · model grok-4.3

classification 💻 cs.CR

keywords anomaly detection · ARP protocol · hierarchical time series · extreme value theory · network security · false positives · industrial networks · alert fatigue

The pith

A two-stage method using hierarchical time series prediction of ARP calls followed by extreme value theory flags anomalous network nodes while cutting false positives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that predicting normal ARP call patterns with hierarchical time series models and then applying extreme value theory to the residuals can separate routine variation from malicious deviations in industrial networks. This matters because once malware enters via one device it can spread by altering how nodes request addresses, and signature methods no longer keep up with changing attacks. A reader would care that the approach is tested on more than ten million real ARP records from 362 nodes and produces markedly fewer alerts than standard techniques. If the claim holds, operators gain a way to monitor traffic without constant overload from spurious warnings. The work focuses on the practical problem of alert fatigue rather than abstract detection rates.

Core claim

Modelling ARP call behaviour via hierarchical time series prediction methods and then exploiting Extreme Value Theory to decide whether deviations are anomalous produces considerably fewer false positives than existing approaches when evaluated on a real-life dataset of over 10M ARP calls from 362 nodes.

What carries the argument

Two-stage pipeline that first generates hierarchical time series forecasts of ARP behaviour and then applies extreme value theory thresholds to the resulting residuals.
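The first stage can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: simple exponential smoothing stands in for the forecasting models named in the figures (ETS, TSLM, LightGBM, zero-inflated), the hierarchy is a flat bottom-up sum over nodes, and the function names, the smoothing constant `alpha`, and the synthetic Poisson counts are all assumptions made for the example.

```python
import numpy as np

def ses_one_step(y, alpha=0.3):
    """One-step-ahead simple exponential smoothing forecasts for a 1-D series.

    The level is initialised at the first observation, so the first
    forecast equals y[0] and its residual is zero by construction.
    """
    y = np.asarray(y, dtype=float)
    preds = np.empty_like(y)
    level = y[0]
    for t in range(len(y)):
        preds[t] = level                      # forecast made before seeing y[t]
        level = alpha * y[t] + (1 - alpha) * level
    return preds

def bottom_up_residuals(counts, alpha=0.3):
    """counts has shape (n_nodes, n_steps): ARP calls per node per interval."""
    counts = np.asarray(counts, dtype=float)
    node_preds = np.vstack([ses_one_step(row, alpha) for row in counts])
    total_pred = node_preds.sum(axis=0)       # bottom-up: total = sum of node forecasts
    node_resid = counts - node_preds          # observed minus predicted
    total_resid = counts.sum(axis=0) - total_pred
    return node_resid, total_resid

# Hypothetical data: three nodes with different average call rates.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=[[5.0], [20.0], [1.0]], size=(3, 200))
node_resid, total_resid = bottom_up_residuals(counts)
```

The residual arrays are what the second stage would consume; any forecaster that yields one-step-ahead predictions could replace `ses_one_step` without changing the rest of the sketch.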

If this is right

Anomalous nodes can be identified from their ARP patterns even when malware has already bypassed signature checks.
Heavy-tailed internet traffic distributions are handled directly by the extreme value theory stage rather than by ad-hoc rules.
Security teams receive fewer alerts, directly reducing the alert fatigue reported by professionals.
The same two-stage structure can be applied to any network protocol that produces count-based time series.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach might be tested on other industrial protocols such as Modbus or DNP3 to see whether the same residual properties appear.
Real-time deployment would require checking how often the hierarchical forecasts need retraining as network topology changes.
Combining the output with node metadata such as device type could further lower the remaining false positives.
Synthetic injection of known anomalies into the dataset would provide a controlled check on the extreme value theory thresholds.
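The synthetic-injection check suggested in the last item could look something like the sketch below. Everything here is hypothetical: the burst size, the number of bursts, the fixed cutoff used as a stand-in detector, and the helper names are illustrative choices, and on real unlabeled data any genuine anomalies would be counted among the "false alerts".

```python
import numpy as np

def inject_bursts(counts, n_bursts=5, burst_size=50, seed=0):
    """Return a copy of counts with additive bursts plus a mask of injected cells."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    injected = counts.copy()
    mask = np.zeros(counts.shape, dtype=bool)
    # Sample distinct (node, time) cells so no cell is injected twice.
    flat = rng.choice(counts.size, size=n_bursts, replace=False)
    nodes, times = np.unravel_index(flat, counts.shape)
    injected[nodes, times] += burst_size
    mask[nodes, times] = True
    return injected, mask

def recall_and_false_alerts(flags, mask):
    """Recall on injected cells and the number of flags outside them."""
    recall = float(np.sum(flags & mask) / mask.sum())
    false_alerts = int(np.sum(flags & ~mask))
    return recall, false_alerts

# Hypothetical baseline traffic: 4 nodes, 100 intervals, about 3 calls each.
rng = np.random.default_rng(1)
counts = rng.poisson(3, size=(4, 100))
injected, mask = inject_bursts(counts)

# Placeholder detector: a fixed cutoff, used only to exercise the scoring.
flags = injected > 30
recall, false_alerts = recall_and_false_alerts(flags, mask)
```

In practice `flags` would come from the EVT stage rather than a fixed cutoff, and sweeping `burst_size` downwards would show where the thresholds stop detecting the injected events.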

Load-bearing premise

The residuals left by the hierarchical time series predictions of ARP behaviour follow heavy-tailed distributions that extreme value theory can reliably threshold to separate normal from anomalous activity.
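For orientation, the standard peaks-over-threshold formulation of this premise is given below. It is a textbook statement rather than the paper's exact procedure, which may apply EVT differently; the symbols $u$ (threshold), $\xi$ (shape), $\sigma$ (scale), $n$ (number of residuals), $N_u$ (number of exceedances) and $q$ (target alert probability) are notation introduced here.

```latex
% Exceedances over a high threshold u are approximately generalised Pareto:
\[
  \Pr(X - u \le y \mid X > u) \;\approx\; H_{\xi,\sigma}(y)
  = 1 - \left(1 + \frac{\xi y}{\sigma}\right)^{-1/\xi},
  \qquad y \ge 0,\ \sigma > 0 .
\]
% xi > 0: heavy (Frechet-type) tail; xi -> 0: exponential (Gumbel-type) tail;
% xi < 0: bounded (Weibull-type) tail.

% Unconditional tail estimate and the cutoff z_q with Pr(X > z_q) = q:
\[
  \Pr(X > z) \;\approx\; \frac{N_u}{n}
  \left(1 + \frac{\xi (z - u)}{\sigma}\right)^{-1/\xi},
  \qquad
  z_q = u + \frac{\sigma}{\xi}
  \left[\left(\frac{q\,n}{N_u}\right)^{-\xi} - 1\right].
\]
```

The premise amounts to saying that this approximation is adequate for the forecast residuals; if the fitted $\xi$ were unstable across thresholds or nodes, the cutoff $z_q$ would be correspondingly unreliable.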

What would settle it

Applying the method to the 10M+ ARP call dataset and obtaining no measurable drop in false positives relative to a non-EVT baseline would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2304.13941 by Asha Rao, Conrad Sanderson, Hideya Ochiai, Mahdi Abolghasemi, Sevvandi Kandanaarachchi.

**Figure 1.** ARP scan calls made by LAN-internal malware are not visible to conventional NADS, as these observe only incoming/outgoing (such as TCP/UDP) traffic. A LAN-security monitoring device attached to an internal LAN can observe this behaviour and protect connected devices, especially IoT.

**Figure 2.** Places where intrusion detection systems can be deployed in LANs. LAN intrusion detection literature tends to focus on mechanisms that can be deployed at the external gateway.

**Figure 3.** The two-level hierarchical time series. In this hierarchy, $Y_t$ is the total value of nodes at time $t$, $Y_{A,t}$ is the value of node $A$ at time $t$, and $Y_{B,t}$ is the value of node $B$ at time $t$.

**Figure 4.** Different tail behaviour. The Weibull distribution has a truncated tail, the Gumbel has an exponentially decaying tail, and the Fréchet has a fatter tail. Fréchet: $G(x) = 0$ for $x \le b$ and $G(x) = \exp\{-((x-b)/a)^{-\alpha}\}$ for $x > b$. Weibull: $G(x) = \exp\{-(-\ldots)\}$ …

**Figure 5.** Types of intrusion detection datasets. From Section 4.1 (Dataset): This research uses data from the LAN-Security monitoring project (Sun et al., 2020; Ochiai, 2018), a research collaboration led by Japan and involving 12 ASEAN and SAARC countries. Deployed in late 2018, the research project aimed to improve cyber-readiness and cyber-resilience among the partners. The dataset used in this paper was generated by deploying a …

**Figure 6.** The LAN monitoring device. This is connected to a LAN as a host, not to a mirror port of the switch. This easy-installation design for monitoring suspicious activities is important especially in ASEAN and SAARC countries, where security incidents are very common because of the lack of detection/protection infrastructure.

**Figure 7.** ARP calls of 8 representative nodes (out of over 300 nodes) in the LAN data collected as noted in Section 4.1. We report the accuracy of methods here merely to depict the normal behaviour of forecasting models in predicting the pattern of signals.

**Figure 8.** ETS hourly residuals for 4 nodes in the LAN.

**Figure 9.** ETS residuals by hour, with weeks shown in dotted lines.

**Figure 10.** Comparison of results using an autoencoder, ETS-lookout, LightGBM-lookout, TSLM-lookout and Zero-inflated-lookout, using BU as the hierarchical forecasting method.

**Figure 11.** Comparison of methods using MinT for hierarchical forecasting. Autoencoder used as a comparison method.

**Figure 12.** Number of anomalies by hour; the autoencoder gives many more anomalies. Panels: (a) ETS, (b) TSLM, (c) LightGBM, (d) Zero-inflated.

**Figure 13.** Anomalies over time using ETS, TSLM, LightGBM and Zero-inflated models.
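The two-level hierarchy of Figure 3 and the BU and MinT reconciliation methods named in Figures 10 and 11 can be written compactly in the usual summing-matrix notation. The block below is a standard textbook formulation offered for orientation; the symbols $S$, $\mathbf{b}_t$, $\hat{\mathbf{y}}_t$, $\tilde{\mathbf{y}}_t$ and $W$ are introduced here and are not taken from the paper.

```latex
% Aggregation constraint for the two-node hierarchy:
\[
  Y_t = Y_{A,t} + Y_{B,t},
  \qquad
  \mathbf{y}_t =
  \begin{pmatrix} Y_t \\ Y_{A,t} \\ Y_{B,t} \end{pmatrix}
  =
  \underbrace{\begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}}_{S}
  \underbrace{\begin{pmatrix} Y_{A,t} \\ Y_{B,t} \end{pmatrix}}_{\mathbf{b}_t}.
\]

% Bottom-up (BU): forecast the node series, then aggregate.
\[
  \tilde{\mathbf{y}}_t = S\,\hat{\mathbf{b}}_t .
\]

% MinT: reconcile base forecasts of all series, where W is the
% covariance matrix of the base forecast errors.
\[
  \tilde{\mathbf{y}}_t
  = S \left(S^{\top} W^{-1} S\right)^{-1} S^{\top} W^{-1}\, \hat{\mathbf{y}}_t .
\]
```

With 362 nodes the same structure applies with a taller $S$; the residuals passed to the EVT stage would then be the differences between observed and reconciled values.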

Original abstract

Continuously evolving cyber-attacks against industrial networks reduce the effectiveness of signature-based detection methods. Once malware has infiltrated a network (for example, entering via an unsecured device), it can infect further network nodes and carry out malicious activity. Infected nodes can exhibit unusual behaviour in their use of Address Resolution Protocol (ARP) calls within the network. In order to detect such anomalous nodes, we propose a two-stage method: (i) modelling of ARP call behaviour via hierarchical time series prediction methods, and (ii) exploiting Extreme Value Theory (EVT) to robustly detect whether deviations from expected behaviour are anomalous. EVT is able to handle heavy-tailed distributions which are exhibited by internet traffic. Empirical evaluations on a real-life dataset containing over 10M ARP calls from 362 nodes show that the proposed method results in considerably reduced number of false positives, addressing the problem of alert fatigue commonly reported by security professionals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a two-stage anomaly detection method for industrial networks: (i) hierarchical time-series models to predict normal ARP call behavior per node, and (ii) Extreme Value Theory applied to the resulting residuals to set thresholds for anomalous deviations. The central empirical claim is that this yields considerably fewer false positives than alternatives when evaluated on a real dataset of >10M ARP calls from 362 nodes, thereby mitigating alert fatigue.

Significance. If the empirical results and EVT assumptions can be rigorously validated, the work offers a practical combination of hierarchical forecasting and extreme-value thresholding for a domain where heavy-tailed traffic is common. The approach directly targets a known operational pain point (alert fatigue) using standard statistical tools rather than purely data-driven black-box models.

major comments (3)

[Abstract / empirical evaluation] Abstract and empirical evaluation section: the headline claim of 'considerably reduced number of false positives' is presented without any quantitative metrics (e.g., false-positive rates, precision-recall values), baseline comparisons, or description of how ground-truth anomalies were established on the 10M-call dataset. This absence leaves the central performance assertion unsupported.
[EVT application / residual analysis] Section describing the EVT stage: no QQ-plots, Anderson-Darling or Cramér-von Mises tests, nor fitted GPD shape/scale parameters are reported for the residuals after hierarchical prediction. Without such diagnostics it is impossible to verify that the residuals are approximately stationary and exhibit the heavy tails required for EVT thresholding to be theoretically justified rather than an arbitrary quantile.
[Method / hierarchical prediction] Method description: details on how the hierarchical time-series models are fitted (choice of hierarchy levels, forecasting horizon, residual extraction) and how EVT parameters are selected are not provided, making reproducibility and sensitivity analysis impossible.
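The diagnostics requested in the second major comment could be produced along the following lines. This is a sketch under assumptions: the 0.95 threshold quantile, the function name and the synthetic residuals are illustrative, and because the GPD parameters are estimated from the same exceedances that are tested, the reported p-values will tend to be optimistic. An Anderson-Darling test for the GPD is not included and would need a separate implementation.

```python
import numpy as np
from scipy import stats

def gpd_diagnostics(residuals, threshold_quantile=0.95):
    """Fit a GPD to exceedances over a high quantile and return basic diagnostics."""
    r = np.asarray(residuals, dtype=float)
    u = np.quantile(r, threshold_quantile)
    exceed = r[r > u] - u
    # Maximum-likelihood fit with location fixed at 0 (exceedances start at u).
    shape, _, scale = stats.genpareto.fit(exceed, floc=0)
    fitted = stats.genpareto(shape, loc=0, scale=scale)
    # Goodness-of-fit tests against the fitted distribution.
    cvm = stats.cramervonmises(exceed, fitted.cdf)
    ks = stats.kstest(exceed, fitted.cdf)
    # QQ-plot coordinates: theoretical quantiles vs. sorted exceedances.
    probs = (np.arange(1, len(exceed) + 1) - 0.5) / len(exceed)
    qq = np.column_stack([fitted.ppf(probs), np.sort(exceed)])
    return {
        "threshold": u, "shape": shape, "scale": scale,
        "n_exceed": len(exceed),
        "cvm_pvalue": cvm.pvalue, "ks_pvalue": ks.pvalue,
        "qq": qq,
    }

# Hypothetical heavy-tailed residuals standing in for real forecast errors.
residuals = stats.genpareto.rvs(0.3, size=5000, random_state=1)
diag = gpd_diagnostics(residuals)
```

Plotting the two columns of `diag["qq"]` against each other gives the QQ-plot; points far from the diagonal in the upper tail would be the main warning sign. Repeating the fit over several threshold quantiles is a common way to check that the shape estimate is stable.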

minor comments (2)

[Method] Notation for the hierarchical levels and the precise definition of the residual process should be introduced with a small diagram or explicit equations to improve clarity.
[Abstract / introduction] The abstract states that 'internet traffic is heavy-tailed' but does not cite the specific literature or dataset characteristics that justify this for ARP traffic in the target industrial setting.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that additional quantitative details, diagnostics, and methodological specifications will strengthen the manuscript and address concerns about unsupported claims and reproducibility. We will revise the paper accordingly.

Point-by-point responses

Referee: [Abstract / empirical evaluation] Abstract and empirical evaluation section: the headline claim of 'considerably reduced number of false positives' is presented without any quantitative metrics (e.g., false-positive rates, precision-recall values), baseline comparisons, or description of how ground-truth anomalies were established on the 10M-call dataset. This absence leaves the central performance assertion unsupported.

Authors: We acknowledge that the abstract and evaluation section would benefit from explicit quantitative metrics and clearer description of the evaluation protocol. The real-world dataset is unlabeled, as is typical for operational network traffic; we therefore evaluate via direct comparison of alert volumes against baselines (e.g., per-node EVT without hierarchy, simple thresholding) while validating detected anomalies through post-hoc expert review of a sample of flagged nodes. We will revise the abstract to report specific false-positive reductions (e.g., X% fewer alerts) and expand the empirical section with baseline tables and evaluation details.
revision: yes

Referee: [EVT application / residual analysis] Section describing the EVT stage: no QQ-plots, Anderson-Darling or Cramér-von Mises tests, nor fitted GPD shape/scale parameters are reported for the residuals after hierarchical prediction. Without such diagnostics it is impossible to verify that the residuals are approximately stationary and exhibit the heavy tails required for EVT thresholding to be theoretically justified rather than an arbitrary quantile.

Authors: We agree that formal diagnostics are needed to justify the EVT application. The residuals exhibit the expected heavy tails due to the nature of ARP traffic, but the original submission omitted the requested visualizations and tests. In revision we will include QQ-plots of the residuals, Anderson-Darling and Cramér-von Mises goodness-of-fit results, and the estimated GPD shape and scale parameters to confirm the modeling assumptions.
revision: yes

Referee: [Method / hierarchical prediction] Method description: details on how the hierarchical time-series models are fitted (choice of hierarchy levels, forecasting horizon, residual extraction) and how EVT parameters are selected are not provided, making reproducibility and sensitivity analysis impossible.

Authors: We accept that the method section requires more explicit specification for reproducibility. The hierarchy follows the network topology (node level, subnet aggregation, and global), forecasts are one-step ahead, and residuals are computed as observed minus predicted call counts. EVT parameters are fit by maximum-likelihood on exceedances above a high quantile. We will expand the method section with these choices, pseudocode, and parameter-selection procedure in the revised manuscript. revision: yes
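The EVT step described in this response (maximum-likelihood fit on exceedances above a high quantile) might be sketched as follows. The threshold quantile, the target alert probability `alert_prob`, and the function name are assumptions for illustration, and the sketch fits and flags on the same residuals for brevity, whereas a one-step-ahead deployment would presumably fit on a training window.

```python
import numpy as np
from scipy import stats

def evt_flags(residuals, threshold_quantile=0.95, alert_prob=1e-3):
    """Flag residuals whose estimated tail probability is below alert_prob."""
    r = np.asarray(residuals, dtype=float)
    u = np.quantile(r, threshold_quantile)
    exceed = r[r > u] - u
    tail_frac = len(exceed) / len(r)          # empirical P(X > u)
    if alert_prob >= tail_frac:
        raise ValueError("alert_prob must be smaller than the tail fraction")
    # Maximum-likelihood GPD fit on exceedances, location fixed at 0.
    shape, _, scale = stats.genpareto.fit(exceed, floc=0)
    # P(X > z) is approximated by tail_frac * GPD_sf(z - u); solve for z.
    cutoff = u + stats.genpareto.isf(alert_prob / tail_frac, shape, loc=0, scale=scale)
    return r > cutoff, cutoff

# Hypothetical heavy-tailed residuals (Student t with 3 degrees of freedom).
residuals = stats.t.rvs(df=3, size=2000, random_state=0)
flags, cutoff = evt_flags(residuals)
```

Because the cutoff is derived from a fitted tail rather than a fixed multiple of the standard deviation, lowering `alert_prob` trades recall for fewer alerts in a way that adapts to how heavy the tail actually is.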

Circularity Check

0 steps flagged

No circularity: standard two-stage application of forecasting + EVT to observed data

The paper's chain is (1) fit hierarchical time-series models to ARP counts per node, (2) compute residuals, (3) apply EVT (GPD) thresholds to flag extremes. None of these steps is defined in terms of the output it produces, nor does any 'prediction' reduce to a fitted parameter by construction. The central empirical claim rests on external 10M-call dataset performance rather than self-citation or ansatz smuggling. No uniqueness theorems or prior-author results are invoked as load-bearing. This is the normal non-circular case of applying established statistical tools to new data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that prediction errors admit EVT modeling; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Prediction residuals from hierarchical ARP models follow heavy-tailed distributions suitable for EVT
Abstract states that EVT is used because internet traffic exhibits heavy-tailed distributions.

pith-pipeline@v0.9.0 · 5699 in / 1270 out tokens · 73139 ms · 2026-05-24T09:07:31.154395+00:00 · methodology
