Descriptor: Multi-Regional Cloud Honeypot Dataset (MURHCAD)

Enrique Feito-Casares; Ismael Gómez-Talal; José-Luis Rojo-Álvarez

arxiv: 2601.05813 · v1 · submitted 2026-01-09 · 💻 cs.DB · cs.CR

Pith reviewed 2026-05-16 15:39 UTC · model grok-4.3

classification 💻 cs.DB cs.CR

keywords honeynet dataset · cyberattack analysis · cloud honeypots · threat intelligence · anomaly detection · SIP Telnet SMB · geolocation metadata

The pith

A 72-hour multi-regional honeynet dataset records 132,425 attack events from three honeypot implementations (Cowrie, Dionaea, and SentryPeer) running on four Azure VMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the MURHCAD dataset as a high-resolution collection of cyberattack events gathered over 72 continuous hours in June 2025. Three honeypot implementations ran on four geographically dispersed Azure virtual machines and captured detailed records that include timestamps, source IPs, geolocations, autonomous system mappings, targeted ports, and protocol classifications. The authors position the dataset as a ready-to-use resource for independent research into global attack patterns, such as daily rush-hour peaks and protocol dominance. By supplying the raw data together with analysis code, the work removes the need for individual teams to build and maintain their own honeynet infrastructure. Descriptive statistics in the paper already show strong skew in source distribution and clear differences in what each honeypot type records.

Core claim

The authors establish that the MURHCAD dataset supplies enriched, standalone records of 132,425 attack events collected from Cowrie, Dionaea, and SentryPeer honeypots deployed across four Azure regions. Each event carries UTC timestamps, source and destination IP details, autonomous system and organization labels, geolocation coordinates, port targets, honeypot identifiers, and derived temporal and protocol features. Statistical summaries reveal that 2,438 unique source IPs from 95 countries produced the events, with three protocols (SIP, Telnet, SMB) accounting for the large majority and clear peaks occurring at 07:00 and 23:00 UTC. Platform-specific capture patterns appear when the same ge

What carries the argument

The MURHCAD dataset, a structured collection of attack events enriched with temporal, geospatial, and protocol metadata from multiple honeypot types running on cloud VMs.
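
To make the record structure concrete, here is a minimal Python sketch of one event and of the headline summary statistics (unique source IPs, countries, protocol shares). The field names, the `AttackEvent` class, and the `summarize` helper are assumptions made for illustration, not the dataset's actual column names, and the sample records are synthetic.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AttackEvent:
    # Hypothetical field names; the published dataset may label them differently.
    timestamp: datetime  # UTC timestamp of the event
    src_ip: str          # source IP address
    country: str         # geolocated country code of the source
    asn: int             # autonomous system number of the source
    dst_port: int        # targeted port
    honeypot: str        # honeypot identifier (cowrie, dionaea, sentrypeer)
    protocol: str        # standardized protocol label


def summarize(events):
    """Return event count, unique source IPs, countries, and protocol shares."""
    total = len(events)
    protocol_counts = Counter(e.protocol for e in events)
    return {
        "events": total,
        "unique_ips": len({e.src_ip for e in events}),
        "countries": len({e.country for e in events}),
        "protocol_share": {p: c / total for p, c in protocol_counts.items()},
    }


# Synthetic toy records using documentation IP ranges (not taken from the dataset).
t0 = datetime(2025, 6, 9, 7, 0, tzinfo=timezone.utc)
sample = [
    AttackEvent(t0, "198.51.100.1", "US", 64500, 5060, "sentrypeer", "SIP"),
    AttackEvent(t0, "198.51.100.1", "US", 64500, 5060, "sentrypeer", "SIP"),
    AttackEvent(t0, "203.0.113.5", "DE", 64501, 23, "cowrie", "Telnet"),
    AttackEvent(t0, "203.0.113.9", "FR", 64502, 445, "dionaea", "SMB"),
]
summary = summarize(sample)
```

Applied to the full dataset, the same aggregation should be comparable against the figures the paper reports (132,425 events, 2,438 unique source IPs, 95 countries), provided the real columns are mapped onto these assumed fields.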

If this is right

Researchers can conduct standalone analyses of global cyberattack behaviors using only the supplied data and code.
Anomaly detection work gains fine-grained temporal resolution and protocol labels for each event.
Protocol-misuse studies can directly use the standardized classifications and port targeting information.
Threat intelligence efforts benefit from the autonomous system, organization, and geolocation mappings already attached to each record.
Defensive policy design can reference the observed rush-hour peaks and maintenance-induced gaps to schedule monitoring.
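
As a rough illustration of the temporal analyses listed above, the following Python sketch counts events per UTC hour of day and flags long silent intervals that may correspond to maintenance gaps. The helper names and the one-hour gap threshold are assumptions chosen for the example; the paper does not prescribe them, and the timestamps are synthetic.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def events_per_hour_of_day(timestamps):
    """Count events by UTC hour of day (0-23); expects timezone-aware datetimes."""
    return Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)


def find_gaps(timestamps, min_gap):
    """Return (start, end) pairs of consecutive events separated by more than min_gap."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > min_gap]


# Synthetic timestamps (not taken from the dataset).
base = datetime(2025, 6, 9, 7, 0, tzinfo=timezone.utc)
stamps = [
    base,
    base + timedelta(minutes=5),
    base + timedelta(minutes=10),
    base + timedelta(hours=3),
    base + timedelta(hours=16),
]
hourly = events_per_hour_of_day(stamps)
gaps = find_gaps(stamps, min_gap=timedelta(hours=1))  # threshold is an assumption
```

A silent interval is only a candidate gap: a quiet sensor and an offline sensor look the same in event data, so any flagged interval would need to be checked against the maintenance periods the authors document.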

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Longer collection runs using the same setup could test whether the reported daily peaks persist or shift seasonally.
Side-by-side comparison with honeynet data from other cloud providers would reveal whether Azure-specific biases are large.
The platform differences noted between the three honeypots suggest that future datasets should deliberately vary sensor types to close coverage gaps.

Load-bearing premise

A short 72-hour window using only three honeypot programs on Azure VMs captures representative global attack behavior without major timing or platform bias.

What would settle it

Running the identical three-honeypot configuration for several additional weeks or on a different cloud provider and obtaining substantially different protocol shares or temporal peak locations would show the original window was not representative.
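
One way to make "substantially different protocol shares" measurable is a distance between the two share distributions. The Python sketch below uses total variation distance, which is a choice made here for illustration and not a method taken from the paper; the function names and the two toy windows are hypothetical.

```python
from collections import Counter


def protocol_shares(protocols):
    """Map each protocol label to its fraction of all events."""
    counts = Counter(protocols)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two share dicts (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Synthetic protocol labels for two collection windows (not real data).
window_a = ["SIP"] * 6 + ["Telnet"] * 3 + ["SMB"] * 1
window_b = ["SIP"] * 3 + ["Telnet"] * 3 + ["SMB"] * 4

distance = total_variation(protocol_shares(window_a), protocol_shares(window_b))
```

What counts as a "substantial" distance is a judgment call; a replication would need to set that threshold in advance, for example by measuring the distance between sub-windows of the original 72 hours.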

Original abstract

This data article introduces a comprehensive, high-resolution honeynet dataset designed to support standalone analyses of global cyberattack behaviors. Collected over a continuous 72-hour window (June 9 to 11, 2025) on Microsoft Azure, the dataset comprises 132,425 individual attack events captured by three honeypots (Cowrie, Dionaea, and SentryPeer) deployed across four geographically dispersed virtual machines. Each event record includes enriched metadata (UTC timestamps, source/destination IPs, autonomous system and organizational mappings, geolocation coordinates, targeted ports, and honeypot identifiers alongside derived temporal features and standardized protocol classifications). We provide actionable guidance for researchers seeking to leverage this dataset in anomaly detection, protocol-misuse studies, threat intelligence, and defensive policy design. Descriptive statistics highlight significant skew: 2,438 unique source IPs span 95 countries, yet the top 1% of IPs account for 1% of all events, and three protocols dominate: Session Initiation Protocol (SIP), Telnet, Server Message Block (SMB). Temporal analysis uncovers pronounced rush-hour peaks at 07:00 and 23:00 UTC, interspersed with maintenance-induced gaps that reveal operational blind spots. Geospatial mapping further underscores platform-specific biases: SentryPeer captures concentrated SIP floods in North America and Southeast Asia, Cowrie logs Telnet/SSH scans predominantly from Western Europe and the U.S., and Dionaea records SMB exploits around European nodes. By combining fine-grained temporal resolution with rich, contextual geolocation and protocol metadata, this standalone dataset aims to empower reproducible, cloud-scale investigations into evolving cyber threats. Accompanying analysis code and data access details are provided.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MURHCAD dataset, a multi-regional cloud honeypot collection comprising 132,425 attack events captured over 72 hours (June 9-11, 2025) on Microsoft Azure using Cowrie, Dionaea, and SentryPeer honeypots deployed on four VMs. It provides enriched metadata including timestamps, geolocations, protocols, and descriptive statistics on IP distributions, temporal patterns with rush-hour peaks, and platform-specific biases.

Significance. The dataset offers high-resolution data with contextual metadata that could support targeted studies in anomaly detection and protocol misuse within cloud honeypot environments. However, its value for broad standalone global analyses is constrained by the brief collection period and single-cloud provider, as evidenced by the reported skews and gaps.

major comments (1)

[Abstract] The assertion of a 'comprehensive' dataset designed to 'support standalone analyses of global cyberattack behaviors' is not adequately supported, given the 72-hour window, maintenance-induced gaps, and explicit platform-specific geospatial and protocol skews (SentryPeer SIP in NA/SEA, Cowrie Telnet from WE/US, Dionaea SMB in Europe). These documented limitations introduce biases that undermine generalizability for global threat intelligence without additional discussion or mitigation strategies.

minor comments (2)

[Abstract] The description of 'pronounced rush-hour peaks at 07:00 and 23:00 UTC' could benefit from more precise quantification of event rates during these periods versus baselines.
Ensure that data access links and code repositories are clearly provided in the final version for reproducibility.
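
The quantification requested in the first minor comment could take the form of a peak-to-baseline ratio. The Python sketch below divides each peak hour's event count by the median count of the remaining hours. The choice of the median as baseline, the `peak_to_baseline` helper, and the hourly counts are all assumptions for illustration, not figures or methods from the paper.

```python
from statistics import median


def peak_to_baseline(hourly_counts, peak_hours):
    """Ratio of each peak hour's count to the median count of all non-peak hours."""
    baseline = median(c for h, c in hourly_counts.items() if h not in peak_hours)
    if baseline == 0:
        raise ValueError("baseline is zero; ratio is undefined")
    return {h: hourly_counts[h] / baseline for h in peak_hours}


# Synthetic hourly event counts keyed by UTC hour (not real data).
hourly_counts = {h: 100 for h in range(24)}
hourly_counts[7] = 400
hourly_counts[23] = 300

ratios = peak_to_baseline(hourly_counts, peak_hours={7, 23})
```

With only 72 hours of data, each hour of day is observed at most three times, so such ratios would carry wide uncertainty and are best reported alongside the per-day counts.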

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our manuscript describing the MURHCAD dataset. We agree that the abstract's phrasing requires moderation to better reflect the dataset's temporal scope, observed biases, and appropriate use cases. Below we provide a point-by-point response and indicate the revisions we will make.

Point-by-point responses

Referee: [Abstract] The assertion of a 'comprehensive' dataset designed to 'support standalone analyses of global cyberattack behaviors' is not adequately supported, given the 72-hour window, maintenance-induced gaps, and explicit platform-specific geospatial and protocol skews (SentryPeer SIP in NA/SEA, Cowrie Telnet from WE/US, Dionaea SMB in Europe). These documented limitations introduce biases that undermine generalizability for global threat intelligence without additional discussion or mitigation strategies.

Authors: We agree that the current abstract language overstates the dataset's suitability for broad standalone global analyses. The 72-hour collection window, maintenance-induced gaps, single-cloud provider, and documented platform-specific skews (SIP concentration in NA/SEA, Telnet/SSH from WE/US, SMB in Europe) are already described in the manuscript but are not sufficiently foregrounded in the abstract. In the revised version we will (1) remove the words 'comprehensive' and 'standalone analyses of global cyberattack behaviors', (2) rephrase the abstract to emphasize the dataset's value for targeted, high-resolution studies of cloud honeypot traffic, protocol misuse, and anomaly detection within the observed constraints, and (3) add an explicit limitations paragraph in the main text that discusses the short duration, single-provider deployment, and geospatial/protocol biases while suggesting mitigation approaches such as cross-dataset validation. These changes will align the claims with the data and improve transparency.

Revision: yes

Circularity Check

0 steps flagged

No circularity: pure data descriptor with no derivations or predictions

Full rationale

This paper is a data descriptor introducing a 72-hour honeypot dataset collected on Azure. It contains no equations, derivations, predictions, fitted parameters, or load-bearing claims that reduce to self-definitions or self-citations. All content consists of direct descriptive statistics, metadata enrichment, and usage guidance drawn from the raw collection events themselves, with no circular reductions of any kind. The central claim is simply the release and documentation of the dataset, which stands independently without any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data descriptor paper. No free parameters, axioms, or invented entities are required because there are no models, derivations, or theoretical claims.
