HackerSignal: A Large-Scale Multi-Source Dataset Linking Hacker Community Discourse to the CVE Vulnerability Lifecycle
Pith reviewed 2026-05-08 18:12 UTC · model grok-4.3
The pith
A 7.45-million-document dataset links hacker-forum discourse to CVE vulnerabilities, exploits, and fixes across 36 years to support AI-enabled cybersecurity benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HackerSignal aggregates 7.45 million exact-deduplicated documents from 64 sources across eight layers spanning 36 years. Using CVE identifiers as the shared linkage, it maps the full potential exploit-to-vulnerability trajectory across hacker discourse, working exploits, advisories, and fix commits while preserving source release modes, enabling three benchmark tasks with temporal out-of-distribution evaluation.
What carries the argument
The CVE identifier space serves as the shared linkage that connects heterogeneous sources into lifecycle trajectories while supporting temporal splits for out-of-distribution testing.
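The paper does not publish its linkage pipeline in the text reviewed here; a minimal sketch of CVE-based cross-source linkage, assuming each document carries free text, a source layer label, and a timestamp (all field names are illustrative, not the dataset's actual schema), could look like:

```python
import re
from collections import defaultdict

# MITRE's CVE ID format: "CVE-YYYY-NNNN" with four or more digits in the sequence part.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)

def link_by_cve(documents):
    """Group heterogeneous documents into per-CVE trajectories.

    `documents` is an iterable of dicts with hypothetical keys
    `text`, `source_layer`, and `timestamp` (ISO 8601 string).
    Returns {cve_id: [docs sorted by timestamp]}.
    """
    trajectories = defaultdict(list)
    for doc in documents:
        # Deduplicate mentions within one document; normalize case.
        for cve in set(m.upper() for m in CVE_RE.findall(doc["text"])):
            trajectories[cve].append(doc)
    # Sort each trajectory chronologically so the
    # discourse -> exploit -> advisory -> fix ordering is visible.
    for docs in trajectories.values():
        docs.sort(key=lambda d: d["timestamp"])
    return trajectories

docs = [
    {"text": "PoC for CVE-2021-44228 dropped", "source_layer": "forum", "timestamp": "2021-12-10"},
    {"text": "Fix commit for cve-2021-44228", "source_layer": "fix_commit", "timestamp": "2021-12-13"},
]
traj = link_by_cve(docs)
```

Regex extraction of this kind is exactly where the review's linkage-precision worry bites: a quoted, mistyped, or incidental CVE mention attaches an unrelated post to a trajectory.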
If this is right
- The dataset enables CVE linkage retrieval across sources under temporal out-of-distribution conditions.
- It supports 8-class exploit type classification with temporal OOD evaluation splits.
- It allows prospective generalization tests where training and test CVEs are completely disjoint.
- Source-shortcut diagnostics and manual-audit packets help verify data quality for these tasks.
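A prospective, CVE-disjoint temporal split of the kind these bullets describe can be sketched as follows (the record layout is an assumption for illustration):

```python
def temporal_cve_disjoint_split(records, cutoff):
    """Split per-CVE records so train and test CVE sets are disjoint.

    `records` maps a CVE ID to a list of (timestamp, doc) pairs;
    timestamps are ISO 8601 strings, so lexicographic comparison
    is chronological. A CVE goes to train only if ALL its documents
    predate the cutoff, and to test only if its first document
    appears at or after the cutoff; straddling CVEs are dropped
    to avoid temporal leakage across the boundary.
    """
    train, test = {}, {}
    for cve, docs in records.items():
        times = [t for t, _ in docs]
        if max(times) < cutoff:
            train[cve] = docs
        elif min(times) >= cutoff:
            test[cve] = docs
    assert not set(train) & set(test)  # CVE-disjoint by construction
    return train, test

records = {
    "CVE-2019-0001": [("2019-05-01", "forum post")],
    "CVE-2024-0002": [("2024-03-01", "advisory")],
    "CVE-2021-0003": [("2020-11-01", "forum post"), ("2021-06-01", "fix commit")],
}
train, test = temporal_cve_disjoint_split(records, cutoff="2021-01-01")
```

Dropping straddling CVEs is one design choice among several; keeping them in test (with their pre-cutoff documents removed) would trade some leakage risk for coverage.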
Where Pith is reading between the lines
- The temporal OOD setup allows testing of whether models rely on source-specific shortcuts or learn transferable patterns across the vulnerability lifecycle.
- The long time window and preserved release modes support analysis of how information about exploits spreads between communities before official fixes appear.
- Community review of the released audit packets can confirm whether the trajectories reflect actual sequences of events in practice.
Load-bearing premise
Exact deduplication and CVE-based linkage across 64 sources over 36 years produces clean trajectories without substantial missing links or source-specific artifacts that would invalidate the temporal out-of-distribution tasks.
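The premise hinges on what "exact deduplication" means in practice. One common reading, hashing normalized text and keeping the first occurrence, can be sketched as follows; the normalization choices are assumptions, not the paper's pipeline:

```python
import hashlib

def exact_dedup(texts):
    """Keep the first occurrence of each exactly-duplicated document.

    Normalization here (lowercasing, collapsing whitespace) is an
    assumed preprocessing choice. Near-duplicates, e.g. reposts with
    small edits, deliberately survive exact deduplication, which is
    one way source-specific artifacts can persist in the corpus.
    """
    seen, unique = set(), []
    for text in texts:
        norm = " ".join(text.lower().split())
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing keeps memory proportional to the number of unique documents rather than total text size, which matters at the 7.45-million-document scale the paper reports.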
What would settle it
A manual audit of a random sample of the linked trajectories: a high rate of incorrect CVE assignments or missing temporal connections would undermine the benchmark claims, while a low error rate would support them.
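Such an audit reduces to estimating linkage precision from a manually labeled random sample; a Wilson score interval makes the sampling uncertainty explicit. A sketch (the sample size and audit outcome below are hypothetical):

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval (default 95%) for the proportion of
    correct CVE links in a manually audited random sample of n
    linked documents."""
    if n == 0:
        return (0.0, 1.0)
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

# Hypothetical audit: 188 of 200 sampled links judged correct.
lo, hi = wilson_interval(correct=188, n=200)
```

The Wilson interval behaves better than the naive normal approximation near 0 or 1, which matters because a credible linkage precision should sit close to 1.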
Original abstract
We introduce HackerSignal, a benchmark for temporal out-of-distribution cyber threat intelligence (CTI) and cross-source CVE linkage. HackerSignal aggregates 7.45 million exact-deduplicated documents from 64 public forum/source identifiers spanning eight source layers and a 36-year window (1990-2026). In contrast to other publicly accessible cybersecurity datasets, HackerSignal is among the first public benchmark datasets that maps the full potential exploit to vulnerability trajectory from hacker community discourse, exploit databases with working and proof of concept exploits, vulnerability advisories, and software fix commits. HackerSignal creates these linkages through a shared CVE identifier space while preserving source-specific release modes to support a range of unique Artificial Intelligence (AI)-enabled cybersecurity analytics tasks. In this paper, we summarize HackerSignal and illustrate three selected benchmark tasks it uniquely supports: (1) CVE linkage retrieval (cross-source temporal out-of-distribution entity grounding); (2) exploit type classification (8-class vulnerability type prediction with temporal OOD evaluation); and (3) temporal generalization (prospective CVE-disjoint evaluation where C_train and C_test are disjoint). All tasks use temporal splits to evaluate prospective generalization. We release source-shortcut and leakage diagnostics, manual-audit packets, a datasheet, and a release-governance addendum to support the dissemination of the dataset. HackerSignal's code, data, and Croissant metadata are available at hf.co/datasets/BenAmpel/HackerSignal (data) and github.com/BenAmpel/hackersignal (code).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HackerSignal, a large-scale benchmark dataset for temporal out-of-distribution cyber threat intelligence (CTI). It aggregates 7.45 million exact-deduplicated documents from 64 public sources across eight layers spanning 1990-2026, linking hacker community discourse, exploit databases (including working and PoC exploits), vulnerability advisories, and software fix commits through a shared CVE identifier space. The manuscript illustrates three tasks supported by the dataset: (1) CVE linkage retrieval as cross-source temporal OOD entity grounding, (2) 8-class exploit type classification with temporal OOD evaluation, and (3) prospective CVE-disjoint temporal generalization. Supporting artifacts including source-shortcut/leakage diagnostics, manual-audit packets, a datasheet, and release-governance addendum are released alongside the data at the provided Hugging Face and GitHub locations.
Significance. If the CVE-based linkages produce low-noise, high-recall trajectories without substantial missing links or source-specific artifacts, HackerSignal would represent a meaningful advance as one of the first public resources enabling full-trajectory mapping from discourse to remediation for prospective generalization studies in CTI. The release of diagnostics, audit packets, and governance documentation is a positive step toward responsible data dissemination and external validation, addressing common shortcomings in cybersecurity dataset papers.
major comments (1)
- [Abstract and dataset construction] The central claim that the dataset maps clean, usable exploit-to-fix trajectories for the three benchmark tasks rests on 'exact-deduplicated' aggregation and CVE-ID linkage across 64 heterogeneous sources. No quantitative results from the released manual-audit packets or diagnostics are reported (e.g., measured precision/recall of CVE matches, rate of false-positive attachments of unrelated posts to the same CVE, fraction of CVEs lacking any hacker-discourse documents, or prevalence of temporal leakage where posts predate CVE assignment). Without these metrics or an error analysis in the manuscript, it is not possible to verify that the temporal OOD splits and cross-source tasks are free of contamination or missing context that would undermine prospective generalization validity.
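One of the diagnostics this comment asks for, the prevalence of posts that predate CVE assignment, is cheap to state precisely. A sketch, assuming each linked document carries a timestamp and each CVE a publication date (field names are hypothetical):

```python
def temporal_leakage_rate(linked_docs, cve_published):
    """Fraction of linked documents whose timestamp precedes the
    publication date of the CVE they are linked to.

    `linked_docs` is a list of (cve_id, timestamp) pairs and
    `cve_published` maps CVE IDs to publication dates; timestamps
    are ISO 8601 strings, so lexicographic comparison is chronological.
    """
    if not linked_docs:
        return 0.0
    flagged = sum(
        1 for cve, ts in linked_docs
        if cve in cve_published and ts < cve_published[cve]
    )
    return flagged / len(linked_docs)
```

Note that pre-publication posts are not necessarily linkage errors; early discourse preceding official assignment is part of what the dataset claims to capture. They do, however, need to be counted and handled explicitly when temporal splits are keyed to CVE dates.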
minor comments (1)
- [Abstract] The date range '1990-2026' should be clarified to indicate the actual coverage cutoff versus any projected or placeholder data.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need for quantitative validation of the CVE linkages and error analysis. We address the major comment below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Abstract and dataset construction] The central claim that the dataset maps clean, usable exploit-to-fix trajectories for the three benchmark tasks rests on 'exact-deduplicated' aggregation and CVE-ID linkage across 64 heterogeneous sources. No quantitative results from the released manual-audit packets or diagnostics are reported (e.g., measured precision/recall of CVE matches, rate of false-positive attachments of unrelated posts to the same CVE, fraction of CVEs lacking any hacker-discourse documents, or prevalence of temporal leakage where posts predate CVE assignment). Without these metrics or an error analysis in the manuscript, it is not possible to verify that the temporal OOD splits and cross-source tasks are free of contamination or missing context that would undermine prospective generalization validity.
Authors: We agree that the manuscript would benefit from explicit quantitative reporting of the audit and diagnostic results to substantiate the linkage quality and absence of contamination in the temporal OOD splits. While the manual-audit packets, source-shortcut diagnostics, and leakage analyses are released with the dataset to enable external verification, the current text does not include the specific metrics (e.g., CVE match precision/recall, false-positive rates, fraction of CVEs lacking hacker discourse, or temporal leakage statistics). In the revision, we will add a dedicated error analysis section reporting these values directly from the released artifacts, including an assessment of how they impact the three benchmark tasks. This will strengthen the evidence for prospective generalization validity. Revision: yes
Circularity Check
No circularity: dataset construction paper with no derivation chain
Full rationale
The paper is a data-release manuscript that describes aggregation of 7.45 million documents across 64 sources, exact deduplication, and CVE-ID linkage to create trajectories. No equations, fitted parameters, or mathematical derivations appear in the provided text. The three benchmark tasks (retrieval, classification, temporal splits) are defined directly from the constructed data splits rather than derived from prior results or self-referential fits. Linkage is performed via an external shared identifier space (CVE), which is an independent standard, not an internal definition. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur. The work is self-contained as an empirical resource whose validity is assessed by external use, not internal reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: CVE identifiers uniquely and accurately link documents across heterogeneous sources, without significant missing or erroneous mappings.
Reference graph
Works this paper leans on
[1] Akhtar, M., et al. (2024). Croissant: A Metadata Format for ML-Ready Datasets. NeurIPS 2024 D&B Track.
[2] Ampel, B., Samtani, S., Zhu, H., Ullman, S., & Chen, H. (2020). Labeling Hacker Exploits for Proactive Cyber Threat Intelligence: A Deep Transfer Learning Approach. IEEE Intelligence and Security Informatics (ISI). https://doi.org/10.1109/ISI49825.2020.9280548
[3] Ampel, B., Samtani, S., Zhu, H., & Chen, H. (2024). Creating Proactive Cyber Threat Intelligence with Hacker Exploit Labels: A Deep Transfer Learning Approach. MIS Quarterly, 48(1), 137–166. https://doi.org/10.25300/MISQ/2023/17316
[4] Robertson, S., & Jones, K. S. (1994). Simple, Proven Approaches to Text Retrieval. Technical report, University of Cambridge.
[5] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021 D&B Track. arXiv:2104.08663
[6] CISA. (2021–2026). Known Exploited Vulnerabilities Catalog. https://www.cisa.gov/known-exploited-vulnerabilities-catalog
[7] Pastrana, S., Thomas, D. R., Hutchings, A., & Clayton, R. (2018). CrimeBB: Enabling Cybercrime Research on Underground Forums at Scale. WWW 2018.
[8] Deliu, I., Leichter, C., & Franke, K. (2017). Extracting Cyber Threat Intelligence from Hacker Forums: Support Vector Machines versus Convolutional Neural Networks. IEEE International Conference on Big Data. https://doi.org/10.1109/BigData.2017.8258359
[9] Bhandari, G., Naseer, A., & Moonen, L. (2021). CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. PROMISE 2021.
[10] Ranade, P., Piplai, A., Joshi, A., & Finin, T. (2021). CyBERT: Contextualized Embeddings for the Cybersecurity Domain. IEEE International Conference on Big Data, 3334–3342. https://doi.org/10.1109/BigData52589.2021.9671824
[11] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[12] Ebrahimi, M., Samtani, S., Chai, Y., & Chen, H. (2020). Detecting Cyber Threats in Non-English Hacker Forums: An Adversarial Cross-Lingual Knowledge Transfer Approach. IEEE Security and Privacy Workshops, 20–26. https://doi.org/10.1109/SPW50608.2020.00021
[13] Offensive Security. (2003–2026). Exploit Database. https://www.exploit-db.com/. Licensed CC BY-SA 4.0.
[14] Gayanku. (2020). Hacker Forum Posts Dataset. Kaggle. https://www.kaggle.com/gayanku
[15] Gebru, T., et al. (2021). Datasheets for Datasets. Communications of the ACM, 64(12), 86–92.
[16] Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
[17] GitHub. (2017–2026). GitHub Advisory Database. https://github.com/advisories. Licensed CC BY 4.0.
[18] Grisham, J., Samtani, S., Patton, M., & Chen, H. (2017). Identifying Mobile Malware and Key Threat Actors in Online Hacker Forums for Proactive Cyber Threat Intelligence. IEEE Intelligence and Security Informatics (ISI).
[19] Otto, K., Ampel, B., Zhu, H., Samtani, S., & Chen, H. (2021). Exploring the Evolution of Exploit-Sharing Hackers: An Unsupervised Graph Embedding Approach. IEEE Intelligence and Security Informatics (ISI). https://doi.org/10.1109/ISI53945.2021.9624846
[20] NIST. (2002–2026). National Vulnerability Database. https://nvd.nist.gov/. Public domain.
[21] Petroni, F., Piktus, A., Fan, A., Lewis, P., Yazdani, M., et al. (2021). KILT: A Benchmark for Knowledge Intensive Language Tasks. NAACL 2021. https://doi.org/10.18653/v1/2021.naacl-main.200
[22] Nunes, E., Diab, A., Gunn, A., Marin, E., Mishra, V., Paliath, V., Robertson, J., Shakarian, J., Thart, A., & Shakarian, P. (2016). Darknet and Deepnet Mining for Proactive Cybersecurity Threat Intelligence. IEEE Intelligence and Security Informatics (ISI).
[23] Jackaduma. (2022). SecBERT: A Pre-trained Language Model for Cybersecurity. https://huggingface.co/jackaduma/SecBERT
[24] Krippendorff, K. (2008). Systematic and Random Disagreement and the Reliability of Nominal Data. Communication Methods and Measures, 2(4), 323–338. https://doi.org/10.1080/19312450802467134
[25] Gwet, K. L. (2014). Handbook of Inter-Rater Reliability. Advanced Analytics.
[26] Kapoor, S., & Narayanan, A. (2023). Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns, 4(9), 100804. https://doi.org/10.1016/j.patter.2023.100804
[27] Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., et al. (2021). WILDS: A Benchmark of in-the-Wild Distribution Shifts. ICML 2021, PMLR 139.
[28] Han, Z., Li, X., Xing, Z., Liu, H., & Feng, Z. (2017). Learning to Predict Severity of Software Vulnerability Using Only Vulnerability Description. ICSME 2017, 125–136.
[29] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
[30] Samtani, S., Chinn, R., & Chen, H. (2015). Exploring Hacker Assets in Underground Forums. IEEE Intelligence and Security Informatics (ISI), 31–36. https://doi.org/10.1109/ISI.2015.7165935
[31] Samtani, S., Chinn, R., Chen, H., & Nunamaker, J. F. (2017). Exploring Emerging Hacker Assets and Key Hackers for Proactive Cyber Threat Intelligence. Journal of Management Information Systems, 34(4), 1023–1053. https://doi.org/10.1080/07421222.2017.1394049
[32] Samtani, S., Zhu, H., & Chen, H. (2020). Proactively Identifying Emerging Hacker Threats from the Dark Web: A Diachronic Graph Embedding Framework. ACM Transactions on Privacy and Security, 23(4), Article 21. https://doi.org/10.1145/3409289
[33] Samtani, S., Li, W., Benjamin, V., & Chen, H. (2021). Informing Cyber Threat Intelligence through Dark Web Situational Awareness: The AZSecure Hacker Assets Portal. Digital Threats: Research and Practice, 2(4), Article 27. https://doi.org/10.1145/3450972
[34] Samtani, S., Chai, Y., & Chen, H. (2022). Linking Exploits from the Dark Web to Known Vulnerabilities for Proactive Cyber Threat Intelligence: An Attention-Based Deep Structured Semantic Model. MIS Quarterly, 46(2), 911–946.
[36] Rahman, M. R., Mahdavi-Hezaveh, R., & Williams, L. (2023). What Are the Attackers Doing Now? Automating Cyberthreat Intelligence Extraction from Text on Pace with the Changing Threat Landscape: A Survey. ACM Computing Surveys, 55(12), Article 241. https://doi.org/10.1145/3571726
[37] Hughes, J., Pastrana, S., Hutchings, A., Afroz, S., Samtani, S., Li, W., & Marin, E. S. (2024). The Art of Cybercrime Community Research. ACM Computing Surveys, 56(6), Article 155. https://doi.org/10.1145/3639362