HackerSignal: A Large-Scale Multi-Source Dataset Linking Hacker Community Discourse to the CVE Vulnerability Lifecycle
Pith reviewed 2026-05-08 18:12 UTC · model grok-4.3
The pith
A 7.45-million-document dataset links hacker-forum discourse to CVE vulnerabilities, exploits, and fixes across 36 years to support AI-enabled cybersecurity benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HackerSignal aggregates 7.45 million exact-deduplicated documents from 64 sources across eight layers spanning 36 years. Using CVE identifiers as the shared linkage, it maps the full potential exploit-to-vulnerability trajectory across hacker discourse, working exploits, advisories, and fix commits while preserving source release modes, enabling three benchmark tasks with temporal out-of-distribution evaluation.
What carries the argument
The CVE identifier space serves as the shared linkage that connects heterogeneous sources into lifecycle trajectories while supporting temporal splits for out-of-distribution testing.
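The paper does not publish its linkage pipeline in the text reviewed here; a minimal sketch of CVE-based cross-source linkage, assuming each document carries free text, a source layer label, and a timestamp (all field names are illustrative, not the dataset's actual schema), could look like:

```python
import re
from collections import defaultdict

# MITRE's CVE ID format: "CVE-YYYY-NNNN" with four or more digits in the sequence part.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)

def link_by_cve(documents):
    """Group heterogeneous documents into per-CVE trajectories.

    `documents` is an iterable of dicts with hypothetical keys
    `text`, `source_layer`, and `timestamp` (ISO 8601 string).
    Returns {cve_id: [docs sorted by timestamp]}.
    """
    trajectories = defaultdict(list)
    for doc in documents:
        # Deduplicate mentions within one document; normalize case.
        for cve in set(m.upper() for m in CVE_RE.findall(doc["text"])):
            trajectories[cve].append(doc)
    # Sort each trajectory chronologically so the
    # discourse -> exploit -> advisory -> fix ordering is visible.
    for docs in trajectories.values():
        docs.sort(key=lambda d: d["timestamp"])
    return trajectories

docs = [
    {"text": "PoC for CVE-2021-44228 dropped", "source_layer": "forum", "timestamp": "2021-12-10"},
    {"text": "Fix commit for cve-2021-44228", "source_layer": "fix_commit", "timestamp": "2021-12-13"},
]
traj = link_by_cve(docs)
```

Regex extraction of this kind is exactly where the review's linkage-precision worry bites: a quoted, mistyped, or incidental CVE mention attaches an unrelated post to a trajectory.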
If this is right
- The dataset enables CVE linkage retrieval across sources under temporal out-of-distribution conditions.
- It supports 8-class exploit type classification with temporal OOD evaluation splits.
- It allows prospective generalization tests where training and test CVEs are completely disjoint.
- Source-shortcut diagnostics and manual-audit packets help verify data quality for these tasks.
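A prospective, CVE-disjoint temporal split of the kind these bullets describe can be sketched as follows (the record layout is an assumption for illustration):

```python
def temporal_cve_disjoint_split(records, cutoff):
    """Split per-CVE records so train and test CVE sets are disjoint.

    `records` maps a CVE ID to a list of (timestamp, doc) pairs;
    timestamps are ISO 8601 strings, so lexicographic comparison
    is chronological. A CVE goes to train only if ALL its documents
    predate the cutoff, and to test only if its first document
    appears at or after the cutoff; straddling CVEs are dropped
    to avoid temporal leakage across the boundary.
    """
    train, test = {}, {}
    for cve, docs in records.items():
        times = [t for t, _ in docs]
        if max(times) < cutoff:
            train[cve] = docs
        elif min(times) >= cutoff:
            test[cve] = docs
    assert not set(train) & set(test)  # CVE-disjoint by construction
    return train, test

records = {
    "CVE-2019-0001": [("2019-05-01", "forum post")],
    "CVE-2024-0002": [("2024-03-01", "advisory")],
    "CVE-2021-0003": [("2020-11-01", "forum post"), ("2021-06-01", "fix commit")],
}
train, test = temporal_cve_disjoint_split(records, cutoff="2021-01-01")
```

Dropping straddling CVEs is one design choice among several; keeping them in test (with their pre-cutoff documents removed) would trade some leakage risk for coverage.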
Where Pith is reading between the lines
- The temporal OOD setup allows testing of whether models rely on source-specific shortcuts or learn transferable patterns across the vulnerability lifecycle.
- The long time window and preserved release modes support analysis of how information about exploits spreads between communities before official fixes appear.
- Community review of the released audit packets can confirm whether the trajectories reflect actual sequences of events in practice.
Load-bearing premise
Exact deduplication and CVE-based linkage across 64 sources over 36 years produces clean trajectories without substantial missing links or source-specific artifacts that would invalidate the temporal out-of-distribution tasks.
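The premise hinges on what "exact deduplication" means in practice. One common reading, hashing normalized text and keeping the first occurrence, can be sketched as follows; the normalization choices are assumptions, not the paper's pipeline:

```python
import hashlib

def exact_dedup(texts):
    """Keep the first occurrence of each exactly-duplicated document.

    Normalization here (lowercasing, collapsing whitespace) is an
    assumed preprocessing choice. Near-duplicates, e.g. reposts with
    small edits, deliberately survive exact deduplication, which is
    one way source-specific artifacts can persist in the corpus.
    """
    seen, unique = set(), []
    for text in texts:
        norm = " ".join(text.lower().split())
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing keeps memory proportional to the number of unique documents rather than total text size, which matters at the 7.45-million-document scale the paper reports.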
What would settle it
A manual audit of a random sample of the linked trajectories: a high rate of incorrect CVE assignments or missing temporal connections would undermine the benchmark claims, while a low error rate would support them.
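Such an audit reduces to estimating linkage precision from a manually labeled random sample; a Wilson score interval makes the sampling uncertainty explicit. A sketch (the sample size and audit outcome below are hypothetical):

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval (default 95%) for the proportion of
    correct CVE links in a manually audited random sample of n
    linked documents."""
    if n == 0:
        return (0.0, 1.0)
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)

# Hypothetical audit: 188 of 200 sampled links judged correct.
lo, hi = wilson_interval(correct=188, n=200)
```

The Wilson interval behaves better than the naive normal approximation near 0 or 1, which matters because a credible linkage precision should sit close to 1.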
Original abstract
We introduce HackerSignal, a benchmark for temporal out-of-distribution cyber threat intelligence (CTI) and cross-source CVE linkage. HackerSignal aggregates 7.45 million exact-deduplicated documents from 64 public forum/source identifiers spanning eight source layers and a 36-year window (1990-2026). In contrast to other publicly accessible cybersecurity datasets, HackerSignal is among the first public benchmark datasets that maps the full potential exploit to vulnerability trajectory from hacker community discourse, exploit databases with working and proof of concept exploits, vulnerability advisories, and software fix commits. HackerSignal creates these linkages through a shared CVE identifier space while preserving source-specific release modes to support a range of unique Artificial Intelligence (AI)-enabled cybersecurity analytics tasks. In this paper, we summarize HackerSignal and illustrate three selected benchmark tasks it uniquely supports: (1) CVE linkage retrieval (cross-source temporal out-of-distribution entity grounding); (2) exploit type classification (8-class vulnerability type prediction with temporal OOD evaluation); and (3) temporal generalization (prospective CVE-disjoint evaluation where C_train and C_test are disjoint). All tasks use temporal splits to evaluate prospective generalization. We release source-shortcut and leakage diagnostics, manual-audit packets, a datasheet, and a release-governance addendum to support the dissemination of the dataset. HackerSignal's code, data, and Croissant metadata are available at hf.co/datasets/BenAmpel/HackerSignal (data) and github.com/BenAmpel/hackersignal (code).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HackerSignal, a large-scale benchmark dataset for temporal out-of-distribution cyber threat intelligence (CTI). It aggregates 7.45 million exact-deduplicated documents from 64 public sources across eight layers spanning 1990-2026, linking hacker community discourse, exploit databases (including working and PoC exploits), vulnerability advisories, and software fix commits through a shared CVE identifier space. The manuscript illustrates three tasks supported by the dataset: (1) CVE linkage retrieval as cross-source temporal OOD entity grounding, (2) 8-class exploit type classification with temporal OOD evaluation, and (3) prospective CVE-disjoint temporal generalization. Supporting artifacts including source-shortcut/leakage diagnostics, manual-audit packets, a datasheet, and release-governance addendum are released alongside the data at the provided Hugging Face and GitHub locations.
Significance. If the CVE-based linkages produce low-noise, high-recall trajectories without substantial missing links or source-specific artifacts, HackerSignal would represent a meaningful advance as one of the first public resources enabling full-trajectory mapping from discourse to remediation for prospective generalization studies in CTI. The release of diagnostics, audit packets, and governance documentation is a positive step toward responsible data dissemination and external validation, addressing common shortcomings in cybersecurity dataset papers.
major comments (1)
- [Abstract and dataset construction] The central claim that the dataset maps clean, usable exploit-to-fix trajectories for the three benchmark tasks rests on 'exact-deduplicated' aggregation and CVE-ID linkage across 64 heterogeneous sources. No quantitative results from the released manual-audit packets or diagnostics are reported (e.g., measured precision/recall of CVE matches, rate of false-positive attachments of unrelated posts to the same CVE, fraction of CVEs lacking any hacker-discourse documents, or prevalence of temporal leakage where posts predate CVE assignment). Without these metrics or an error analysis in the manuscript, it is not possible to verify that the temporal OOD splits and cross-source tasks are free of contamination or missing context that would undermine prospective generalization validity.
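One of the diagnostics this comment asks for, the prevalence of posts that predate CVE assignment, is cheap to state precisely. A sketch, assuming each linked document carries a timestamp and each CVE a publication date (field names are hypothetical):

```python
def temporal_leakage_rate(linked_docs, cve_published):
    """Fraction of linked documents whose timestamp precedes the
    publication date of the CVE they are linked to.

    `linked_docs` is a list of (cve_id, timestamp) pairs and
    `cve_published` maps CVE IDs to publication dates; timestamps
    are ISO 8601 strings, so lexicographic comparison is chronological.
    """
    if not linked_docs:
        return 0.0
    flagged = sum(
        1 for cve, ts in linked_docs
        if cve in cve_published and ts < cve_published[cve]
    )
    return flagged / len(linked_docs)
```

Note that pre-publication posts are not necessarily linkage errors; early discourse preceding official assignment is part of what the dataset claims to capture. They do, however, need to be counted and handled explicitly when temporal splits are keyed to CVE dates.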
minor comments (1)
- [Abstract] The date range '1990-2026' should be clarified to indicate the actual coverage cutoff versus any projected or placeholder data.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need for quantitative validation of the CVE linkages and error analysis. We address the major comment below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Abstract and dataset construction] The central claim that the dataset maps clean, usable exploit-to-fix trajectories for the three benchmark tasks rests on 'exact-deduplicated' aggregation and CVE-ID linkage across 64 heterogeneous sources. No quantitative results from the released manual-audit packets or diagnostics are reported (e.g., measured precision/recall of CVE matches, rate of false-positive attachments of unrelated posts to the same CVE, fraction of CVEs lacking any hacker-discourse documents, or prevalence of temporal leakage where posts predate CVE assignment). Without these metrics or an error analysis in the manuscript, it is not possible to verify that the temporal OOD splits and cross-source tasks are free of contamination or missing context that would undermine prospective generalization validity.
Authors: We agree that the manuscript would benefit from explicit quantitative reporting of the audit and diagnostic results to substantiate the linkage quality and absence of contamination in the temporal OOD splits. While the manual-audit packets, source-shortcut diagnostics, and leakage analyses are released with the dataset to enable external verification, the current text does not include the specific metrics (e.g., CVE match precision/recall, false-positive rates, fraction of CVEs lacking hacker discourse, or temporal leakage statistics). In the revision, we will add a dedicated error analysis section reporting these values directly from the released artifacts, including an assessment of how they impact the three benchmark tasks. This will strengthen the evidence for prospective generalization validity. Revision: yes
Circularity Check
No circularity: dataset construction paper with no derivation chain
Full rationale
The paper is a data-release manuscript that describes aggregation of 7.45 million documents across 64 sources, exact deduplication, and CVE-ID linkage to create trajectories. No equations, fitted parameters, or mathematical derivations appear in the provided text. The three benchmark tasks (retrieval, classification, temporal splits) are defined directly from the constructed data splits rather than derived from prior results or self-referential fits. Linkage is performed via an external shared identifier space (CVE), which is an independent standard, not an internal definition. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur. The work is self-contained as an empirical resource whose validity is assessed by external use, not internal reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: CVE identifiers uniquely and accurately link documents across heterogeneous sources, without significant missing or erroneous mappings.
Reference graph
Works this paper leans on
[1] Akhtar, M., et al. (2024). Croissant: A Metadata Format for ML-Ready Datasets. NeurIPS 2024 D&B Track.
[2] Ampel, B., Samtani, S., Zhu, H., Ullman, S., & Chen, H. (2020). Labeling Hacker Exploits for Proactive Cyber Threat Intelligence: A Deep Transfer Learning Approach. IEEE Intelligence and Security Informatics (ISI). https://doi.org/10.1109/ISI49825.2020.9280548
[3] Ampel, B., Samtani, S., Zhu, H., & Chen, H. (2024). Creating Proactive Cyber Threat Intelligence with Hacker Exploit Labels: A Deep Transfer Learning Approach. MIS Quarterly, 48(1), 137–166. https://doi.org/10.25300/MISQ/2023/17316
[4] Robertson, S., & Jones, K. S. (1994). Simple, Proven Approaches to Text Retrieval. Technical report, University of Cambridge.
[5] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021 D&B Track. arXiv:2104.08663
[6] CISA. (2021–2026). Known Exploited Vulnerabilities Catalog. https://www.cisa.gov/known-exploited-vulnerabilities-catalog
[7] Pastrana, S., Thomas, D. R., Hutchings, A., & Clayton, R. (2018). CrimeBB: Enabling Cybercrime Research on Underground Forums at Scale. WWW 2018.
[8] Deliu, I., Leichter, C., & Franke, K. (2017). Extracting Cyber Threat Intelligence from Hacker Forums: Support Vector Machines versus Convolutional Neural Networks. IEEE International Conference on Big Data. https://doi.org/10.1109/BigData.2017.8258359
[9] Bhandari, G., Naseer, A., & Moonen, L. (2021). CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. PROMISE 2021.
[10] Ranade, P., Piplai, A., Joshi, A., & Finin, T. (2021). CyBERT: Contextualized Embeddings for the Cybersecurity Domain. IEEE International Conference on Big Data, 3334–3342. https://doi.org/10.1109/BigData52589.2021.9671824
[11] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[12] Ebrahimi, M., Samtani, S., Chai, Y., & Chen, H. (2020). Detecting Cyber Threats in Non-English Hacker Forums: An Adversarial Cross-Lingual Knowledge Transfer Approach. IEEE Security and Privacy Workshops, 20–26. https://doi.org/10.1109/SPW50608.2020.00021
[13] Offensive Security. (2003–2026). Exploit Database. https://www.exploit-db.com/. Licensed CC BY-SA 4.0.
[14] Gayanku. (2020). Hacker Forum Posts Dataset. Kaggle. https://www.kaggle.com/gayanku
[15] Gebru, T., et al. (2021). Datasheets for Datasets. Communications of the ACM, 64(12), 86–92.
[16] Cohen, J. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104
[17] GitHub. (2017–2026). GitHub Advisory Database. https://github.com/advisories. Licensed CC BY 4.0.
[18] Grisham, J., Samtani, S., Patton, M., & Chen, H. (2017). Identifying Mobile Malware and Key Threat Actors in Online Hacker Forums for Proactive Cyber Threat Intelligence. IEEE Intelligence and Security Informatics (ISI).
[19] Otto, K., Ampel, B., Zhu, H., Samtani, S., & Chen, H. (2021). Exploring the Evolution of Exploit-Sharing Hackers: An Unsupervised Graph Embedding Approach. IEEE Intelligence and Security Informatics (ISI). https://doi.org/10.1109/ISI53945.2021.9624846
[20] NIST. (2002–2026). National Vulnerability Database. https://nvd.nist.gov/. Public domain.
[21] Petroni, F., Piktus, A., Fan, A., Lewis, P., Yazdani, M., et al. (2021). KILT: A Benchmark for Knowledge Intensive Language Tasks. NAACL 2021. https://doi.org/10.18653/v1/2021.naacl-main.200
[22] Nunes, E., Diab, A., Gunn, A., Marin, E., Mishra, V., Paliath, V., Robertson, J., Shakarian, J., Thart, A., & Shakarian, P. (2016). Darknet and Deepnet Mining for Proactive Cybersecurity Threat Intelligence. IEEE Intelligence and Security Informatics (ISI).
[23] Jackaduma. (2022). SecBERT: A Pre-trained Language Model for Cybersecurity. https://huggingface.co/jackaduma/SecBERT
[24] Krippendorff, K. (2008). Systematic and Random Disagreement and the Reliability of Nominal Data. Communication Methods and Measures, 2(4), 323–338. https://doi.org/10.1080/19312450802467134
[25] Gwet, K. L. (2014). Handbook of Inter-Rater Reliability. Advanced Analytics.
[26] Kapoor, S., & Narayanan, A. (2023). Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns, 4(9), 100804. https://doi.org/10.1016/j.patter.2023.100804
[27] Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., et al. (2021). WILDS: A Benchmark of in-the-Wild Distribution Shifts. ICML 2021, PMLR 139.
[28] Han, Z., Li, X., Xing, Z., Liu, H., & Feng, Z. (2017). Learning to Predict Severity of Software Vulnerability Using Only Vulnerability Description. ICSME 2017, 125–136.
[29] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
[30] Samtani, S., Chinn, R., & Chen, H. (2015). Exploring Hacker Assets in Underground Forums. IEEE Intelligence and Security Informatics (ISI), 31–36. https://doi.org/10.1109/ISI.2015.7165935
[31] Samtani, S., Chinn, R., Chen, H., & Nunamaker, J. F. (2017). Exploring Emerging Hacker Assets and Key Hackers for Proactive Cyber Threat Intelligence. Journal of Management Information Systems, 34(4), 1023–1053. https://doi.org/10.1080/07421222.2017.1394049
[32] Samtani, S., Zhu, H., & Chen, H. (2020). Proactively Identifying Emerging Hacker Threats from the Dark Web: A Diachronic Graph Embedding Framework. ACM Transactions on Privacy and Security, 23(4), Article 21. https://doi.org/10.1145/3409289
[33] Samtani, S., Li, W., Benjamin, V., & Chen, H. (2021). Informing Cyber Threat Intelligence through Dark Web Situational Awareness: The AZSecure Hacker Assets Portal. Digital Threats: Research and Practice, 2(4), Article 27. https://doi.org/10.1145/3450972
[34] Samtani, S., Chai, Y., & Chen, H. (2022). Linking Exploits from the Dark Web to Known Vulnerabilities for Proactive Cyber Threat Intelligence: An Attention-Based Deep Structured Semantic Model. MIS Quarterly, 46(2), 911–946.
[36] Rahman, M. R., Mahdavi-Hezaveh, R., & Williams, L. (2023). What Are the Attackers Doing Now? Automating Cyberthreat Intelligence Extraction from Text on Pace with the Changing Threat Landscape: A Survey. ACM Computing Surveys, 55(12), Article 241. https://doi.org/10.1145/3571726
[37] Hughes, J., Pastrana, S., Hutchings, A., Afroz, S., Samtani, S., Li, W., & Marin, E. S. (2024). The Art of Cybercrime Community Research. ACM Computing Surveys, 56(6), Article 155. https://doi.org/10.1145/3639362