pith. machine review for the scientific record.

arXiv: 2604.06967 · v1 · submitted 2026-04-08 · cs.CR · cs.DB

VulGD: A LLM-Powered Dynamic Open-Access Vulnerability Graph Database

Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3

classification cs.CR · cs.DB
keywords Vulnerability database · Graph database · LLM embeddings · Cybersecurity · Risk assessment · Threat prioritization · Open access · Dynamic database

The pith

VulGD is a dynamic open-access graph database that aggregates vulnerability data and uses LLM embeddings to enhance risk assessment and threat prioritization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VulGD, a system that builds and maintains a graph database of software vulnerabilities by continuously pulling data from public repositories. It includes a web interface and a public API so users can explore connections between vulnerabilities without complex setup. The key addition is the use of embeddings from large language models to represent vulnerability descriptions, which the authors say supports better risk evaluation and better decisions about which threats to address first. A sympathetic reader would care because the relational models behind most public repositories lack native support for interconnected structures, and previous graph approaches were hard to set up and keep current. If the system works as described, it offers a ready-made tool for researchers and security teams to analyze threats.

Core claim

VulGD continuously aggregates cybersecurity data from authoritative repositories into a graph structure, offers unified access through a web interface and public API for interactive exploration, and incorporates LLM embeddings to enrich vulnerability descriptions, thereby facilitating more accurate risk assessment and threat prioritization.

What carries the argument

LLM embeddings that enrich vulnerability description representations within the dynamic graph database.

If this is right

  • Provides real-time multi-source data integration without requiring complex user setup.
  • Enables both expert and non-expert users to perform interactive graph exploration and automated data access.
  • Supports improved vulnerability risk assessment through enriched representations.
  • Aids in threat prioritization for cybersecurity decision-making.
  • Serves as an extensible platform open to public use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be applied to integrate additional data sources like exploit databases or reports on emerging threats.
  • Combining graph traversals with embedding similarity searches might reveal hidden patterns in vulnerability chains.
  • If scaled, it might influence how organizations standardize vulnerability data exchange beyond current formats.
  • This suggests a shift toward hybrid graph-semantic systems for other domains involving interconnected risks.
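
To make the hybrid graph-semantic idea in the list above concrete, here is a minimal Python sketch. It is not taken from the paper and does not use the VulGD API: the node names, edges, and three-dimensional vectors are invented stand-ins for a real vulnerability graph and real LLM description embeddings. It flags pairs of vulnerabilities whose descriptions are similar (by cosine similarity) but which are not connected within a few hops in the graph.

```python
import math
from collections import deque

# Toy graph. Every identifier and edge here is invented for illustration.
EDGES = {
    "CVE-A": ["CWE-X", "product-1"],
    "CVE-B": ["CWE-X", "product-2"],
    "CVE-C": ["CWE-Y", "product-3"],
    "CWE-X": ["CVE-A", "CVE-B"],
    "CWE-Y": ["CVE-C"],
    "product-1": ["CVE-A"],
    "product-2": ["CVE-B"],
    "product-3": ["CVE-C"],
}

# Toy vectors standing in for LLM embeddings of vulnerability descriptions.
EMBEDDINGS = {
    "CVE-A": [1.0, 0.0, 0.0],
    "CVE-B": [0.0, 1.0, 0.0],
    "CVE-C": [0.9, 0.1, 0.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def within_hops(start, max_hops):
    """Breadth-first search: nodes reachable from start in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return seen


def hidden_links(cve, max_hops=2, threshold=0.8):
    """CVEs that are semantically close to `cve` but not within max_hops of it."""
    near = within_hops(cve, max_hops)
    candidates = []
    for other, vec in EMBEDDINGS.items():
        if other == cve or other in near:
            continue
        score = cosine(EMBEDDINGS[cve], vec)
        if score >= threshold:
            candidates.append((other, round(score, 3)))
    return sorted(candidates, key=lambda pair: -pair[1])


print(hidden_links("CVE-A"))
```

In this toy data, CVE-B shares a weakness node with CVE-A and is therefore filtered out as already linked, while CVE-C is structurally distant but textually similar. The hop limit and the similarity threshold are arbitrary choices here and would need tuning on real data.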

Load-bearing premise

The paper assumes that LLM embeddings lead to more accurate vulnerability risk assessment and threat prioritization, but it provides no validation metrics or comparisons to support this improvement.

What would settle it

A study that measures and compares the precision of risk scores or prioritization rankings derived from VulGD against those from standard relational databases or graph systems without LLM embeddings would test the central claim.
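
One way such a comparison could be scored is precision at k: the share of the top-k prioritized vulnerabilities that turn out to matter, for example those later observed to be exploited. The Python sketch below is only an illustration of the metric. The identifiers, both rankings, and the ground-truth set are invented placeholders, not results from the paper or from VulGD.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked items that belong to the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranking[:k]
    return sum(1 for item in top if item in relevant) / k


# Invented placeholder data, for illustration only.
exploited = {"V1", "V4", "V5"}                                 # hypothetical ground truth
ranking_baseline = ["V2", "V1", "V3", "V4", "V5", "V6"]        # e.g. without embeddings
ranking_with_embeddings = ["V1", "V4", "V2", "V5", "V3", "V6"]  # e.g. with embeddings

for name, ranking in [
    ("baseline", ranking_baseline),
    ("with embeddings", ranking_with_embeddings),
]:
    print(name, precision_at_k(ranking, exploited, k=3))
```

A real study would also need a defensible ground truth, several values of k, and a significance test across many rankings. The sketch fixes only the arithmetic of the metric.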

Figures

Figures reproduced from arXiv: 2604.06967 by Hua Wang, Jiao Yin, Jinli Cao, Luat Do.

Figure 1: Overview of the VulGD system architecture.
Figure 2: Design of the sub-pipeline for individual data source.
Figure 3: Workflow of the VulGD dynamic data pipeline.
Figure 4: VulGD web interface with dedicated tools for data retrieval and visualization.
Figure 5: Annual number of CVE entries represented in VulGD, grouped by publication year.

Original abstract

Software vulnerabilities continue to pose significant threats to modern information systems, requiring a timely and accurate risk assessment. Public repositories, such as the National Vulnerability Database and CVE details, are regularly updated, but predominantly utilize relational data models that lack native support for representing complex, interconnected structures. To address this, recent research has proposed graph-based vulnerability models. However, these systems often require complex setup procedures, lack real-time multi-source integration, and offer limited accessibility for direct data retrieval and analysis. We present VulGD, a dynamic open-access vulnerability graph database that continuously aggregates cybersecurity data from authoritative repositories. Designed for both expert and non-expert users, VulGD provides a unified web interface and a public API for interactive graph exploration and automated data access. Additionally, VulGD integrates embeddings from large language models (LLMs) to enrich vulnerability description representations, facilitating more accurate vulnerability risk assessment and threat prioritization. VulGD represents a practical and extensible platform for cybersecurity research and decision-making. The live system is publicly accessible at http://34.129.186.158/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents VulGD, a dynamic open-access vulnerability graph database that continuously aggregates data from sources such as the National Vulnerability Database and CVE repositories. It offers a unified web interface and public API for interactive graph exploration and automated access, while integrating LLM embeddings to enrich vulnerability description representations and thereby facilitate more accurate risk assessment and threat prioritization. The system is positioned as practical, extensible, and publicly accessible via a provided URL.

Significance. If the described integration and accessibility features operate as claimed, VulGD could serve as a convenient platform for cybersecurity researchers and practitioners seeking graph-based vulnerability data. The combination of relational-to-graph conversion with LLM embeddings has conceptual appeal for semantic enrichment. However, the absence of any empirical validation means the claimed accuracy improvements remain unproven and the overall significance for advancing risk assessment is limited to the utility of the data aggregation and interface alone.

major comments (1)
  1. [Abstract] The central claim that LLM embeddings 'facilitate more accurate vulnerability risk assessment and threat prioritization' is presented without any quantitative evaluation, baseline comparisons, metrics (e.g., precision/recall on prioritization tasks), or ablation studies. This assumption is load-bearing for the paper's 'LLM-Powered' framing and is its main differentiator from prior graph-based vulnerability models, yet the manuscript supplies only a system description.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the abstract claim regarding LLM embeddings requires qualification, as the work is primarily a system description. We address the comment below and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract] The central claim that LLM embeddings 'facilitate more accurate vulnerability risk assessment and threat prioritization' is presented without any quantitative evaluation, baseline comparisons, metrics (e.g., precision/recall on prioritization tasks), or ablation studies. This assumption is load-bearing for the paper's 'LLM-Powered' framing and is its main differentiator from prior graph-based vulnerability models, yet the manuscript supplies only a system description.

    Authors: We acknowledge that the manuscript provides a system description without quantitative experiments, ablation studies, or task-specific metrics to substantiate improvements in risk assessment accuracy. The LLM integration is presented as a feature for semantic enrichment of vulnerability descriptions via embeddings, drawing on established NLP techniques, rather than as a fully evaluated component. To address this, we will revise the abstract to replace the phrasing 'facilitating more accurate vulnerability risk assessment and threat prioritization' with 'providing enriched representations to support vulnerability risk assessment and threat prioritization.' We will also add a brief discussion section clarifying the conceptual motivation, citing related work on LLM embeddings in cybersecurity, and explicitly noting that empirical validation of downstream task performance is left for future work. This maintains the 'LLM-Powered' framing as descriptive of the architecture while removing unsupported performance claims.
    revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive system paper without derivations or fitted claims

The manuscript presents VulGD as a constructed platform for aggregating vulnerability data with LLM embeddings for enrichment. No equations, predictions, fitted parameters, or derivation chains exist. The LLM integration is described as a design choice to 'enrich vulnerability description representations' and involves no reduction to prior results by construction, no load-bearing self-citation, and no renaming of known patterns. Claims about improved accuracy are stated as intended benefits but rest on untested assumptions rather than circular logic. This is a standard descriptive systems paper whose central content is independent of any self-referential inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claims rest on domain assumptions about LLM utility and data integration benefits without supporting evidence or external benchmarks detailed in the abstract.

axioms (1)
  • domain assumption: LLM embeddings improve accuracy of vulnerability risk assessment
    Invoked to justify the integration but unsupported by any metrics or comparisons in the provided abstract.

pith-pipeline@v0.9.0 · 5483 in / 1027 out tokens · 33838 ms · 2026-05-10T17:59:41.514351+00:00 · methodology
