arxiv: 2606.19076 · v1 · pith:KEEV2DGVnew · submitted 2026-06-17 · 💻 cs.CR

Compute-Budgeted Exploitability Evidence Graphs for Prospective Vulnerability Triage

Pith reviewed 2026-06-26 20:27 UTC · model grok-4.3

classification 💻 cs.CR
keywords vulnerability triage · exploitability prediction · prospective evaluation · evidence graphs · information leakage · CVEs · security prioritization · budgeted selection

The pith

Budgeted selection from temporal evidence graphs raises leakage-safe prospective recall@50 for CVE exploitability from 0.010 to 0.026.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that exploitability predictions must be evaluated prospectively, using only public evidence visible by a fixed decision time, to avoid leakage from later sources. Public advisories, exploit archives, fix commits and hacker-community discourse are modeled as a temporal evidence graph, and a budgeted selector admits only a few documents per CVE while producing an auditable certificate of signals, timestamps, source layers and leakage flags. On 12012 CVEs this raises leakage-safe recall@50 from a severity-only baseline of 0.010 to 0.026, with two documents per CVE capturing most of the value. A cross-encoder reranker lowers recall to 0.016, and a naive random split with unfiltered evidence inflates apparent recall by 8.5x.

Core claim

The authors claim that assembling temporal evidence graphs, applying a compute budget to select a small number of supporting documents per CVE, and pairing every score with a certificate that lists supporting signals, timestamps, source layers and leakage flags produces higher leakage-safe prospective recall than severity baselines. On 12012 CVEs the method reaches recall@50 of 0.026 versus 0.010 for severity alone, most value is obtained with two documents, and a semantic reranker drops performance to 0.016. The same protocol shows that unfiltered random splits inflate apparent recall by 8.5 times and EPSS-high recall by 5.0 times.
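The selection-plus-certificate idea can be illustrated with a minimal sketch. The paper gives no formal definition of the selector here, so everything below is an assumption made for illustration: the `Evidence` and `Certificate` types, the per-document `signal` value, ranking by strongest signal, scoring by the maximum admitted signal, and the rule that a document is leakage-free only if its timestamp is strictly before the decision time. The CVE identifier and document IDs are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Evidence:
    doc_id: str
    layer: str           # "advisory", "exploit_archive", "fix_commit" or "discourse"
    timestamp: datetime  # when the document became public
    signal: float        # hypothetical exploitation signal in [0, 1]


@dataclass
class Certificate:
    cve_id: str
    decision_time: datetime
    admitted: tuple      # Evidence documents that contributed to the score
    leakage_flags: dict  # doc_id -> True if NOT strictly before decision_time
    score: float


def select_with_budget(cve_id, decision_time, evidence, budget=2):
    """Admit at most `budget` pre-decision documents and certify the result."""
    # Assumed leakage rule: anything at or after the decision time is flagged.
    flags = {e.doc_id: e.timestamp >= decision_time for e in evidence}
    visible = [e for e in evidence if not flags[e.doc_id]]
    # Assumed ranking rule: strongest hypothetical signal first.
    ranked = sorted(visible, key=lambda e: e.signal, reverse=True)
    admitted = tuple(ranked[:budget])
    # Assumed aggregation: the score is the strongest admitted signal.
    score = max((e.signal for e in admitted), default=0.0)
    return Certificate(cve_id, decision_time, admitted, flags, score)


t0 = datetime(2026, 1, 10)
docs = [
    Evidence("adv-1", "advisory", datetime(2026, 1, 8), 0.2),
    Evidence("poc-1", "exploit_archive", datetime(2026, 1, 9), 0.7),
    Evidence("fix-1", "fix_commit", datetime(2026, 1, 9), 0.4),
    Evidence("post-1", "discourse", datetime(2026, 2, 1), 0.9),  # after t0
]
cert = select_with_budget("CVE-0000-0001", t0, docs, budget=2)
```

In this toy case the highest-signal document is published after the decision time, so it is flagged and excluded even though it would dominate a leaky ranking. The certificate keeps the flag so that a reader can audit why the document was not admitted.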

What carries the argument

Temporal evidence graph with budgeted selector and auditable leakage-safe certificates

If this is right

  • Two evidence documents per CVE capture most of the performance gain over the severity baseline.
  • Semantic relevance to a CVE is not the same as evidence of exploitation, as shown by the reranker lowering recall.
  • Strict temporal constraints are required in evaluation protocols, since unfiltered random splits inflate recall by 8.5 times.
  • Auditable certificates enable contestable and reproducible claims about vulnerability prioritization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Vulnerability databases could adopt the certificate format to standardize how prioritization decisions are documented and audited.
  • Future selectors might improve further by weighting evidence layers differently rather than treating all documents equally.
  • The same budgeted graph approach could be tested on streaming CVE data to support ongoing triage rather than batch evaluation.
  • Organizations with internal exploit data might combine the public graph with private signals while preserving the leakage safeguards.

Load-bearing premise

Temporal evidence graphs can be assembled and decision timestamps chosen so that no future information leaks into any CVE score, and the 12012 CVEs form a representative sample of real-world prospective triage.

What would settle it

A replication study on a new cohort of CVEs with every evidence document timestamp strictly before its decision time shows no improvement over the severity baseline of 0.010 or reveals detectable future leakage in the assembled graphs.
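A replication would need a mechanical check of the timestamp condition before any recall number is trusted. The sketch below is one way to write such an audit, assuming a hypothetical input layout: a mapping from CVE ID to decision time and a list of `(cve_id, doc_id, timestamp)` rows. The function name and the data are illustrative, not taken from the paper.

```python
from datetime import datetime


def find_leaks(decision_times, evidence_rows):
    """Return (cve_id, doc_id) pairs that are not strictly pre-decision."""
    leaks = []
    for cve_id, doc_id, timestamp in evidence_rows:
        # A document at or after the decision time counts as future leakage.
        if timestamp >= decision_times[cve_id]:
            leaks.append((cve_id, doc_id))
    return leaks


decision_times = {
    "CVE-0000-0001": datetime(2026, 1, 10),
    "CVE-0000-0002": datetime(2026, 1, 12),
}
rows = [
    ("CVE-0000-0001", "adv-1", datetime(2026, 1, 8)),
    ("CVE-0000-0001", "post-1", datetime(2026, 2, 1)),   # after the cutoff
    ("CVE-0000-0002", "fix-9", datetime(2026, 1, 12)),   # exactly at the cutoff
]
leaks = find_leaks(decision_times, rows)
```

An empty result is necessary for a leakage-safe claim but not sufficient: it cannot detect documents whose recorded timestamps are wrong or that were edited after the decision time.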

Figures

Figures reproduced from arXiv: 2606.19076 by Faruk Alpay, Taylan Alpay.

Figure 1: Recall versus per-CVE evidence budget B. The curve saturates well below the maximum budget, showing that triage value is captured cheaply. The reranker does not help [...]

Figure 2: Evidence-source ablations (KEV recall@50). The dashed line is the severity-only baseline.

Figure 3: Random/leaky versus temporal/leakage-safe evaluation. The difference is the temporal leakage [...]
Original abstract

Defenders cannot patch every newly disclosed vulnerability at once, so exploitability prediction must be evaluated prospectively rather than retrospectively. We study compute-budgeted vulnerability triage in which each CVE is scored only from public evidence visible by a fixed decision time. Advisories, exploit archives, fix commits, and hacker-community discourse are represented as a temporal evidence graph; a budgeted selector admits only a few evidence documents per CVE, and every score is paired with an auditable certificate listing the supporting signals, timestamps, source layers, and leakage flags. On 12012 prospective CVEs from public sources, budgeted evidence selection raises leakage-safe prospective recall@50 from 0.010 for a severity-only baseline to 0.026, while two evidence documents per CVE capture most of the value. A strong cross-encoder reranker lowers prospective recall to 0.016, showing that semantic relevance to a CVE is not the same as evidence of exploitation. Most importantly, a naive random split with unfiltered evidence inflates apparent prospective recall by 8.5x and EPSS-high recall by 5.0x. The main contribution is a leakage-safe evaluation protocol and reproducible evidence certificates for contestable vulnerability-prioritization claims.
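The headline metric is recall@50, and the reported change from 0.010 to 0.026 is a 2.6x lift. The abstract does not spell out the metric, so the sketch below assumes the usual definition: the fraction of all exploited CVEs in the cohort that appear among the k highest-scored CVEs. The function name, the toy scores and the labels are hypothetical.

```python
def recall_at_k(scores, exploited, k=50):
    """Fraction of exploited CVEs that land in the top-k by score.

    scores:    dict mapping cve_id -> predicted score
    exploited: set of cve_ids later observed as exploited
    """
    if not exploited:
        raise ValueError("recall is undefined without positive labels")
    # Highest scores first; ties keep dictionary insertion order.
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(exploited.intersection(top_k)) / len(exploited)


toy_scores = {"A": 0.9, "B": 0.1, "C": 0.8, "D": 0.3}
toy_exploited = {"A", "D"}
toy_recall = recall_at_k(toy_scores, toy_exploited, k=2)  # top-2 is A, C
```

Under this definition the denominator is the number of exploited CVEs, not k, so recall@50 stays small whenever the cohort contains many more than 50 positives.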

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a compute-budgeted framework for prospective exploitability prediction in vulnerability triage. It constructs temporal evidence graphs from public sources (advisories, exploit archives, fix commits, hacker discourse) and applies a selector that limits the number of evidence documents per CVE while recording timestamps, source layers, and leakage flags in auditable certificates. Evaluated on 12012 prospective CVEs, the budgeted selector improves leakage-safe recall@50 from 0.010 (severity-only baseline) to 0.026; two documents per CVE capture most value, a cross-encoder reranker drops performance to 0.016, and naive random splits with unfiltered evidence inflate recall by 8.5x (EPSS-high by 5.0x). The primary contribution is the leakage-safe evaluation protocol and reproducible certificates.

Significance. If the temporal cutoffs are verifiably enforced, the work supplies a concrete, auditable protocol for leakage-free prospective evaluation that directly addresses a known methodological weakness in security ML. The contrast between semantic reranking and exploitation evidence, the demonstration of 8.5x inflation under naive splits, and the emphasis on certificates for contestable claims are all useful contributions. The budgeted approach showing that limited evidence suffices is practically relevant for triage systems.

major comments (2)
  1. [Abstract and evaluation protocol] The headline result (recall@50 rising from 0.010 to 0.026 under leakage-safe conditions) and the 8.5x inflation claim both presuppose that every admitted evidence document has a timestamp strictly before its CVE's decision time and that decision times themselves are chosen without reference to later outcomes. The manuscript records leakage flags but supplies no description of the procedure used to fix decision timestamps per CVE or the pipeline steps that enforce the pre-decision cutoff across advisories, commits, and discourse sources for all 12012 CVEs. This detail is load-bearing for the central claim.
  2. [Dataset construction, prospective CVE selection] The 12012 CVEs are described as 'prospective' from public sources, yet no explicit criteria are given for how the decision time is assigned to each CVE or how the sample is ensured to be representative of real triage scenarios without hindsight. Without this, it is impossible to confirm that the reported lift is free of selection bias or post-hoc leakage.
minor comments (2)
  1. [Methods] The 'budgeted selector' is referenced repeatedly but never given a formal definition, pseudocode, or complexity bound; a short algorithmic description would improve reproducibility.
  2. [Results] A table or figure reporting the per-CVE evidence counts and leakage-flag statistics is missing; adding one would make the 'two documents capture most value' claim easier to verify.
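The statistics requested in the second minor comment could be derived directly from the certificates. The sketch below assumes a hypothetical certificate layout, a dict with an `admitted` list of document IDs and a `leakage_flags` mapping from document ID to a boolean; the paper's actual certificate schema is not described on this page.

```python
from collections import Counter


def summarize_certificates(certs):
    """Tabulate admitted-document counts and leakage-flag totals."""
    certs = list(certs)  # allow generators; the data is traversed three times
    # How many CVEs had 0, 1, 2, ... documents admitted.
    histogram = Counter(len(c["admitted"]) for c in certs)
    # Candidate documents seen, and how many of them were flagged as leakage.
    candidates = sum(len(c["leakage_flags"]) for c in certs)
    flagged = sum(sum(c["leakage_flags"].values()) for c in certs)
    return {
        "admitted_histogram": dict(histogram),
        "candidate_docs": candidates,
        "flagged_docs": flagged,
    }


toy_certs = [
    {"admitted": ["poc-1", "fix-1"],
     "leakage_flags": {"poc-1": False, "fix-1": False, "post-1": True}},
    {"admitted": [], "leakage_flags": {"post-2": True}},
    {"admitted": ["adv-3"], "leakage_flags": {"adv-3": False}},
]
summary = summarize_certificates(toy_certs)
```

A histogram of this kind, reported next to recall at each budget, would let a reader see how many CVEs actually have two or more admissible documents.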

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for explicit, verifiable procedures in our leakage-safe prospective evaluation. The comments correctly note that additional detail on timestamp assignment and CVE selection criteria is required to fully substantiate the central claims. We will revise the manuscript to incorporate these clarifications in the evaluation protocol and dataset construction sections.

Point-by-point responses
  1. Referee: [Abstract and evaluation protocol] The headline result (recall@50 rising from 0.010 to 0.026 under leakage-safe conditions) and the 8.5x inflation claim both presuppose that every admitted evidence document has a timestamp strictly before its CVE's decision time and that decision times themselves are chosen without reference to later outcomes. The manuscript records leakage flags but supplies no description of the procedure used to fix decision timestamps per CVE or the pipeline steps that enforce the pre-decision cutoff across advisories, commits, and discourse sources for all 12012 CVEs. This detail is load-bearing for the central claim.

    Authors: We agree that the procedure for fixing decision timestamps and enforcing pre-decision cutoffs must be described in detail for the leakage-safe claims to be verifiable. In the revised manuscript we will add a dedicated subsection under the evaluation protocol that specifies: decision time per CVE is set to the earliest public disclosure timestamp recorded in NVD or the primary vendor advisory; evidence documents are filtered by comparing their source timestamps against this decision time; and the certificate generation step records the outcome of this comparison as a leakage flag. We will also include a high-level pipeline diagram and a worked example for one CVE showing enforcement across all four evidence layers. revision: yes
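The rule proposed in this response can be sketched in a few lines. This is an illustration under assumptions, not the authors' pipeline: the function names are hypothetical, the vendor advisory timestamp is treated as optional, and "filtered by comparing" is read as flagging any document that is not strictly before the decision time. Note that the second response defines decision time as the NVD publication date alone; the sketch follows the wording of this first response.

```python
from datetime import datetime, timezone


def decision_time(nvd_published, vendor_advisory_published=None):
    """Earliest public disclosure timestamp among the available sources."""
    candidates = [t for t in (nvd_published, vendor_advisory_published)
                  if t is not None]
    if not candidates:
        raise ValueError("no disclosure timestamp available")
    return min(candidates)


def leakage_flag(doc_timestamp, cutoff):
    """True when the document is not strictly before the decision time."""
    return doc_timestamp >= cutoff


# Timestamps are timezone-aware; mixing aware and naive values raises TypeError.
nvd = datetime(2026, 3, 5, 14, 0, tzinfo=timezone.utc)
vendor = datetime(2026, 3, 4, 9, 30, tzinfo=timezone.utc)
cutoff = decision_time(nvd, vendor)

commit_flag = leakage_flag(datetime(2026, 3, 3, tzinfo=timezone.utc), cutoff)
forum_flag = leakage_flag(datetime(2026, 3, 5, tzinfo=timezone.utc), cutoff)
```

Taking the earlier of the two timestamps is the conservative choice: it shrinks the set of admissible evidence, so a document that appeared between the vendor advisory and the NVD entry is flagged rather than admitted.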

  2. Referee: [Dataset construction, prospective CVE selection] The 12012 CVEs are described as 'prospective' from public sources, yet no explicit criteria are given for how the decision time is assigned to each CVE or how the sample is ensured to be representative of real triage scenarios without hindsight. Without this, it is impossible to confirm that the reported lift is free of selection bias or post-hoc leakage.

    Authors: The referee is correct that explicit selection criteria and decision-time rules are needed to demonstrate absence of hindsight bias. We will expand the dataset construction section to state that the 12012 CVEs comprise every entry whose NVD publication date falls inside a fixed temporal window chosen prior to any exploit labeling, with decision time defined uniformly as that NVD publication date. No CVE was included or excluded on the basis of later exploit status. We will also add a short discussion of how this sampling approximates operational triage and note remaining limitations on representativeness. revision: yes
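The cohort rule described in this response depends only on publication dates, which a short sketch makes explicit. The function name, the half-open window convention and the example records are assumptions for illustration; the response does not give the actual window boundaries.

```python
from datetime import datetime


def build_cohort(nvd_records, window_start, window_end):
    """Select every CVE published inside [window_start, window_end).

    nvd_records: iterable of (cve_id, nvd_published) pairs.
    Returns a dict cve_id -> decision_time, where the decision time is the
    NVD publication date. Exploit labels are never consulted.
    """
    return {
        cve_id: published
        for cve_id, published in nvd_records
        if window_start <= published < window_end
    }


records = [
    ("CVE-0000-0001", datetime(2025, 12, 31)),
    ("CVE-0000-0002", datetime(2026, 1, 1)),
    ("CVE-0000-0003", datetime(2026, 2, 15)),
    ("CVE-0000-0004", datetime(2026, 4, 1)),
]
cohort = build_cohort(records, datetime(2026, 1, 1), datetime(2026, 4, 1))
```

Because the function takes no label argument, inclusion cannot depend on later exploit status, which is the property the response commits to.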

Circularity Check

0 steps flagged

No significant circularity in evaluation protocol or claims

The paper reports empirical recall@50 results on a fixed set of 12012 CVEs using a leakage-safe temporal evidence graph protocol. No equations, fitted parameters, self-citations, or ansatzes are described that would make the reported lift (0.010 to 0.026) or the 8.5x inflation factor reduce to the inputs by construction. The evaluation is presented as an independent measurement against baselines and naive splits, with the central contribution being the protocol itself rather than a derived quantity forced by prior steps. The leakage assumption is an external validity concern, not a definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach appears to rest on standard graph construction and selection assumptions whose details are not visible.

pith-pipeline@v0.9.1-grok · 5742 in / 1233 out tokens · 29596 ms · 2026-06-26T20:27:32.397603+00:00 · methodology
