pith. machine review for the scientific record.

arxiv: 2605.09461 · v2 · submitted 2026-05-10 · 💻 cs.AI

Recognition: 1 theorem link · Lean Theorem

VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 07:26 UTC · model grok-4.3

classification 💻 cs.AI
keywords vulnerability detection · LLM · context augmentation · software security · CWE · code semantics · AST/CFG/DFG

The pith

Triple context paths improve LLM vulnerability detection

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VulTriage to help large language models find security flaws in source code by adding three different kinds of background information to the model's input. One path describes the code's control and data dependencies using its structure graphs. Another retrieves examples of similar known vulnerabilities. The third summarizes the code's intended function. The authors claim this leads to better detection of subtle vulnerabilities than using raw code or other methods.
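
The three paths converge on a single instruction prompt. The paper's actual templates live in its Appendix F; the sketch below only illustrates the fusion idea, and the section labels, wording, and `build_prompt` helper are all hypothetical.

```python
# Minimal sketch of VulTriage-style prompt fusion. Section labels and
# instruction wording are illustrative assumptions, not the paper's templates.
def build_prompt(code: str, control_ctx: str, knowledge_ctx: str, semantic_ctx: str) -> str:
    """Fuse the three context paths and the raw code into one instruction."""
    return "\n\n".join([
        "You are a security auditor. Decide whether the function below is vulnerable.",
        "[Control Path] Structural dependencies (AST/CFG/DFG):\n" + control_ctx,
        "[Knowledge Path] Retrieved CWE patterns and examples:\n" + knowledge_ctx,
        "[Semantic Path] Functional behavior summary:\n" + semantic_ctx,
        "[Code]\n" + code,
        "Answer 'vulnerable' or 'benign' with a brief justification.",
    ])

prompt = build_prompt(
    code="void copy_bytes(char *dst, char *src, int len) { ... }",
    control_ctx="Data chain: param:len -> if(len > max) -> call memcpy",
    knowledge_ctx="CWE-787: out-of-bounds write; similar retrieved example: ...",
    semantic_ctx="Copies len bytes from src to dst after a length check.",
)
```

The design point is that the LLM sees all three contexts in one pass, so its judgment can cross-reference structure, retrieved knowledge, and intent.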

Core claim

VulTriage augments LLM input for vulnerability detection with three paths: Control Path verbalizing AST, CFG and DFG for dependencies; Knowledge Path retrieving CWE patterns via hybrid dense-sparse retrieval; Semantic Path summarizing functional behavior. Integrated into one instruction, this produces state-of-the-art performance on the PrimeVul pair test set, beating deep learning and LLM baselines on pair-wise and classification metrics.
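
The Knowledge Path's "hybrid dense-sparse retrieval" is a standard score-fusion pattern; since the paper's exact retriever is not reproduced here, the toy sketch below makes illustrative assumptions (the fusion weight `alpha`, a Jaccard stand-in for BM25-style sparse scoring, and hand-made 2-d embeddings).

```python
# Toy sketch of hybrid dense-sparse retrieval score fusion. The weight alpha,
# the sparse scorer, and the embeddings are assumptions, not the paper's setup.
import math

def dense_score(q, d):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def sparse_score(q_tokens, d_tokens):
    """Lexical overlap (Jaccard), a stand-in for BM25-style sparse scoring."""
    q, d = set(q_tokens), set(d_tokens)
    return len(q & d) / len(q | d) if q | d else 0.0

def hybrid_rank(query_vec, query_tokens, corpus, alpha=0.5, top_k=2):
    """Rank corpus entries by alpha * dense + (1 - alpha) * sparse."""
    scored = [
        (alpha * dense_score(query_vec, vec)
         + (1 - alpha) * sparse_score(query_tokens, toks), name)
        for name, vec, toks in corpus
    ]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]

corpus = [
    ("CWE-787 out-of-bounds write", [0.9, 0.1], ["memcpy", "len", "bounds"]),
    ("CWE-476 null dereference",    [0.1, 0.9], ["pointer", "null"]),
    ("CWE-190 integer overflow",    [0.7, 0.3], ["len", "overflow"]),
]
top = hybrid_rank([0.8, 0.2], ["memcpy", "len"], corpus)
# top -> ["CWE-787 out-of-bounds write", "CWE-190 integer overflow"]
```

Blending both signals lets lexically exact matches (API names like memcpy) and semantically similar but differently worded patterns both surface.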

What carries the argument

The triple-path context augmentation: a Control Path for structural information, a Knowledge Path for CWE retrieval, and a Semantic Path for behavior summaries.

If this is right

  • Outperforms deep learning and LLM baselines on key metrics for vulnerability detection.
  • Ablation studies confirm each of the three paths adds value.
  • Demonstrates generalization on the Kotlin dataset under low-resource and imbalanced conditions.
  • Helps reduce missed vulnerabilities and false alarms from subtle code differences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar augmentation could help LLMs on related tasks such as bug finding or code review.
  • The approach may need adjustments if the underlying LLM changes or for very large codebases.
  • Integration into IDEs could provide real-time vulnerability feedback during coding.
  • Retrieval quality for rare vulnerabilities remains a potential limit to effectiveness.

Load-bearing premise

The three context paths supply complementary information, the LLM integrates it without introducing noise or hallucinated findings, and the retrieval step surfaces relevant patterns.

What would settle it

A dataset of code examples in which the three paths give inconsistent signals about vulnerability: if the model misclassifies these more often than baselines do, the complementarity claim fails.

Figures

Figures reproduced from arXiv: 2605.09461 by Jingyu Xiao, Jinlong Yang, Junliang Liu, Lei Wang, Peng Xiangli, Qing Li, Wang Luo, Wenxin Tang, Xiang Zhang, Xi Xiao, Yuehe Ma, Zhengheng Li, Zhenyu Liu, Zicheng Wang.

Figure 1
Figure 1: The framework of VulTriage.
Figure 2
Figure 2: Attention score analysis of different input blocks. E., M., and L. denote Early, Middle, and Late.
Figure 3
Figure 3: Running-example source code of the copy_bytes function, alongside Tast, the AST fragment produced by τ (Function copy_bytes@L1: 2 declarations, 1 assignment, 1 branch, 1 call; key call chain: memcpy; conditions/loops: if(len > max)).
Figure 5
Figure 5: CFG fragment Tcfg produced by τ on the running example, alongside Tdfg, the DFG fragment (edges retained 3/4; 2 parameter sources: param:len@L1, param:src@L1; 1 chain; data chain: param:len → if(len > max) → call memcpy).
Figure 6
Figure 6: DFG fragment Tdfg produced by τ on the running example.
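
The verbalized data chain in Figure 5 (param:len → if(len > max) → call memcpy) is essentially a walk over data-flow edges rendered as text. The paper does not specify its verbalizer τ here, so the edge encoding and `verbalize_chain` helper below are illustrative assumptions:

```python
# Illustrative sketch of verbalizing a DFG fragment into a textual data chain,
# matching the style of Figure 5's output. The edge encoding is an assumption.
def verbalize_chain(edges, start):
    """Follow data-flow edges from a parameter source and join them as text."""
    succ = dict(edges)  # each node has at most one outgoing edge in this toy
    chain, node = [start], start
    while node in succ:
        node = succ[node]
        chain.append(node)
    return " -> ".join(chain)

dfg_edges = [
    ("param:len@L1", "if(len > max)"),
    ("if(len > max)", "call memcpy"),
]
chain = verbalize_chain(dfg_edges, "param:len@L1")
# chain -> "param:len@L1 -> if(len > max) -> call memcpy"
```
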
read the original abstract

Automated vulnerability detection is a fundamental task in software security, yet existing learning-based methods still struggle to capture the structural dependencies, domain-specific vulnerability knowledge, and complex program semantics required for accurate detection. Recent Large Language Models (LLMs) have shown strong code understanding ability, but directly prompting them with raw source code often leads to missed vulnerabilities or false alarms, especially when vulnerable and benign functions differ only in subtle semantic details. To address this, we propose VulTriage, a triple-path context augmentation framework for LLM-based vulnerability detection. VulTriage enhances the LLM input through three complementary paths: a Control Path that extracts and verbalizes AST, CFG, and DFG information to expose control and data dependencies; a Knowledge Path that retrieves relevant CWE-derived vulnerability patterns and examples through hybrid dense-sparse retrieval; and a Semantic Path that summarizes the functional behavior of the code before the final judgment. These contexts are integrated into a unified instruction to guide the LLM toward more reliable vulnerability reasoning. Experiments on the PrimeVul pair test set show that VulTriage achieves state-of-the-art performance, outperforming existing deep learning and LLM-based baselines on key pair-wise and classification metrics. Further ablation studies verify the effectiveness of each path, and additional experiments on the Kotlin dataset demonstrate the generalization ability of VulTriage under low-resource and class-imbalanced settings. Our code is available at https://github.com/vinsontang1/VulTriage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes VulTriage, a triple-path context augmentation framework for LLM-based vulnerability detection. It augments LLM inputs via a Control Path (verbalizing AST/CFG/DFG dependencies), a Knowledge Path (hybrid dense-sparse retrieval of CWE patterns and examples), and a Semantic Path (functional behavior summary), which are fused into a unified instruction prompt. The central claim is that this yields SOTA results on the PrimeVul pair test set, outperforming DL and LLM baselines on pair-wise and classification metrics, with ablations confirming each path's value and further tests showing generalization to Kotlin under low-resource conditions.

Significance. If the results hold, the work provides a concrete, reproducible method for injecting structural, knowledge-base, and semantic context into LLMs for security tasks, addressing a known weakness of raw-code prompting. The open-source release at the cited GitHub link is a positive factor for verification and extension.

major comments (2)
  1. [§4.2, Table 3] §4.2 (Main Results) and Table 3: The SOTA claim on PrimeVul pair-wise metrics is presented without retrieval-quality metrics (precision@K or fraction of relevant CWEs returned) for the Knowledge Path on the test samples. This is load-bearing for the central claim because the skeptic concern is that observed gains could stem from generic prompt expansion rather than accurate CWE fusion; without these numbers the complementarity of the three paths cannot be confirmed.
  2. [§4.3] §4.3 (Ablation Studies): The reported performance drops when ablating each path are given, yet the paper does not include qualitative retrieval examples or error analysis showing that the Knowledge Path actually surfaces relevant patterns on PrimeVul functions where the baseline LLM fails. This leaves open the possibility that gains are driven by the Control or Semantic paths alone.
minor comments (3)
  1. [§3.4] The prompt template in §3.4 uses inconsistent notation for the three context blocks; a single figure or boxed example would improve clarity.
  2. [Figure 1] Figure 1 (overall architecture) lacks labels on the retrieval component and does not indicate the exact top-K value used in experiments.
  3. [§2] A few citations to prior CWE-retrieval work (e.g., the hybrid retriever baseline) are missing from the related-work section.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of evidence presentation that we will address through targeted revisions to strengthen the support for our claims regarding the Knowledge Path's contribution.

read point-by-point responses
  1. Referee: [§4.2, Table 3] §4.2 (Main Results) and Table 3: The SOTA claim on PrimeVul pair-wise metrics is presented without retrieval-quality metrics (precision@K or fraction of relevant CWEs returned) for the Knowledge Path on the test samples. This is load-bearing for the central claim because the skeptic concern is that observed gains could stem from generic prompt expansion rather than accurate CWE fusion; without these numbers the complementarity of the three paths cannot be confirmed.

    Authors: We agree that retrieval-quality metrics would provide stronger confirmation that gains arise from accurate CWE fusion rather than generic expansion. In the revised manuscript we will add precision@K, recall@K, and the fraction of relevant CWEs retrieved, computed on the PrimeVul test set using the same hybrid dense-sparse setup described in §3.2. These numbers will be reported alongside the main results in an updated Table 3 and discussed in §4.2 to directly address the complementarity concern. revision: yes

  2. Referee: [§4.3] §4.3 (Ablation Studies): The reported performance drops when ablating each path are given, yet the paper does not include qualitative retrieval examples or error analysis showing that the Knowledge Path actually surfaces relevant patterns on PrimeVul functions where the baseline LLM fails. This leaves open the possibility that gains are driven by the Control or Semantic paths alone.

    Authors: We acknowledge that qualitative evidence would further substantiate the Knowledge Path's role. The revised version will add a dedicated qualitative analysis subsection (new §4.3.1) containing (i) concrete examples of retrieved CWE patterns and examples for PrimeVul functions where the baseline LLM fails but the full model succeeds, and (ii) a short error analysis of retrieval relevance on those cases. This material will be drawn from the existing retrieval logs and will be included without requiring new experiments. revision: yes
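
The retrieval-quality metrics the rebuttal promises (precision@K and recall@K) have standard definitions; a minimal sketch, with illustrative inputs that do not reproduce the paper's actual retrieval logs:

```python
# Standard precision@K / recall@K over a ranked retrieval list.
# The retrieved/relevant sets below are illustrative only.
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top = retrieved[:k]
    return sum(1 for item in top if item in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    top = retrieved[:k]
    return sum(1 for item in relevant if item in top) / len(relevant)

retrieved = ["CWE-787", "CWE-190", "CWE-476", "CWE-125"]
relevant = {"CWE-787", "CWE-125"}
p = precision_at_k(retrieved, relevant, 2)   # 1 of top-2 relevant -> 0.5
r = recall_at_k(retrieved, relevant, 2)      # 1 of 2 relevant found -> 0.5
```

Reporting both at the experiment's actual top-K would directly test whether gains come from accurate CWE fusion rather than generic prompt expansion.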

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks with independent ablations

full rationale

The paper describes a triple-path augmentation framework (Control via AST/CFG/DFG verbalization, Knowledge via hybrid CWE retrieval, Semantic summarization) whose central claims are supported by experiments on the external PrimeVul pair test set and a separate Kotlin dataset. No equations, parameter-fitting loops, or self-referential definitions appear in the provided text. Ablation studies are presented as verification of each path's contribution, and code release allows external reproduction. Performance is reported as outperforming baselines rather than being forced by construction from inputs. No self-citation load-bearing steps or uniqueness theorems are invoked. This is a standard empirical ML paper whose derivation chain does not reduce to its own fitted values or definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs can reliably reason over the provided verbalized contexts without additional training; no free parameters are explicitly named in the abstract, but prompting choices and retrieval hyperparameters are implicit.

axioms (1)
  • domain assumption LLMs possess sufficient code understanding to benefit from explicit control-flow and knowledge augmentation without hallucinating new vulnerabilities.
    Invoked in the motivation and method description; no formal justification supplied in the abstract.

pith-pipeline@v0.9.0 · 5600 in / 1262 out tokens · 47039 ms · 2026-05-13T07:26:33.140350+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 2 internal anchors

  1. [1] Vulnerability Detection with Code Language Models: How Far Are We? arXiv:2403.18624. https://doi.org/10.48550/ARXIV.2403.18624
  2. [2] Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. Advances in Neural Information Processing Systems.
  3. [3] Deep Learning Based Vulnerability Detection: Are We There Yet? IEEE Transactions on Software Engineering, 2021.
  4. [4] CodeBERT: A Pre-trained Model for Programming and Natural Languages. Findings of the Association for Computational Linguistics: EMNLP 2020.
  5. [5] Data Quality for Software Vulnerability Datasets. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE).
  6. [6] LineVul: A Transformer-based Line-level Vulnerability Prediction. Proceedings of the 19th International Conference on Mining Software Repositories.
  7. [7] SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis.
  8. [8] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems.
  9. [9] Why Think Step by Step? Reasoning Emerges from the Locality of Experience. Advances in Neural Information Processing Systems.
  10. [10] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems.
  11. [11] BGE M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216.
  12. [12] Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection Using LLM-Based Agents. Proceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE).
  13. [13] CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  14. [14] UniXcoder: Unified Cross-Modal Pre-training for Code Representation. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  15. [15] Large Language Model-Powered Smart Contract Vulnerability Detection: New Perspectives. 2023 5th IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA).
  16. [16] A Sequential Multi-Stage Approach for Code Vulnerability Detection via Confidence- and Collaboration-based Decision Making. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  17. [17] Qwen3 Technical Report. arXiv:2505.09388.
  18. [18] Leveraging User-Defined Identifiers for Counterfactual Data Generation in Source Code Vulnerability Detection. 2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation (SCAM).
  19. [19] When Less Is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE).
  20. [20] DFEPT: Data Flow Embedding for Enhancing Pre-trained Model Based Vulnerability Detection. Proceedings of the 15th Asia-Pacific Symposium on Internetware.
  21. [21] Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering.
  22. [22] MatSVD: Boosting Statement-Level Vulnerability Detection via Dependency-Based Attention. Proceedings of the 15th Asia-Pacific Symposium on Internetware.
  23. [23] Vulnerability Detection by Learning from Syntax-Based Execution Paths of Code. IEEE Transactions on Software Engineering, 2023.
  24. [24] PTLVD: Program Slicing and Transformer-Based Line-Level Vulnerability Detection System. 2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation (SCAM).
  25. [25] Combining Structured Static Code Information and Dynamic Symbolic Traces for Software Vulnerability Prediction. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering.
  26. [26] SlideCoder: Layout-Aware RAG-Enhanced Hierarchical Slide Generation from Design. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  27. [27] EfficientPosterGen: Semantic-Aware Efficient Poster Generation via Token Compression and Accurate Violation Detection. arXiv:2603.00155.
  28. [28] Interaction2Code: Benchmarking MLLM-Based Interactive Webpage Code Generation from Interactive Prototyping. 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE).
  29. [29] Benchmarking MLLM-Based Web Understanding: Reasoning, Robustness and Safety. arXiv:2509.21782.
  30. [30] DesignBench: A Comprehensive Benchmark for MLLM-Based Front-End Code Generation. arXiv:2506.06251.
  31. [31] Rethinking Multimodal Point Cloud Completion: A Completion-by-Correction Perspective. Proceedings of the AAAI Conference on Artificial Intelligence.
  32. [32] QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-Natural Query Language. Findings of the Association for Computational Linguistics: ACL 2025.
  33. [33] ChatGPT for Vulnerability Detection, Classification, and Repair: How Far Are We? 2023 30th Asia-Pacific Software Engineering Conference (APSEC).
  34. [34] Bendersky, Eli.
  35. [35] Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT. Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering.
  36. [36] Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika, 1947.