pith · machine review for the scientific record

arxiv: 2605.03697 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.AI


Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts


Pith reviewed 2026-05-07 15:36 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords smart contracts · vulnerability detection · large language models · blockchain security · prompt engineering · abstract syntax tree · security analysis

The pith

An LLM framework using AST context and tailored prompts detects 13 smart contract vulnerability types at 0.92 positive recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops an LLM-based method for finding security flaws in smart contracts on blockchains. It builds and releases a dataset of 31,165 professionally annotated vulnerability instances from over 3,200 real projects across 15 platforms. The method uses abstract syntax tree analysis to extract relevant code context and designs specific prompts for each of 13 common vulnerability categories to create customized detectors. Experiments show average positive recall of 0.92 and negative recall of 0.85, offering a flexible alternative to manual rule-based approaches for a domain where exploits can cause irreversible financial damage.

Core claim

By leveraging precise AST-based context extraction and vulnerability-specific prompt design, customized LLM detectors can be instantiated for 13 prevalent smart contract vulnerability categories, achieving an average positive recall of 0.92 and an average negative recall of 0.85 on a dataset of 31,165 annotated instances from over 3,200 real-world projects.
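The two headline numbers can be made concrete. A minimal sketch, assuming the standard confusion-matrix definitions of positive recall (sensitivity) and negative recall (specificity) and a macro-average over categories; the counts below are invented for illustration, not taken from the paper:

```python
def recalls(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Positive recall (fraction of vulnerable instances flagged) and
    negative recall (fraction of clean instances passed), under the
    standard confusion-matrix definitions; the paper's exact definitions
    are not quoted here and are assumed to match."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts for two of the 13 category-specific detectors.
per_category = [recalls(92, 8, 85, 15), recalls(88, 12, 83, 17)]
avg_positive = sum(p for p, _ in per_category) / len(per_category)
avg_negative = sum(n for _, n in per_category) / len(per_category)
```

Whether the paper macro-averages over categories or micro-averages over instances is one of the details the referee asks to have pinned down.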

What carries the argument

Vulnerability-specific prompt design combined with AST-based context extraction, which supplies the LLM with targeted code snippets and instructions for each vulnerability category.
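To illustrate the kind of context extraction involved (not the paper's actual implementation, which works on a full Solidity AST), here is a crude regex-based stand-in that pulls out functions referencing `block.timestamp`, the pattern the paper associates with timestamp-dependence vulnerabilities:

```python
import re

# Toy contract for illustration; only pickWinner touches block.timestamp.
SOURCE = """
contract Lottery {
    function pickWinner() public {
        uint seed = block.timestamp;
    }
    function deposit() public payable {}
}
"""

def timestamp_dependent_functions(source: str) -> list[str]:
    """Return the source of functions that reference block.timestamp.
    A regex stand-in for the paper's AST traversal: it ignores nested
    braces and would mis-match on real contracts, but shows the idea of
    handing the LLM only the relevant snippet."""
    pattern = re.compile(r"function\s+\w+[^{]*\{[^}]*\}", re.DOTALL)
    return [m.group(0) for m in pattern.finditer(source)
            if "block.timestamp" in m.group(0)]

snippets = timestamp_dependent_functions(SOURCE)
```

The paper's fine-grained extraction reportedly also gathers call-stack functions and associated context, which a flat regex cannot do; that is exactly why AST-level analysis carries the argument.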

Load-bearing premise

The 31,165 professionally annotated instances accurately represent real-world vulnerabilities without labeling errors or bias, and LLM outputs remain reliable on unseen contracts without high rates of missed issues or false alarms.

What would settle it

Evaluating the detectors on a fresh collection of smart contracts containing documented vulnerabilities from recent exploits and measuring whether positive recall stays near 0.92 and negative recall near 0.85.

Figures

Figures reproduced from arXiv: 2605.03697 by Anbang Ruan, Keyu Zhang, Taohong Zhu, Xing Zhang.

Figure 1: Overview of the proposed LLM-based smart contract vulnerability detection framework.
Figure 2: AST information extraction overview.
Figure 3: Prompt template design.
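The per-category template design shown in Figure 3 can be sketched as a string template. The wording below is invented for illustration; the paper's actual prompts are not reproduced here:

```python
# Hypothetical vulnerability-specific prompt templates; only the
# mechanism (one tailored template per category) mirrors the paper.
TEMPLATES = {
    "reentrancy": (
        "You are a smart contract auditor. The function below makes an "
        "external call; decide whether state is updated only after the "
        "call, enabling reentrancy. Answer VULNERABLE or SAFE.\n\n{code}"
    ),
    "timestamp-dependence": (
        "You are a smart contract auditor. The function below reads "
        "block.timestamp; decide whether it drives a critical "
        "control-flow decision. Answer VULNERABLE or SAFE.\n\n{code}"
    ),
}

def build_prompt(category: str, code_snippet: str) -> str:
    """Instantiate the tailored prompt for one vulnerability category,
    splicing in the AST-extracted snippet."""
    return TEMPLATES[category].format(code=code_snippet)
```

Each of the 13 detectors would then be the pair (extraction rule, template) applied to a shared base model.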
Original abstract

Smart contracts on blockchains are prone to diverse security vulnerabilities that can lead to significant financial losses due to their immutable nature. Existing detection approaches often lack flexibility across vulnerability types and rely heavily on manually crafted expert rules. In this paper, we present an LLM-based framework for practical smart contract vulnerability detection. We construct and release a large-scale dataset comprising 31,165 professionally annotated vulnerability instances collected from over 3,200 real-world projects across 15 major blockchain platforms. Our approach leverages precise AST-based context extraction and vulnerability-specific prompt design to instantiate customized detectors for 13 prevalent vulnerability categories. Experimental results demonstrate strong effectiveness, achieving an average positive recall of 0.92 and an average negative recall of 0.85, highlighting the potential of carefully engineered contextual prompting for scalable and high-precision smart contract security analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces an LLM-based framework for smart contract vulnerability detection that uses AST-based context extraction and vulnerability-specific prompt design to create customized detectors for 13 prevalent vulnerability categories. It constructs and releases a dataset of 31,165 professionally annotated instances drawn from over 3,200 real-world projects across 15 blockchain platforms. The central empirical claim is that the approach achieves strong performance, with an average positive recall of 0.92 and average negative recall of 0.85.

Significance. If the reported recalls prove robust under proper held-out evaluation, the work would be significant for offering a flexible, prompt-engineered alternative to rigid rule-based or static-analysis tools in smart-contract security. A clear strength is the construction and public release of a large-scale, multi-platform annotated dataset, which can serve as a reusable benchmark and addresses a common data scarcity issue in the field. The vulnerability-specific prompting strategy also illustrates a practical way to adapt general LLMs to domain-specific detection tasks without full fine-tuning.

major comments (3)
  1. The Experimental Evaluation section (and abstract) reports average positive recall of 0.92 and negative recall of 0.85 but provides no information on the train/test split of the 31,165 instances, whether prompt engineering and template selection were performed exclusively on training data, or results on contracts from entirely held-out projects/platforms. This is load-bearing for the central effectiveness claim, because without explicit separation the metrics could reflect prompt overfitting or label leakage rather than generalization.
  2. The dataset construction description lacks any account of the professional annotation protocol, including inter-annotator agreement statistics, expert review process, or steps taken to mitigate labeling errors and bias. Because the recalls are computed against these labels, the absence of validation details directly affects the reliability of the headline numbers and the claim that the instances “accurately represent real-world vulnerabilities.”
  3. No baseline comparisons (e.g., to established static analyzers such as Slither or Mythril, or to prior LLM-based detectors) are presented alongside the internal recall figures. Without such comparisons it is impossible to determine whether the tailored-prompt approach advances beyond existing methods or simply reproduces known performance on the same data.
minor comments (2)
  1. The abstract and introduction would benefit from a brief statement of the exact 13 vulnerability categories and the precise definitions of positive/negative recall used in the averages.
  2. The paper should include a limitations paragraph discussing potential LLM-specific issues such as hallucination rates on unseen contract patterns and inference cost at scale.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: The Experimental Evaluation section (and abstract) reports average positive recall of 0.92 and negative recall of 0.85 but provides no information on the train/test split of the 31,165 instances, whether prompt engineering and template selection were performed exclusively on training data, or results on contracts from entirely held-out projects/platforms. This is load-bearing for the central effectiveness claim, because without explicit separation the metrics could reflect prompt overfitting or label leakage rather than generalization.

    Authors: We agree that the absence of these details is a material shortcoming that weakens the central claim. The current manuscript does not describe the split or confirm that prompt engineering was restricted to training data. We will revise the Experimental Evaluation section to document the partitioning procedure (including any project- or platform-level separation), state that all prompt design occurred on training data only, and add results on contracts drawn from entirely held-out projects and platforms. revision: yes

  2. Referee: The dataset construction description lacks any account of the professional annotation protocol, including inter-annotator agreement statistics, expert review process, or steps taken to mitigate labeling errors and bias. Because the recalls are computed against these labels, the absence of validation details directly affects the reliability of the headline numbers and the claim that the instances “accurately represent real-world vulnerabilities.”

    Authors: We concur that a full account of the annotation protocol is required to support the reliability of the reported metrics. The manuscript currently provides only a high-level statement that the instances are “professionally annotated.” We will add a dedicated subsection describing the annotation guidelines, the number and qualifications of annotators, inter-annotator agreement statistics, the multi-stage expert review process, and the specific measures used to reduce labeling errors and bias. revision: yes

  3. Referee: No baseline comparisons (e.g., to established static analyzers such as Slither or Mythril, or to prior LLM-based detectors) are presented alongside the internal recall figures. Without such comparisons it is impossible to determine whether the tailored-prompt approach advances beyond existing methods or simply reproduces known performance on the same data.

    Authors: We agree that direct comparisons to established tools are necessary to situate the contribution. The manuscript presents only the internal recall figures of the proposed framework. We will revise the Experimental Evaluation section to include side-by-side results against Slither, Mythril, and representative prior LLM-based detectors on the same 31,165-instance dataset and the same 13 vulnerability categories. revision: yes
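Such a side-by-side comparison could be tabulated from per-tool detection outcomes. A minimal sketch: the baseline tool names are the ones named in the report, but every count below is invented for illustration, not a measured result:

```python
# Hypothetical (true positives, false negatives) per category, per tool,
# on labeled vulnerable instances. All numbers are made up.
outcomes = {
    "tailored-prompt LLM": {"reentrancy": (46, 4), "timestamp": (45, 5)},
    "Slither":             {"reentrancy": (40, 10), "timestamp": (30, 20)},
    "Mythril":             {"reentrancy": (38, 12), "timestamp": (28, 22)},
}

def macro_positive_recall(per_category: dict[str, tuple[int, int]]) -> float:
    """Macro-average positive recall over vulnerability categories."""
    rs = [tp / (tp + fn) for tp, fn in per_category.values()]
    return sum(rs) / len(rs)

table = {tool: macro_positive_recall(cats) for tool, cats in outcomes.items()}
for tool, r in sorted(table.items(), key=lambda kv: -kv[1]):
    print(f"{tool:>20}: macro positive recall = {r:.2f}")
```

A real comparison would also need negative recall, since static analyzers and LLM detectors tend to trade false alarms against misses differently.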

Circularity Check

0 steps flagged

No circularity; empirical results on annotated dataset with no self-referential derivations

Full rationale

The paper presents an empirical framework: it constructs a dataset of 31,165 annotated instances from real-world projects, applies AST-based context extraction, designs vulnerability-specific prompts for 13 categories, and reports experimental recalls (0.92 positive, 0.85 negative). No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citations that bear the central load are present in the provided text. The claims rest on dataset construction and prompt engineering evaluated via standard metrics rather than reducing by definition or construction to the inputs themselves. This is a standard applied ML paper structure with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the quality and representativeness of the custom dataset and the effectiveness of manually designed prompts, which are not independently verified beyond the reported recalls in the abstract.

free parameters (1)
  • Vulnerability-specific prompt templates
    The exact wording and design of the 13 tailored prompts are engineered choices that directly influence detector performance but are not detailed or shown to be derived from first principles.
axioms (1)
  • domain assumption The professionally annotated dataset of 31,165 instances accurately captures real-world smart contract vulnerabilities without significant errors or selection bias.
    Performance metrics depend entirely on the correctness of these labels collected from over 3,200 projects.

pith-pipeline@v0.9.0 · 5441 in / 1402 out tokens · 64239 ms · 2026-05-07T15:36:51.297157+00:00 · methodology


Reference graph

Works this paper leans on

35 extracted references · 9 canonical work pages · 5 internal anchors

  1. M. Bartoletti and L. Pompianu, “An empirical analysis of smart contracts: platforms, applications, and design patterns,” in Financial Cryptography and Data Security: FC 2017 International Workshops, WAHC, BITCOIN, VOTING, WTSC, and TA, Sliema, Malta, April 7, 2017, Revised Selected Papers 21, pp. 494–509, Springer, 2017.
  2. A. Vacca, A. Di Sorbo, C. A. Visaggio, and G. Canfora, “A systematic literature review of blockchain and smart contract development: Techniques, tools, and open challenges,” Journal of Systems and Software, vol. 174, p. 110891, 2021.
  3. G. Destefanis, M. Marchesi, M. Ortu, R. Tonelli, A. Bracciali, and R. Hierons, “Smart contracts vulnerabilities: a call for blockchain software engineering?,” in 2018 International Workshop on Blockchain Oriented Software Engineering (IWBOSE), pp. 19–25, IEEE, 2018.
  4. L. Zhou, X. Xiong, J. Ernstberger, S. Chaliasos, Z. Wang, Y. Wang, K. Qin, R. Wattenhofer, D. Song, and A. Gervais, “SoK: Decentralized finance (DeFi) attacks,” in 2023 IEEE Symposium on Security and Privacy (SP), pp. 2444–2461, IEEE, 2023.
  5. A. Singh, R. M. Parizi, Q. Zhang, K.-K. R. Choo, and A. Dehghantanha, “Blockchain smart contracts formalization: Approaches and challenges to address vulnerabilities,” Computers & Security, vol. 88, p. 101654, 2020.
  6. Z. Zhang, B. Zhang, W. Xu, and Z. Lin, “Demystifying exploitable bugs in smart contracts,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 615–627, IEEE, 2023.
  7. J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  8. J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, “A survey on large language models for code generation,” arXiv preprint arXiv:2406.00515, 2024.
  9. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  10. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
  11. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
  12. S. Wang, Y. Yuan, X. Wang, J. Li, R. Qin, and F.-Y. Wang, “An overview of smart contract: architecture, applications, and future trends,” in 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 108–113, IEEE, 2018.
  13. P. Qian, Z. Liu, Q. He, B. Huang, D. Tian, and X. Wang, “Smart contract vulnerability detection technique: A survey,” arXiv preprint arXiv:2209.05872, 2022.
  14. I. Grishchenko, M. Maffei, and C. Schneidewind, “A semantic framework for the security analysis of Ethereum smart contracts,” in International Conference on Principles of Security and Trust, pp. 243–269, Springer, 2018.
  15. E. Hildenbrandt, M. Saxena, N. Rodrigues, X. Zhu, P. Daian, D. Guth, B. Moore, D. Park, Y. Zhang, A. Stefanescu, et al., “KEVM: A complete formal semantics of the Ethereum virtual machine,” in 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 204–217, IEEE, 2018.
  16. L. Luu, D.-H. Chu, H. Olickel, P. Saxena, and A. Hobor, “Making smart contracts smarter,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 254–269, 2016.
  17. B. Mueller, “A framework for bug hunting on the Ethereum blockchain,” ConsenSys/mythril, 2017.
  18. B. Jiang, Y. Liu, and W. K. Chan, “ContractFuzzer: Fuzzing smart contracts for vulnerability detection,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 259–269, 2018.
  19. C. Liu, H. Liu, Z. Cao, Z. Chen, B. Chen, and B. Roscoe, “ReGuard: finding reentrancy bugs in smart contracts,” in Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings, pp. 65–68, 2018.
  20. J. Feist, G. Grieco, and A. Groce, “Slither: a static analysis framework for smart contracts,” in 2019 IEEE/ACM 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB), pp. 8–15, IEEE, 2019.
  21. L. Brent, A. Jurisevic, M. Kong, E. Liu, F. Gauthier, V. Gramoli, R. Holz, and B. Scholz, “Vandal: A scalable security analysis framework for smart contracts,” arXiv preprint arXiv:1809.03981, 2018.
  22. W. J.-W. Tann, X. J. Han, S. S. Gupta, and Y.-S. Ong, “Towards safer smart contracts: A sequence learning approach to detecting security threats,” arXiv preprint arXiv:1811.06632, 2018.
  23. Y. Zhuang, Z. Liu, P. Qian, Q. Liu, X. Wang, and Q. He, “Smart contract vulnerability detection using graph neural networks,” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 3283–3290, 2021.
  24. Y. Sun, D. Wu, Y. Xue, H. Liu, H. Wang, Z. Xu, X. Xie, and Y. Liu, “GPTScan: Detecting logic vulnerabilities in smart contracts by combining GPT with program analysis,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–13, 2024.
  25. M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
  26. E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al., “ChatGPT for good? On opportunities and challenges of large language models for education,” Learning and Individual Differences, vol. 103, p. 102274, 2023.
  27. W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
  28. P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha, “A systematic survey of prompt engineering in large language models: Techniques and applications,” arXiv preprint arXiv:2402.07927, 2024.
  29. H. Ding, Y. Liu, X. Piao, H. Song, and Z. Ji, “SmartGuard: An LLM-enhanced framework for smart contract vulnerability detection,” Expert Systems with Applications, vol. 269, p. 126479, 2025.
  30. O. Zaazaa and H. El Bakkali, “SmartLLMSentry: A comprehensive LLM-based smart contract vulnerability detection framework,” Journal of Metaverse, vol. 4, no. 2, pp. 126–137, 2024.
  31. B. Boi, C. Esposito, and S. Lee, “VulnHunt-GPT: a smart contract vulnerabilities detector based on OpenAI ChatGPT,” in Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, pp. 1517–1524, 2024.
  32. Z. Yang, G. Man, and S. Yue, “Automated smart contract vulnerability detection using fine-tuned large language models,” in Proceedings of the 2023 6th International Conference on Blockchain Technology and Applications, pp. 19–23, 2023.
  33. M. S. Bouafif, C. Zheng, I. A. Qasse, E. Zulkoski, M. Hamdaqa, and F. Khomh, “A context-driven approach for co-auditing smart contracts with the support of GPT-4 code interpreter,” arXiv preprint arXiv:2406.18075, 2024.
  34. Anonymous, “Vulnerability detector.” Available at (URL removed for double-blind review), 2026. Accessed: 2026-01-13.
  35. ConsenSys Diligence, “python-solidity-parser.” Available at github.com/ConsenSysDiligence/python-solidity-parser.