pith. machine review for the scientific record.

arxiv: 2604.15375 · v1 · submitted 2026-04-15 · 💻 cs.AR · cs.AI · cs.CR

Recognition: unknown

VeriCWEty: Embedding-Enabled Line-Level CWE Detection in Verilog

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 11:40 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.CR
keywords verilog · cwe detection · line-level localization · embeddings · rtl security · hardware vulnerabilities · bug detection · semantic analysis

The pith

An embedding-based framework detects and localizes common vulnerabilities in Verilog code at the line level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that embeddings derived from Verilog RTL code can detect and classify weaknesses at both module and line granularity. This addresses the shortfalls of rule-based checks and formal methods, which often miss semantic issues or provide only coarse locations. The approach targets vulnerabilities that appear in code generated by large language models. Reported performance reaches 89 percent precision on CWEs such as CWE-1244 and CWE-1245 together with 96 percent accuracy for line-level detection.

Core claim

The central claim is that an embedding-based bug-detection framework detects and classifies bugs in Verilog code at module and line-level granularity. It achieves about 89 percent precision in identifying common CWEs such as CWE-1244 and CWE-1245, and 96 percent accuracy in detecting line-level bugs.

What carries the argument

The embedding-based bug-detection framework that turns Verilog code into vectors to capture semantic vulnerabilities and enable line-level localization.
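The abstract names neither the encoder nor the classifier, so the shape of this machinery has to be inferred. As a minimal sketch under stated assumptions, the toy below stands in hashed character trigrams for a learned Verilog encoder (`embed_line`) and averages line vectors into a module vector; all names here are illustrative, not the authors' implementation.

```python
import hashlib

DIM = 64  # toy embedding width; the paper does not state its dimensionality


def embed_line(line: str, dim: int = DIM) -> list[float]:
    """Toy line embedding: hashed character trigrams, L2-normalized.
    A stand-in for whatever learned encoder VeriCWEty actually uses."""
    vec = [0.0] * dim
    text = line.strip()
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]


def embed_module(lines: list[str]) -> list[float]:
    """Module embedding as the mean of its line embeddings (an assumed
    pooling choice; the paper leaves the aggregation unspecified)."""
    vecs = [embed_line(line) for line in lines]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Hypothetical buggy module: a state register with no reset (CWE-1245-style).
module = [
    "module lock(input clk, input unlock, output reg state);",
    "always @(posedge clk) state <= unlock;",
    "endmodule",
]
line_vecs = [embed_line(line) for line in module]
module_vec = embed_module(module)
```

A downstream classifier trained on such vectors would then assign a CWE label per module and per line; the sketch only shows the vectorization step the claim rests on.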

If this is right

  • Line-level localization becomes feasible for semantic bugs that evade rule-based checks.
  • Module-level and line-level classification of specific CWEs is performed in one pass.
  • Security review of LLM-generated RTL code gains a practical detection tool.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same embedding approach could be tested on other hardware description languages to check generality.
  • Combining the embeddings with existing formal properties might raise overall detection rates.
  • The method supplies a concrete baseline for comparing future line-level hardware vulnerability tools.

Load-bearing premise

Embedding vectors derived from Verilog code can reliably capture semantic vulnerabilities and enable precise line-level localization where rule-based and formal methods fail.

What would settle it

A held-out set of Verilog modules with documented instances of CWE-1244 or CWE-1245 at known lines: if the embedding method falls below 80 percent precision or 90 percent line-level accuracy on that set, the central claim fails.

Figures

Figures reproduced from arXiv: 2604.15375 by Anatolii Chuvashlov, Johann Knechtel, Ozgur Sinanoglu, Prithwish Basu Roy, Ramesh Karri, Weihua Xiao, Zeng Wang.

Figure 1. Voting scheme determines the module-level CWEs and line-level bugs.
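The voting scheme in Figure 1 is named but not specified. One plausible reading, assumed here rather than taken from the paper, is a majority vote over per-line CWE predictions, with clean lines abstaining:

```python
from collections import Counter


def module_cwe_by_vote(line_predictions: list[str]) -> str:
    """Aggregate per-line CWE predictions into one module-level label by
    majority vote; 'CLEAN' lines abstain. A hypothetical reading of the
    paper's voting scheme, not its confirmed mechanism."""
    votes = Counter(p for p in line_predictions if p != "CLEAN")
    return votes.most_common(1)[0][0] if votes else "CLEAN"


preds = ["CLEAN", "CWE-1245", "CWE-1245", "CWE-1244", "CLEAN"]
# Majority of flagged lines carry CWE-1245, so the module is labeled CWE-1245.
module_label = module_cwe_by_vote(preds)
```

Other aggregations (confidence-weighted voting, any-line triggering) would fit the same figure; the sketch only fixes one concrete instance.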
Figure 2. End-to-end VeriCWEty pipeline including data generation, embedding extraction, training, and inference evaluated on the test set of buggy modules (Step 8). Predicted CWEs are assigned to each test module. For line-level testing, module-level testing is first conducted to identify the CWE type. Module-level embeddings are then combined with line embeddings for evaluation. The classifier predicts which lines…
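The Figure 2 caption says module-level embeddings are "combined" with line embeddings before line-level classification, without naming the operator. Concatenation is one standard choice, sketched here as an assumption:

```python
def combine(line_vec: list[float], module_vec: list[float]) -> list[float]:
    """Concatenate a line embedding with its module's embedding so the
    line-level classifier sees both local and global context.
    (Concatenation is assumed; the paper does not name the operator.)"""
    return line_vec + module_vec


# A 2-d line vector joined with a 3-d module vector yields a 5-d feature.
feature = combine([0.1, 0.2], [0.3, 0.4, 0.5])
```

Element-wise addition or gating would serve the same role; whichever operator is used, the referee's concern below about localization versus module-level leakage applies to it.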
Figure 3. Line-level embeddings vs. line-level + module-level embeddings classification analysis. Starting from top left and going clockwise: (a) Metric…
read the original abstract

Large Language Models (LLMs) have shown significant improvement in RTL code generation. Despite the advances, the generated code is often riddled with common vulnerabilities and weaknesses (CWEs) that can slip by untrained eyes. Attackers can often exploit these weaknesses to fulfill their nefarious motives. Existing RTL bug-detection techniques rely on rule-based checks, formal properties, or coarse-grained structural analysis, which either fail to capture semantic vulnerabilities or lack precise localization. In our work, we bridge this gap by proposing an embedding-based bug-detection framework that detects and classifies bugs at both module and line-level granularity. Our method achieves about 89% precision in identifying common CWEs such as CWE-1244 and CWE-1245, and 96% accuracy in detecting line-level bugs.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VeriCWEty, an embedding-enabled framework for detecting Common Weakness Enumerations (CWEs) in Verilog RTL code at both module and line-level granularity. It positions the approach as bridging gaps in rule-based, formal, and coarse-grained structural methods by using embeddings to capture semantic vulnerabilities, claiming approximately 89% precision on CWEs such as CWE-1244 and CWE-1245 along with 96% accuracy for line-level bug detection.

Significance. If the central claims hold under proper validation, the work could meaningfully advance hardware security tooling by offering semantic, localized CWE detection in Verilog where existing techniques are limited. The embedding-based line-level localization, if mechanistically sound, would represent a useful direction beyond module-level classification.

major comments (2)
  1. [Abstract] Abstract: performance metrics (89% precision, 96% accuracy) are stated without any description of the dataset, model architecture, training procedure, baselines, validation splits, or statistical significance. This absence makes it impossible to determine whether the numbers support the central claim of effective line-level CWE detection.
  2. [Method description] Method description: the embedding pipeline for line-level output is not specified (e.g., independent per-line embeddings, token-level classifiers, or post-hoc attribution). Without this mechanism or ablations on context window size, the reported 96% line-level accuracy risks reflecting module-level detection rather than true per-line localization, undermining the claimed advantage over rule-based tools.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'common CWEs such as CWE-1244 and CWE-1245' is used without enumerating all evaluated CWEs or providing concrete bug examples.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's comments on the need for greater clarity in the abstract and method description. We respond to each point below, agreeing where revisions are needed to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: performance metrics (89% precision, 96% accuracy) are stated without any description of the dataset, model architecture, training procedure, baselines, validation splits, or statistical significance. This absence makes it impossible to determine whether the numbers support the central claim of effective line-level CWE detection.

    Authors: We agree that the abstract does not provide sufficient context for the reported metrics. To address this, we will revise the abstract to include a concise description of the dataset used, the embedding model architecture, the training and validation procedures, and the baselines considered. This will allow readers to better assess the validity of the 89% precision and 96% accuracy claims. revision: yes

  2. Referee: [Method description] Method description: the embedding pipeline for line-level output is not specified (e.g., independent per-line embeddings, token-level classifiers, or post-hoc attribution). Without this mechanism or ablations on context window size, the reported 96% line-level accuracy risks reflecting module-level detection rather than true per-line localization, undermining the claimed advantage over rule-based tools.

    Authors: We acknowledge that the specific mechanism for generating line-level outputs from embeddings is not detailed in the manuscript. We will revise the method section to clearly specify the embedding pipeline, including how per-line classifications are obtained (e.g., via context-aware embeddings or attribution methods), and include ablations on context window sizes to confirm that the line-level accuracy reflects genuine localization rather than module-level effects. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents an empirical embedding-based ML framework for CWE detection in Verilog, reporting experimental precision and accuracy metrics on datasets. No mathematical derivations, equations, or self-referential predictions appear in the abstract or described approach. Claims rest on standard training of embeddings and classifiers rather than any tautological reduction of outputs to inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes that collapse the central result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no free parameters, axioms, or invented entities can be extracted. The approach depends on embeddings but provides no details on how they are obtained or used.

pith-pipeline@v0.9.0 · 5456 in / 1188 out tokens · 36817 ms · 2026-05-10T11:40:22.925256+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

    cs.CR 2026-05 unverdicted novelty 3.0

    A survey of LLM applications in secure hardware design covering EDA synthesis, vulnerability analysis, countermeasures, and educational uses.

  2. LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

    cs.CR 2026-05 accept novelty 2.0

    LLMs enable RTL code generation and vulnerability analysis in hardware design but introduce data contamination and adversarial risks that require red-teaming and dynamic benchmarking.

Reference graph

Works this paper leans on

28 extracted references · 6 canonical work pages · cited by 1 Pith paper

  1. [1]

    A survey on hardware vulnerability analysis using machine learning,

    Z. Pan and P. Mishra, “A survey on hardware vulnerability analysis using machine learning,” IEEE Access, vol. 10, pp. 49508–49527, 2022

  2. [2]

    Fixing hardware security bugs with large language models,

    B. Ahmad, S. Thakur, B. Tan, R. Karri, and H. Pearce, “Fixing hardware security bugs with large language models,” arXiv preprint arXiv:2302.01215, 2023

  3. [3]

    A survey on assertion-based hardware verification,

    H. Witharana, Y. Lyu, S. Charles, and P. Mishra, “A survey on assertion-based hardware verification,” ACM Computing Surveys (CSUR), vol. 54, no. 11s, pp. 1–33, 2022

  4. [4]

    Directed test generation for hardware validation: A survey,

    A. Jayasena and P. Mishra, “Directed test generation for hardware validation: A survey,” ACM Computing Surveys, vol. 56, no. 5, pp. 1–36, 2024

  5. [5]

    Principles of verifiable rtl design,

    L. Bening, “Principles of verifiable rtl design,” in Principles of Verifiable RTL Design: A Functional Coding Style Supporting Verification Processes in Verilog, pp. 239–245, Boston, MA, USA: Springer US, 2001

  6. [6]

    Don’t cweat it: Toward cwe analysis techniques in early stages of hardware design,

    B. Ahmad, W.-K. Liu, L. Collini, H. Pearce, J. M. Fung, J. Valamehr, M. Bidmeshki, P. Sapiecha, S. Brown, K. Chakrabarty, R. Karri, and B. Tan, “Don’t cweat it: Toward cwe analysis techniques in early stages of hardware design,” in Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, ICCAD ’22, (New York, NY, USA), Associati...

  7. [7]

    LASHED: LLMs and static hardware analysis for early detection of RTL bugs,

    B. Ahmad, H. Pearce, R. Karri, and B. Tan, “Lashed: Llms and static hardware analysis for early detection of rtl bugs,” arXiv preprint arXiv:2504.21770, 2025

  8. [8]

    Veriloglavd: Llm-aided rule generation for vulnerability detection in verilog,

    X. Long, Y. Xia, X. Chen, and L. Kuang, “Veriloglavd: Llm-aided rule generation for vulnerability detection in verilog,” arXiv preprint arXiv:2508.13092, 2025

  9. [9]

    Large language model for vulnerability detection: Emerging results and future directions,

    X. Zhou, T. Zhang, and D. Lo, “Large language model for vulnerability detection: Emerging results and future directions,” in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, pp. 47–51, 2024

  10. [10]

    TrojanLoC: Fine-grained hardware Trojan detection from Verilog code,

    W. Xiao, Z. Wang, M. Shao, R. V. Hemadri, O. Sinanoglu, M. Shafique, J. Knechtel, S. Garg, and R. Karri, “Trojanloc: Llm-based framework for rtl trojan localization,” arXiv preprint arXiv:2512.00591, 2025

  11. [11]

    Veriloc: Line-of-code level prediction of hardware design quality from verilog code,

    R. V. Hemadri, J. Bhandari, A. Nakkab, J. Knechtel, B. P. Gopalan, R. Narayanaswamy, R. Karri, and S. Garg, “Veriloc: Line-of-code level prediction of hardware design quality from verilog code,” arXiv preprint arXiv:2506.07239, 2025

  12. [12]

    Llm4sechw: Leveraging domain-specific large language model for hardware debugging,

    W. Fu, K. Yang, R. G. Dutta, X. Guo, and G. Qu, “Llm4sechw: Leveraging domain-specific large language model for hardware debugging,” in 2023 Asian hardware oriented security and trust symposium (AsianHOST), pp. 1–6, IEEE, 2023

  13. [13]

    Llms and the future of chip design: Unveiling security risks and building trust,

    Z. Wang, L. Alrahis, L. Mankali, J. Knechtel, and O. Sinanoglu, “Llms and the future of chip design: Unveiling security risks and building trust,” in 2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 385–390, IEEE, 2024

  14. [14]

    Llm-assisted bug identification and correction for verilog hdl,

    K. Qayyum, C. K. Jha, S. Ahmadi-Pour, M. Hassan, and R. Drechsler, “Llm-assisted bug identification and correction for verilog hdl,” ACM Transactions on Design Automation of Electronic Systems, vol. 30, no. 6, pp. 1–28, 2025

  15. [15]

    Common Weakness Enumeration: A Community-Developed List of Software and Hardware Weaknesses

    MITRE Corporation, “Common Weakness Enumeration: A Community-Developed List of Software and Hardware Weaknesses.” https://cwe.mitre.org/index.html, 2026. Accessed: 18 March 2026

  16. [16]

    Security properties for open-source hardware designs,

    J. Rogers, N. Shakeel, D. Mankani, S. Espinosa, C. Chabra, K. Ryan, and C. Sturton, “Security properties for open-source hardware designs,” arXiv preprint arXiv:2412.08769, 2024

  17. [17]

    Hunting security bugs in soc designs: Lessons learned,

    M. M. Bidmeshki, Y. Zhang, M. Zaman, L. Zhou, and Y. Makris, “Hunting security bugs in soc designs: Lessons learned,” IEEE Design & Test, vol. 38, no. 1, pp. 22–29, 2021

  18. [18]

    Hardfails: insights into software-exploitable hardware bugs,

    G. Dessouky, D. Gens, P. Haney, G. Persyn, A. Kanuparthi, H. Khattri, J. M. Fung, A.-R. Sadeghi, and J. Rajendran, “Hardfails: insights into software-exploitable hardware bugs,” in Proceedings of the 28th USENIX Conference on Security Symposium, SEC’19, (USA), pp. 213–230, USENIX Association, 2019

  19. [19]

    Rigorous engineering for hardware security: Formal modelling and proof in the cheri design and implementation process,

    K. Nienhuis, A. Joannou, T. Bauereiss, A. Fox, M. Roe, B. Campbell, M. Naylor, R. M. Norton, S. W. Moore, P. G. Neumann, I. Stark, R. N. M. Watson, and P. Sewell, “Rigorous engineering for hardware security: Formal modelling and proof in the cheri design and implementation process,” in 2020 IEEE Symposium on Security and Privacy (SP), pp. 1003–1020, 2020

  20. [20]

    Invited: Formal verification of security critical hardware-firmware interactions in commercial socs,

    S. Ray, N. Ghosh, R. J. Masti, A. Kanuparthi, and J. M. Fung, “Invited: Formal verification of security critical hardware-firmware interactions in commercial socs,” in 2019 56th ACM/IEEE Design Automation Conference (DAC), pp. 1–4, 2019

  21. [21]

    Rtl-contest: Concolic testing on rtl for detecting security vulnerabilities,

    X. Meng, S. Kundu, A. K. Kanuparthi, and K. Basu, “Rtl-contest: Concolic testing on rtl for detecting security vulnerabilities,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 3, pp. 466–477, 2022

  22. [22]

    Self-hwdebug: Automation of llm self-instructing for hardware security verification,

    M. Akyash and H. M. Kamali, “Self-hwdebug: Automation of llm self-instructing for hardware security verification,” 2024

  23. [23]

    Veriloglavd: Llm-aided rule generation for vulnerability detection in verilog,

    X. Long, Y. Xia, X. Chen, and L. Kuang, “Veriloglavd: Llm-aided rule generation for vulnerability detection in verilog,” 2025

  24. [24]

    All artificial, less intelligence: Genai through the lens of formal verification,

    D. N. Gadde, A. Kumar, T. Nalapat, E. Rezunov, and F. Cappellini, “All artificial, less intelligence: Genai through the lens of formal verification,” 2024

  25. [25]

    Bugwhisperer: Fine-tuning llms for soc hardware vulnerability detection,

    S. Tarek, D. Saha, S. K. Saha, and F. Farahmandi, “Bugwhisperer: Fine-tuning llms for soc hardware vulnerability detection,” in 2025 IEEE 43rd VLSI Test Symposium (VTS), pp. 1–5, 2025

  26. [26]

    Lashed: Llms and static hardware analysis for early detection of rtl bugs,

    B. Ahmad, H. Pearce, R. Karri, and B. Tan, “Lashed: Llms and static hardware analysis for early detection of rtl bugs,” 2025

  27. [27]

    OpenRouter Platform

    OpenRouter, “OpenRouter Platform.” https://openrouter.ai/, 2026. Online; accessed March 20, 2026

  28. [28]

    cl-verilog-1.0: Verilog fine-tuned language model

    ajn313, “cl-verilog-1.0: Verilog fine-tuned language model.” https://huggingface.co/ajn313/cl-verilog-1.0, 2026. Accessed: 2026-03-20