AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

Jing Wang; Wenji Fang; Wenkai Li; Yuchao Wu; Zhiyao Xie; Ziyan Guo

arxiv: 2605.27472 · v1 · pith:CZ6M6KXCnew · submitted 2026-05-26 · 💻 cs.AR · cs.AI

AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

Yuchao Wu , Wenji Fang , Jing Wang , Wenkai Li , Ziyan Guo , Zhiyao Xie This is my paper

Pith reviewed 2026-07-01 16:13 UTC · model grok-4.3

classification 💻 cs.AR cs.AI

keywords assertion generationLLM benchmarkhardware verificationSystemVerilog Assertionsbug detectionRTL designformal verificationmutation testing

0 comments

The pith

AssertLLM2 supplies 83 designs with specifications and buggy RTL to test whether LLMs can generate assertions that both prevent and detect hardware bugs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AssertLLM2 as a benchmark that supplies structured design specifications, verified golden RTL, and mutated buggy RTL for each of 83 designs across 13 categories. It defines two evaluation settings: one where assertions are generated from specifications alone to guard against errors, and one where assertions are generated from specifications plus buggy RTL to expose faults. The work claims this setup, together with metrics of syntactic validity, formal provability, coverage, and mutation-based bug detection, produces a more realistic assessment of LLM performance than prior benchmarks. A sympathetic reader would care because manual assertion writing is labor-intensive and error-prone, so improved automation could reduce verification cost in hardware design.

Core claim

AssertLLM2 is the first benchmark to explicitly use buggy RTL as input to evaluate the bug-detection capability of LLM-generated assertions, supported by a dataset of 83 real-world designs, structured specifications, golden RTL, and systematically mutated variants, together with an evaluation framework spanning syntactic validity, formal provability, coverage, and mutation-based bug detection.

What carries the argument

The benchmark dataset of structured design specifications paired with verified golden RTL and mutated buggy RTL variants across 13 functional categories, enabling separate bug-prevention and bug-hunting tasks.

If this is right

LLM-generated assertions can be evaluated for their ability to expose injected faults in RTL rather than only for syntactic or coverage properties.
Baselines established on AssertLLM2 can be used to track progress of future LLMs on assertion tasks in hardware verification.
The two-task formulation separates the problem of writing correct assertions from the problem of writing assertions that detect specific implementation errors.
Designs in the benchmark include dependency-complete golden RTL, allowing assertions to be checked with formal tools without additional setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the benchmark's mutation-based scores correlate with real silicon bug escapes, teams could adopt the bug-hunting setting as a pre-silicon filter before tape-out.
The separation of bug-prevention and bug-hunting tasks suggests that different prompting or fine-tuning strategies may be optimal for each.
Extending the benchmark with timing or power assertions would test whether the same LLM approach generalizes beyond functional properties.

Load-bearing premise

The chosen set of 83 designs and the four evaluation metrics are sufficient to produce a realistic and extensive assessment of assertion generation performance.

What would settle it

A follow-up study that applies the same LLMs to a substantially larger or differently distributed collection of industrial designs and obtains materially different rankings on the mutation-based bug-detection metric would undermine the benchmark's claim to realistic coverage.

Figures

Figures reproduced from arXiv: 2605.27472 by Jing Wang, Wenji Fang, Wenkai Li, Yuchao Wu, Zhiyao Xie, Ziyan Guo.

**Figure 2.** Figure 2: Comparison of design-level scale (in both RTL lines [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of the Mutation Insertion Methodology, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of the AssertLLM2 evaluation frame [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: COI coverage versus proof coverage. A signal may [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Per-design bug-prevention results across four generation settings on 12 representative designs. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Per-design bug-hunting results, measured by bug kill [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Assertion-based verification (ABV) is a cornerstone of modern hardware design, yet manually translating design intent into formal SystemVerilog Assertions (SVAs) remains labor-intensive and error-prone. While Large Language Models (LLMs) show promise for automating this process, existing benchmarks remain limited by unrealistic task formulations, weak specification inputs, and oversimplified evaluation. To address these limitations, we introduce AssertLLM2, an open-source benchmark for realistic assertion generation in hardware verification. AssertLLM2 contains 83 real-world designs across 13 functional categories. For each design, the benchmark provides a structured design specification, a verified dependency-complete golden RTL, and systematically mutated buggy RTL variants. These support two practical settings: bug-prevention, where assertions are generated from specifications to guard against design errors, and bug-hunting, where assertions are generated to expose discrepancies between intended behavior and faulty implementations. To the best of our knowledge, AssertLLM2 is the first benchmark to explicitly use buggy RTL as input to evaluate bug-detection capability. AssertLLM2 further adopts a more rigorous evaluation framework spanning syntactic validity, formal provability, coverage, and mutation-based bug detection. Our benchmark enables a more realistic and extensive assessment of assertion generation and establishes rigorous baselines for state-of-the-art LLMs in practical hardware verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AssertLLM2 brings buggy RTL variants into assertion-generation benchmarks, which is a clear step forward, but the abstract leaves the actual quality of the 83 designs and the evaluation framework untested.

read the letter

The paper's main contribution is a benchmark with 83 designs that supplies structured specs, verified golden RTL, and mutated buggy variants. This setup supports two tasks: generating assertions from specs to prevent bugs, and generating them from buggy RTL to hunt discrepancies. The explicit bug-hunting path with faulty implementations as input looks new compared with earlier assertion benchmarks.

The construction itself is straightforward and addresses real pain points in hardware verification. Providing dependency-complete golden RTL and systematic mutations gives a concrete way to measure both syntactic validity and whether assertions catch injected bugs. The four-part evaluation (syntax, formal provability, coverage, mutation detection) is more complete than simple pass/fail checks.

The soft spot is that nothing in the abstract shows the 83 designs are representative or that the mutations produce realistic faults. Without the actual dataset or any model results, it is impossible to tell whether the framework separates strong LLMs from weak ones or whether the categories cover the kinds of corner cases that matter in practice. The "to the best of our knowledge" phrasing on novelty is honest but also signals the claim rests on the authors' literature search rather than a formal gap analysis.

This is for researchers building or evaluating LLM tools for SystemVerilog assertion generation. A hardware-verification group that needs a public testbed with bug-injection support would find it directly useful. The work is coherent on its own terms and the benchmark construction is the kind of thing that deserves referee time even if the final numbers need tightening.

Referee Report

0 major / 2 minor

Summary. The paper introduces AssertLLM2, an open-source benchmark for evaluating LLMs on generating SystemVerilog Assertions (SVAs) from design specifications. It comprises 83 real-world designs across 13 functional categories, each providing a structured specification, verified dependency-complete golden RTL, and systematically mutated buggy RTL variants. The benchmark supports two settings—bug-prevention (assertions from specs to guard against errors) and bug-hunting (assertions to detect discrepancies in faulty implementations)—and employs an evaluation framework covering syntactic validity, formal provability, coverage, and mutation-based bug detection. It claims, to the best of its knowledge, to be the first benchmark explicitly using buggy RTL as input for assessing bug-detection capability.

Significance. If the benchmark construction, design selection, and evaluation framework hold as described, this work would offer a meaningful advance by addressing limitations in prior benchmarks through more realistic inputs and metrics focused on practical hardware verification. The explicit inclusion of buggy variants, support for both prevention and detection scenarios, and open-source release are concrete strengths that enable reproducible baselines and future comparisons in the field.

minor comments (2)

[Abstract] Abstract: the phrase 'establishes rigorous baselines' would benefit from a brief parenthetical note on the number of LLMs evaluated or the scale of the reported results to convey the extent of the empirical component.
[Benchmark Construction] The manuscript should ensure that the description of the mutation process for creating buggy RTL variants includes explicit details on the mutation operators and coverage of common bug types to strengthen reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the benchmark's contributions, and recommendation for minor revision. We appreciate the acknowledgment of the realistic inputs, dual settings, and open-source aspects as strengths.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

This is a benchmark-construction paper whose central contribution is the definition and release of AssertLLM2 (83 designs, structured specs, golden RTL, mutated buggy variants, and four evaluation axes). No derivation chain, fitted parameters, or predictive claims exist that could reduce to the paper's own inputs by construction. The sole novelty statement is explicitly qualified ('to the best of our knowledge') and does not rely on self-citation load-bearing or uniqueness theorems imported from prior author work. The evaluation framework is described directly from the benchmark elements without any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark introduction paper; no mathematical axioms, free parameters, or invented physical entities are present.

pith-pipeline@v0.9.1-grok · 5779 in / 1097 out tokens · 31106 ms · 2026-07-01T16:13:31.738841+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 11 canonical work pages · 5 internal anchors

[1]

Llm-assisted circuit verification: A comprehensive survey,

H. Liu, Y. Lu, M. Wang, X. Yao, and B. Yu, “Llm-assisted circuit verification: A comprehensive survey, ” in2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC), 2026

2026
[2]

A survey on assertion-based hardware verification,

H. Witharana, Y. Lyu, S. Charles, and P. Mishra, “A survey on assertion-based hardware verification, ”ACM Computing Surveys (CSUR), 2022

2022
[3]

Assertion-based verification: Industry myths to realities (invited tuto- rial),

H. Foster, “Assertion-based verification: Industry myths to realities (invited tuto- rial), ” inInternational Conference on Computer Aided Verification, 2008

2008
[4]

Spec2assertion: Automatic pre-rtl assertion generation using large language models with progressive regularization,

F. Wu, E. Pan, R. Kande, M. Quinn, A. Tyagi, D. K. Houngninou, J. Rajendran, and J. Hu, “Spec2assertion: Automatic pre-rtl assertion generation using large language models with progressive regularization, ”arXiv preprint arXiv:2505.07995, 2025

work page arXiv 2025
[5]

ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

L. T. Fu, J. Zhou, S. Ren, M. Zhang, J. Xiong, H. Jiang, N. Guan, X. Wang, and J. Yang, “Chatsva: Bridging sva generation for hardware verification via task-specific llms, ” arXiv preprint arXiv:2604.02811, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Assertcoder: Llm-based assertion gen- eration via multimodal specification extraction,

E. Tian, Y. Ci, Q. Yang, Y. Li, and Z. Lyu, “Assertcoder: Llm-based assertion gen- eration via multimodal specification extraction, ”arXiv preprint arXiv:2507.10338, 2025

work page arXiv 2025
[7]

Assertionforge: Enhancing formal verification assertion generation with structured representation of specifications and rtl,

Y. Bai, G. B. Hamad, S. Suhaib, and H. Ren, “Assertionforge: Enhancing formal verification assertion generation with structured representation of specifications and rtl, ” in2025 IEEE International Conference on LLM-Aided Design (ICLAD), 2025

2025
[8]

Assertgen: En- hancement of llm-aided assertion generation through cross-layer signal bridging,

H. Lyu, Y. Wang, Y. Du, M. Shi, Z. Chao, W. Li, T. Wang, and H. Li, “Assertgen: En- hancement of llm-aided assertion generation through cross-layer signal bridging, ” arXiv preprint arXiv:2509.23674, 2025

work page arXiv 2025
[9]

Assertminer: Module-level spec generation and assertion mining using static analysis guided llms,

H. Lyu, Y. Wang, J. Zhou, Z. Chao, T. Wang, and H. Li, “Assertminer: Module-level spec generation and assertion mining using static analysis guided llms, ” in2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC), 2026

2026
[10]

Assertfix: Empowering automated assertion fix via large language models,

H. Lyu, Y. Du, Y. Wang, Z. Chao, T. Wang, and H. Li, “Assertfix: Empowering automated assertion fix via large language models, ”arXiv preprint arXiv:2509.23972, 2025

work page arXiv 2025
[11]

Chiraag: Chatgpt informed rapid and automated assertion generation,

B. Mali, K. Maddala, V. Gupta, S. Reddy, C. Karfa, and R. Karri, “Chiraag: Chatgpt informed rapid and automated assertion generation, ” in2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2024

2024
[12]

Automated verilog assertion generation using fine-tuned llms with subtask-specific iterative prompt- ing,

M. Shahidzadeh, B. Ghavami, S. J. Wilton, and L. Shannon, “Automated verilog assertion generation using fine-tuned llms with subtask-specific iterative prompt- ing, ” in2025 26th International Symposium on Quality Electronic Design (ISQED), 2025

2025
[13]

Bridging rtl and assertion generation with large language models,

J. Thangellamudi and S. M. P. Dinakarrao, “Bridging rtl and assertion generation with large language models, ”IEEE Design & Test, 2026

2026
[14]

Enhancing large language models for hardware verification: A novel systemverilog assertion dataset,

A. Menon, S. Miftah, S. Kundu, S. Kundu, A. Srivastava, A. Raha, G. Sonnenschien, S. Banerjee, D. Mathaikutty, and K. Basu, “Enhancing large language models for hardware verification: A novel systemverilog assertion dataset, ”ACM Transactions on Design Automation of Electronic Systems, 2025

2025
[15]

Assertionbench: A benchmark to eval- uate large-language models for assertion generation,

V. Pulavarthi, D. Nandal, S. Dan, and D. Pal, “Assertionbench: A benchmark to eval- uate large-language models for assertion generation, ” inFindings of the Association for Computational Linguistics: NAACL 2025, 2025

2025
[16]

Fveval: Understanding language model capabilities in formal verification of digital hardware,

M. Kang, M. Liu, G. B. Hamad, S. M. Suhaib, and H. Ren, “Fveval: Understanding language model capabilities in formal verification of digital hardware, ” in2025 Design, Automation & Test in Europe Conference (DATE), 2025

2025
[17]

Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Eval- uating Large Language Models and Agents on RTL Design and Verification

N. Pinckney, C. Deng, C.-T. Ho, Y.-D. Tsai, M. Liu, W. Zhou, B. Khailany, and H. Ren, “Comprehensive verilog design problems: A next-generation benchmark dataset for evaluating large language models and agents on rtl design and verification, ” arXiv preprint arXiv:2506.14074, 2025

work page arXiv 2025
[18]

Assertllm: Generating and evaluating hardware verification assertions from design specifications via multi-llms,

W. Fang, M. Li, M. Li, Z. Yan, S. Liu, Z. Xie, and H. Zhang, “Assertllm: Generating and evaluating hardware verification assertions from design specifications via multi-llms, ”arXiv preprint arXiv:2402.00386, 2024

work page arXiv 2024
[19]

Detecting defects in software requirements specification,

A. A. Alshazly, A. M. Elfatatry, and M. S. Abougabal, “Detecting defects in software requirements specification, ”Alexandria Engineering Journal, 2014

2014
[20]

Natural language is not enough: Benchmarking multi-modal generative ai for verilog generation,

K. Chang, Z. Chen, Y. Zhou, W. Zhu, K. Wang, H. Xu, C. Li, M. Wang, S. Liang, H. Li et al., “Natural language is not enough: Benchmarking multi-modal generative ai for verilog generation, ” inProceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024

2024
[21]

FreeCores,

FreeCores, “FreeCores, ” https://github.com/orgs/freecores
[22]

Hummingbirdv2 E203 Core and SoC,

riscv-mcu, “Hummingbirdv2 E203 Core and SoC, ” https://github.com/riscv-mcu/ e203_hbirdv2
[23]

OpenCores,

OpenCores, “OpenCores, ” https://opencores.org/
[24]

VexRiscv: A FPGA friendly 32 bit RISC-V CPU implementation,

SpinalHDL, “VexRiscv: A FPGA friendly 32 bit RISC-V CPU implementation, ” https://github.com/SpinalHDL/VexRiscv
[25]

Cadence Design Systems, Inc.,JasperGold Apps User’s Guide, Cadence Design Systems, Inc., Sep. 2015

2015
[26]

Mantra: Mutation testing of hardware design code based on real bugs,

J. Wu, Y. Lei, Z. Zhang, X. Meng, D. Yang, P. Li, J. He, and X. Mao, “Mantra: Mutation testing of hardware design code based on real bugs, ” in2023 60th ACM/IEEE Design Automation Conference (DAC), 2023

2023
[27]

Pyverilog: A python-based hardware design processing toolkit for verilog hdl,

S. Takamaeda-Yamazaki, “Pyverilog: A python-based hardware design processing toolkit for verilog hdl, ” inInternational Symposium on Applied Reconfigurable Computing, 2015

2015
[28]

Cadence Design Systems, Inc.,Conformal Equivalence Checking User Guide, Ca- dence Design Systems, Inc., Oct. 2021

2021
[29]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosenet al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, ”arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Claude sonnet 4.5 system card,

Anthropic, “Claude sonnet 4.5 system card, ” Anthropic, Tech. Rep., 2025

2025
[31]

OpenAI GPT-5 System Card

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, El-Kishkyet al., “Openai gpt-5 system card, ”arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong et al., “Deepseek-v3.2: Pushing the frontier of open large language models, ”arXiv preprint arXiv:2512.02556, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Qwen3-Coder-Next Technical Report

R. Cao, M. Chen, J. Chen, Z. Cui, Y. Feng, B. Hui, Y. Jing, K. Li, M. Li, J. Linet al., “Qwen3-coder-next technical report, ”arXiv preprint arXiv:2603.00729, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

Meta, “The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. ” [Online]. Available: https://ai.meta.com/blog/llama-4-multimodal- intelligence

[1] [1]

Llm-assisted circuit verification: A comprehensive survey,

H. Liu, Y. Lu, M. Wang, X. Yao, and B. Yu, “Llm-assisted circuit verification: A comprehensive survey, ” in2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC), 2026

2026

[2] [2]

A survey on assertion-based hardware verification,

H. Witharana, Y. Lyu, S. Charles, and P. Mishra, “A survey on assertion-based hardware verification, ”ACM Computing Surveys (CSUR), 2022

2022

[3] [3]

Assertion-based verification: Industry myths to realities (invited tuto- rial),

H. Foster, “Assertion-based verification: Industry myths to realities (invited tuto- rial), ” inInternational Conference on Computer Aided Verification, 2008

2008

[4] [4]

Spec2assertion: Automatic pre-rtl assertion generation using large language models with progressive regularization,

F. Wu, E. Pan, R. Kande, M. Quinn, A. Tyagi, D. K. Houngninou, J. Rajendran, and J. Hu, “Spec2assertion: Automatic pre-rtl assertion generation using large language models with progressive regularization, ”arXiv preprint arXiv:2505.07995, 2025

work page arXiv 2025

[5] [5]

ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

L. T. Fu, J. Zhou, S. Ren, M. Zhang, J. Xiong, H. Jiang, N. Guan, X. Wang, and J. Yang, “Chatsva: Bridging sva generation for hardware verification via task-specific llms, ” arXiv preprint arXiv:2604.02811, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Assertcoder: Llm-based assertion gen- eration via multimodal specification extraction,

E. Tian, Y. Ci, Q. Yang, Y. Li, and Z. Lyu, “Assertcoder: Llm-based assertion gen- eration via multimodal specification extraction, ”arXiv preprint arXiv:2507.10338, 2025

work page arXiv 2025

[7] [7]

Assertionforge: Enhancing formal verification assertion generation with structured representation of specifications and rtl,

Y. Bai, G. B. Hamad, S. Suhaib, and H. Ren, “Assertionforge: Enhancing formal verification assertion generation with structured representation of specifications and rtl, ” in2025 IEEE International Conference on LLM-Aided Design (ICLAD), 2025

2025

[8] [8]

Assertgen: En- hancement of llm-aided assertion generation through cross-layer signal bridging,

H. Lyu, Y. Wang, Y. Du, M. Shi, Z. Chao, W. Li, T. Wang, and H. Li, “Assertgen: En- hancement of llm-aided assertion generation through cross-layer signal bridging, ” arXiv preprint arXiv:2509.23674, 2025

work page arXiv 2025

[9] [9]

Assertminer: Module-level spec generation and assertion mining using static analysis guided llms,

H. Lyu, Y. Wang, J. Zhou, Z. Chao, T. Wang, and H. Li, “Assertminer: Module-level spec generation and assertion mining using static analysis guided llms, ” in2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC), 2026

2026

[10] [10]

Assertfix: Empowering automated assertion fix via large language models,

H. Lyu, Y. Du, Y. Wang, Z. Chao, T. Wang, and H. Li, “Assertfix: Empowering automated assertion fix via large language models, ”arXiv preprint arXiv:2509.23972, 2025

work page arXiv 2025

[11] [11]

Chiraag: Chatgpt informed rapid and automated assertion generation,

B. Mali, K. Maddala, V. Gupta, S. Reddy, C. Karfa, and R. Karri, “Chiraag: Chatgpt informed rapid and automated assertion generation, ” in2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2024

2024

[12] [12]

Automated verilog assertion generation using fine-tuned llms with subtask-specific iterative prompt- ing,

M. Shahidzadeh, B. Ghavami, S. J. Wilton, and L. Shannon, “Automated verilog assertion generation using fine-tuned llms with subtask-specific iterative prompt- ing, ” in2025 26th International Symposium on Quality Electronic Design (ISQED), 2025

2025

[13] [13]

Bridging rtl and assertion generation with large language models,

J. Thangellamudi and S. M. P. Dinakarrao, “Bridging rtl and assertion generation with large language models, ”IEEE Design & Test, 2026

2026

[14] [14]

Enhancing large language models for hardware verification: A novel systemverilog assertion dataset,

A. Menon, S. Miftah, S. Kundu, S. Kundu, A. Srivastava, A. Raha, G. Sonnenschien, S. Banerjee, D. Mathaikutty, and K. Basu, “Enhancing large language models for hardware verification: A novel systemverilog assertion dataset, ”ACM Transactions on Design Automation of Electronic Systems, 2025

2025

[15] [15]

Assertionbench: A benchmark to eval- uate large-language models for assertion generation,

V. Pulavarthi, D. Nandal, S. Dan, and D. Pal, “Assertionbench: A benchmark to eval- uate large-language models for assertion generation, ” inFindings of the Association for Computational Linguistics: NAACL 2025, 2025

2025

[16] [16]

Fveval: Understanding language model capabilities in formal verification of digital hardware,

M. Kang, M. Liu, G. B. Hamad, S. M. Suhaib, and H. Ren, “Fveval: Understanding language model capabilities in formal verification of digital hardware, ” in2025 Design, Automation & Test in Europe Conference (DATE), 2025

2025

[17] [17]

Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Eval- uating Large Language Models and Agents on RTL Design and Verification

N. Pinckney, C. Deng, C.-T. Ho, Y.-D. Tsai, M. Liu, W. Zhou, B. Khailany, and H. Ren, “Comprehensive verilog design problems: A next-generation benchmark dataset for evaluating large language models and agents on rtl design and verification, ” arXiv preprint arXiv:2506.14074, 2025

work page arXiv 2025

[18] [18]

Assertllm: Generating and evaluating hardware verification assertions from design specifications via multi-llms,

W. Fang, M. Li, M. Li, Z. Yan, S. Liu, Z. Xie, and H. Zhang, “Assertllm: Generating and evaluating hardware verification assertions from design specifications via multi-llms, ”arXiv preprint arXiv:2402.00386, 2024

work page arXiv 2024

[19] [19]

Detecting defects in software requirements specification,

A. A. Alshazly, A. M. Elfatatry, and M. S. Abougabal, “Detecting defects in software requirements specification, ”Alexandria Engineering Journal, 2014

2014

[20] [20]

Natural language is not enough: Benchmarking multi-modal generative ai for verilog generation,

K. Chang, Z. Chen, Y. Zhou, W. Zhu, K. Wang, H. Xu, C. Li, M. Wang, S. Liang, H. Li et al., “Natural language is not enough: Benchmarking multi-modal generative ai for verilog generation, ” inProceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024

2024

[21] [21]

FreeCores,

FreeCores, “FreeCores, ” https://github.com/orgs/freecores

[22] [22]

Hummingbirdv2 E203 Core and SoC,

riscv-mcu, “Hummingbirdv2 E203 Core and SoC, ” https://github.com/riscv-mcu/ e203_hbirdv2

[23] [23]

OpenCores,

OpenCores, “OpenCores, ” https://opencores.org/

[24] [24]

VexRiscv: A FPGA friendly 32 bit RISC-V CPU implementation,

SpinalHDL, “VexRiscv: A FPGA friendly 32 bit RISC-V CPU implementation, ” https://github.com/SpinalHDL/VexRiscv

[25] [25]

Cadence Design Systems, Inc.,JasperGold Apps User’s Guide, Cadence Design Systems, Inc., Sep. 2015

2015

[26] [26]

Mantra: Mutation testing of hardware design code based on real bugs,

J. Wu, Y. Lei, Z. Zhang, X. Meng, D. Yang, P. Li, J. He, and X. Mao, “Mantra: Mutation testing of hardware design code based on real bugs, ” in2023 60th ACM/IEEE Design Automation Conference (DAC), 2023

2023

[27] [27]

Pyverilog: A python-based hardware design processing toolkit for verilog hdl,

S. Takamaeda-Yamazaki, “Pyverilog: A python-based hardware design processing toolkit for verilog hdl, ” inInternational Symposium on Applied Reconfigurable Computing, 2015

2015

[28] [28]

Cadence Design Systems, Inc.,Conformal Equivalence Checking User Guide, Ca- dence Design Systems, Inc., Oct. 2021

2021

[29] [29]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosenet al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, ”arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Claude sonnet 4.5 system card,

Anthropic, “Claude sonnet 4.5 system card, ” Anthropic, Tech. Rep., 2025

2025

[31] [31]

OpenAI GPT-5 System Card

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, El-Kishkyet al., “Openai gpt-5 system card, ”arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong et al., “Deepseek-v3.2: Pushing the frontier of open large language models, ”arXiv preprint arXiv:2512.02556, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Qwen3-Coder-Next Technical Report

R. Cao, M. Chen, J. Chen, Z. Cui, Y. Feng, B. Hui, Y. Jing, K. Li, M. Li, J. Linet al., “Qwen3-coder-next technical report, ”arXiv preprint arXiv:2603.00729, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[34] [34]

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

Meta, “The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. ” [Online]. Available: https://ai.meta.com/blog/llama-4-multimodal- intelligence