pith. sign in

arxiv: 2605.27472 · v1 · pith:CZ6M6KXCnew · submitted 2026-05-26 · 💻 cs.AR · cs.AI

AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

Pith reviewed 2026-07-01 16:13 UTC · model grok-4.3

classification 💻 cs.AR cs.AI
keywords assertion generationLLM benchmarkhardware verificationSystemVerilog Assertionsbug detectionRTL designformal verificationmutation testing
0
0 comments X

The pith

AssertLLM2 supplies 83 designs with specifications and buggy RTL to test whether LLMs can generate assertions that both prevent and detect hardware bugs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AssertLLM2 as a benchmark that supplies structured design specifications, verified golden RTL, and mutated buggy RTL for each of 83 designs across 13 categories. It defines two evaluation settings: one where assertions are generated from specifications alone to guard against errors, and one where assertions are generated from specifications plus buggy RTL to expose faults. The work claims this setup, together with metrics of syntactic validity, formal provability, coverage, and mutation-based bug detection, produces a more realistic assessment of LLM performance than prior benchmarks. A sympathetic reader would care because manual assertion writing is labor-intensive and error-prone, so improved automation could reduce verification cost in hardware design.

Core claim

AssertLLM2 is the first benchmark to explicitly use buggy RTL as input to evaluate the bug-detection capability of LLM-generated assertions, supported by a dataset of 83 real-world designs, structured specifications, golden RTL, and systematically mutated variants, together with an evaluation framework spanning syntactic validity, formal provability, coverage, and mutation-based bug detection.

What carries the argument

The benchmark dataset of structured design specifications paired with verified golden RTL and mutated buggy RTL variants across 13 functional categories, enabling separate bug-prevention and bug-hunting tasks.

If this is right

  • LLM-generated assertions can be evaluated for their ability to expose injected faults in RTL rather than only for syntactic or coverage properties.
  • Baselines established on AssertLLM2 can be used to track progress of future LLMs on assertion tasks in hardware verification.
  • The two-task formulation separates the problem of writing correct assertions from the problem of writing assertions that detect specific implementation errors.
  • Designs in the benchmark include dependency-complete golden RTL, allowing assertions to be checked with formal tools without additional setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark's mutation-based scores correlate with real silicon bug escapes, teams could adopt the bug-hunting setting as a pre-silicon filter before tape-out.
  • The separation of bug-prevention and bug-hunting tasks suggests that different prompting or fine-tuning strategies may be optimal for each.
  • Extending the benchmark with timing or power assertions would test whether the same LLM approach generalizes beyond functional properties.

Load-bearing premise

The chosen set of 83 designs and the four evaluation metrics are sufficient to produce a realistic and extensive assessment of assertion generation performance.

What would settle it

A follow-up study that applies the same LLMs to a substantially larger or differently distributed collection of industrial designs and obtains materially different rankings on the mutation-based bug-detection metric would undermine the benchmark's claim to realistic coverage.

Figures

Figures reproduced from arXiv: 2605.27472 by Jing Wang, Wenji Fang, Wenkai Li, Yuchao Wu, Zhiyao Xie, Ziyan Guo.

Figure 1
Figure 1. Figure 1: Overview of the AssertLLM2 framework. Unlike prior [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of design-level scale (in both RTL lines [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the Mutation Insertion Methodology, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the AssertLLM2 evaluation frame [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: COI coverage versus proof coverage. A signal may [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-design bug-prevention results across four generation settings on 12 representative designs. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-design bug-hunting results, measured by bug kill [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Assertion-based verification (ABV) is a cornerstone of modern hardware design, yet manually translating design intent into formal SystemVerilog Assertions (SVAs) remains labor-intensive and error-prone. While Large Language Models (LLMs) show promise for automating this process, existing benchmarks remain limited by unrealistic task formulations, weak specification inputs, and oversimplified evaluation. To address these limitations, we introduce AssertLLM2, an open-source benchmark for realistic assertion generation in hardware verification. AssertLLM2 contains 83 real-world designs across 13 functional categories. For each design, the benchmark provides a structured design specification, a verified dependency-complete golden RTL, and systematically mutated buggy RTL variants. These support two practical settings: bug-prevention, where assertions are generated from specifications to guard against design errors, and bug-hunting, where assertions are generated to expose discrepancies between intended behavior and faulty implementations. To the best of our knowledge, AssertLLM2 is the first benchmark to explicitly use buggy RTL as input to evaluate bug-detection capability. AssertLLM2 further adopts a more rigorous evaluation framework spanning syntactic validity, formal provability, coverage, and mutation-based bug detection. Our benchmark enables a more realistic and extensive assessment of assertion generation and establishes rigorous baselines for state-of-the-art LLMs in practical hardware verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces AssertLLM2, an open-source benchmark for evaluating LLMs on generating SystemVerilog Assertions (SVAs) from design specifications. It comprises 83 real-world designs across 13 functional categories, each providing a structured specification, verified dependency-complete golden RTL, and systematically mutated buggy RTL variants. The benchmark supports two settings—bug-prevention (assertions from specs to guard against errors) and bug-hunting (assertions to detect discrepancies in faulty implementations)—and employs an evaluation framework covering syntactic validity, formal provability, coverage, and mutation-based bug detection. It claims, to the best of its knowledge, to be the first benchmark explicitly using buggy RTL as input for assessing bug-detection capability.

Significance. If the benchmark construction, design selection, and evaluation framework hold as described, this work would offer a meaningful advance by addressing limitations in prior benchmarks through more realistic inputs and metrics focused on practical hardware verification. The explicit inclusion of buggy variants, support for both prevention and detection scenarios, and open-source release are concrete strengths that enable reproducible baselines and future comparisons in the field.

minor comments (2)
  1. [Abstract] Abstract: the phrase 'establishes rigorous baselines' would benefit from a brief parenthetical note on the number of LLMs evaluated or the scale of the reported results to convey the extent of the empirical component.
  2. [Benchmark Construction] The manuscript should ensure that the description of the mutation process for creating buggy RTL variants includes explicit details on the mutation operators and coverage of common bug types to strengthen reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the benchmark's contributions, and recommendation for minor revision. We appreciate the acknowledgment of the realistic inputs, dual settings, and open-source aspects as strengths.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

This is a benchmark-construction paper whose central contribution is the definition and release of AssertLLM2 (83 designs, structured specs, golden RTL, mutated buggy variants, and four evaluation axes). No derivation chain, fitted parameters, or predictive claims exist that could reduce to the paper's own inputs by construction. The sole novelty statement is explicitly qualified ('to the best of our knowledge') and does not rely on self-citation load-bearing or uniqueness theorems imported from prior author work. The evaluation framework is described directly from the benchmark elements without any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark introduction paper; no mathematical axioms, free parameters, or invented physical entities are present.

pith-pipeline@v0.9.1-grok · 5779 in / 1097 out tokens · 31106 ms · 2026-07-01T16:13:31.738841+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 11 canonical work pages · 5 internal anchors

  1. [1]

    Llm-assisted circuit verification: A comprehensive survey,

    H. Liu, Y. Lu, M. Wang, X. Yao, and B. Yu, “Llm-assisted circuit verification: A comprehensive survey, ” in2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC), 2026

  2. [2]

    A survey on assertion-based hardware verification,

    H. Witharana, Y. Lyu, S. Charles, and P. Mishra, “A survey on assertion-based hardware verification, ”ACM Computing Surveys (CSUR), 2022

  3. [3]

    Assertion-based verification: Industry myths to realities (invited tuto- rial),

    H. Foster, “Assertion-based verification: Industry myths to realities (invited tuto- rial), ” inInternational Conference on Computer Aided Verification, 2008

  4. [4]

    Spec2assertion: Automatic pre-rtl assertion generation using large language models with progressive regularization,

    F. Wu, E. Pan, R. Kande, M. Quinn, A. Tyagi, D. K. Houngninou, J. Rajendran, and J. Hu, “Spec2assertion: Automatic pre-rtl assertion generation using large language models with progressive regularization, ”arXiv preprint arXiv:2505.07995, 2025

  5. [5]

    ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

    L. T. Fu, J. Zhou, S. Ren, M. Zhang, J. Xiong, H. Jiang, N. Guan, X. Wang, and J. Yang, “Chatsva: Bridging sva generation for hardware verification via task-specific llms, ” arXiv preprint arXiv:2604.02811, 2026

  6. [6]

    Assertcoder: Llm-based assertion gen- eration via multimodal specification extraction,

    E. Tian, Y. Ci, Q. Yang, Y. Li, and Z. Lyu, “Assertcoder: Llm-based assertion gen- eration via multimodal specification extraction, ”arXiv preprint arXiv:2507.10338, 2025

  7. [7]

    Assertionforge: Enhancing formal verification assertion generation with structured representation of specifications and rtl,

    Y. Bai, G. B. Hamad, S. Suhaib, and H. Ren, “Assertionforge: Enhancing formal verification assertion generation with structured representation of specifications and rtl, ” in2025 IEEE International Conference on LLM-Aided Design (ICLAD), 2025

  8. [8]

    Assertgen: En- hancement of llm-aided assertion generation through cross-layer signal bridging,

    H. Lyu, Y. Wang, Y. Du, M. Shi, Z. Chao, W. Li, T. Wang, and H. Li, “Assertgen: En- hancement of llm-aided assertion generation through cross-layer signal bridging, ” arXiv preprint arXiv:2509.23674, 2025

  9. [9]

    Assertminer: Module-level spec generation and assertion mining using static analysis guided llms,

    H. Lyu, Y. Wang, J. Zhou, Z. Chao, T. Wang, and H. Li, “Assertminer: Module-level spec generation and assertion mining using static analysis guided llms, ” in2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC), 2026

  10. [10]

    Assertfix: Empowering automated assertion fix via large language models,

    H. Lyu, Y. Du, Y. Wang, Z. Chao, T. Wang, and H. Li, “Assertfix: Empowering automated assertion fix via large language models, ”arXiv preprint arXiv:2509.23972, 2025

  11. [11]

    Chiraag: Chatgpt informed rapid and automated assertion generation,

    B. Mali, K. Maddala, V. Gupta, S. Reddy, C. Karfa, and R. Karri, “Chiraag: Chatgpt informed rapid and automated assertion generation, ” in2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2024

  12. [12]

    Automated verilog assertion generation using fine-tuned llms with subtask-specific iterative prompt- ing,

    M. Shahidzadeh, B. Ghavami, S. J. Wilton, and L. Shannon, “Automated verilog assertion generation using fine-tuned llms with subtask-specific iterative prompt- ing, ” in2025 26th International Symposium on Quality Electronic Design (ISQED), 2025

  13. [13]

    Bridging rtl and assertion generation with large language models,

    J. Thangellamudi and S. M. P. Dinakarrao, “Bridging rtl and assertion generation with large language models, ”IEEE Design & Test, 2026

  14. [14]

    Enhancing large language models for hardware verification: A novel systemverilog assertion dataset,

    A. Menon, S. Miftah, S. Kundu, S. Kundu, A. Srivastava, A. Raha, G. Sonnenschien, S. Banerjee, D. Mathaikutty, and K. Basu, “Enhancing large language models for hardware verification: A novel systemverilog assertion dataset, ”ACM Transactions on Design Automation of Electronic Systems, 2025

  15. [15]

    Assertionbench: A benchmark to eval- uate large-language models for assertion generation,

    V. Pulavarthi, D. Nandal, S. Dan, and D. Pal, “Assertionbench: A benchmark to eval- uate large-language models for assertion generation, ” inFindings of the Association for Computational Linguistics: NAACL 2025, 2025

  16. [16]

    Fveval: Understanding language model capabilities in formal verification of digital hardware,

    M. Kang, M. Liu, G. B. Hamad, S. M. Suhaib, and H. Ren, “Fveval: Understanding language model capabilities in formal verification of digital hardware, ” in2025 Design, Automation & Test in Europe Conference (DATE), 2025

  17. [17]

    Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Eval- uating Large Language Models and Agents on RTL Design and Verification

    N. Pinckney, C. Deng, C.-T. Ho, Y.-D. Tsai, M. Liu, W. Zhou, B. Khailany, and H. Ren, “Comprehensive verilog design problems: A next-generation benchmark dataset for evaluating large language models and agents on rtl design and verification, ” arXiv preprint arXiv:2506.14074, 2025

  18. [18]

    Assertllm: Generating and evaluating hardware verification assertions from design specifications via multi-llms,

    W. Fang, M. Li, M. Li, Z. Yan, S. Liu, Z. Xie, and H. Zhang, “Assertllm: Generating and evaluating hardware verification assertions from design specifications via multi-llms, ”arXiv preprint arXiv:2402.00386, 2024

  19. [19]

    Detecting defects in software requirements specification,

    A. A. Alshazly, A. M. Elfatatry, and M. S. Abougabal, “Detecting defects in software requirements specification, ”Alexandria Engineering Journal, 2014

  20. [20]

    Natural language is not enough: Benchmarking multi-modal generative ai for verilog generation,

    K. Chang, Z. Chen, Y. Zhou, W. Zhu, K. Wang, H. Xu, C. Li, M. Wang, S. Liang, H. Li et al., “Natural language is not enough: Benchmarking multi-modal generative ai for verilog generation, ” inProceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024

  21. [21]

    FreeCores,

    FreeCores, “FreeCores, ” https://github.com/orgs/freecores

  22. [22]

    Hummingbirdv2 E203 Core and SoC,

    riscv-mcu, “Hummingbirdv2 E203 Core and SoC, ” https://github.com/riscv-mcu/ e203_hbirdv2

  23. [23]

    OpenCores,

    OpenCores, “OpenCores, ” https://opencores.org/

  24. [24]

    VexRiscv: A FPGA friendly 32 bit RISC-V CPU implementation,

    SpinalHDL, “VexRiscv: A FPGA friendly 32 bit RISC-V CPU implementation, ” https://github.com/SpinalHDL/VexRiscv

  25. [25]

    Cadence Design Systems, Inc.,JasperGold Apps User’s Guide, Cadence Design Systems, Inc., Sep. 2015

  26. [26]

    Mantra: Mutation testing of hardware design code based on real bugs,

    J. Wu, Y. Lei, Z. Zhang, X. Meng, D. Yang, P. Li, J. He, and X. Mao, “Mantra: Mutation testing of hardware design code based on real bugs, ” in2023 60th ACM/IEEE Design Automation Conference (DAC), 2023

  27. [27]

    Pyverilog: A python-based hardware design processing toolkit for verilog hdl,

    S. Takamaeda-Yamazaki, “Pyverilog: A python-based hardware design processing toolkit for verilog hdl, ” inInternational Symposium on Applied Reconfigurable Computing, 2015

  28. [28]

    Cadence Design Systems, Inc.,Conformal Equivalence Checking User Guide, Ca- dence Design Systems, Inc., Oct. 2021

  29. [29]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosenet al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, ”arXiv preprint arXiv:2507.06261, 2025

  30. [30]

    Claude sonnet 4.5 system card,

    Anthropic, “Claude sonnet 4.5 system card, ” Anthropic, Tech. Rep., 2025

  31. [31]

    OpenAI GPT-5 System Card

    A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, El-Kishkyet al., “Openai gpt-5 system card, ”arXiv preprint arXiv:2601.03267, 2025

  32. [32]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong et al., “Deepseek-v3.2: Pushing the frontier of open large language models, ”arXiv preprint arXiv:2512.02556, 2025

  33. [33]

    Qwen3-Coder-Next Technical Report

    R. Cao, M. Chen, J. Chen, Z. Cui, Y. Feng, B. Hui, Y. Jing, K. Li, M. Li, J. Linet al., “Qwen3-coder-next technical report, ”arXiv preprint arXiv:2603.00729, 2026

  34. [34]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

    Meta, “The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. ” [Online]. Available: https://ai.meta.com/blog/llama-4-multimodal- intelligence