Automated SVA Generation with LLMs
Pith reviewed 2026-05-10 16:09 UTC · model grok-4.3
The pith
SVA Generator translates natural-language property descriptions into SystemVerilog Assertions with substantially higher semantic fidelity than general LLMs on complex cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SVA Generator employs AST-grounded constraint injection and an automated supervision pipeline to translate SVADs into SVAs, achieving comparable Syntax Pass Rates to strong general LLM baselines while delivering substantially higher Semantic Equivalence Rates on deeper tiers (+24.5 pp on D2, +26.0 pp on D3, +17.5 pp on D4), for a +22.7 pp average SER improvement over D2–D4, as quantified by formal mutual-implication checks.
What carries the argument
AST-grounded constraint injection combined with an automated supervision pipeline that enforces structural consistency and reduces hallucinations via de-duplication and constraint checks.
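The de-duplication and constraint-check stages can be sketched as a simple candidate filter. This is an illustrative reconstruction, not the paper's pipeline: `normalize`, `passes_constraints`, and `supervise` are hypothetical helpers, and the specific checks (balanced parentheses, an explicit clocking event, interface-only signal names) are plausible stand-ins for the AST-grounded constraints the paper describes.

```python
import re

def normalize(sva: str) -> str:
    """Canonicalize whitespace so trivially duplicated candidates collide."""
    return re.sub(r"\s+", " ", sva.strip())

def passes_constraints(sva: str, allowed_signals: set[str]) -> bool:
    """Lightweight structural checks (hypothetical stand-ins for the paper's
    AST-grounded constraint checks): balanced parentheses, an explicit
    clocking event, and no signal names outside the design's interface."""
    depth = 0
    for ch in sva:
        depth += ch == "("
        depth -= ch == ")"
        if depth < 0:          # closing paren with no matching open
            return False
    if depth != 0:
        return False
    if "@(posedge" not in sva and "@(negedge" not in sva:
        return False
    idents = set(re.findall(r"[A-Za-z_]\w*", sva))
    keywords = {"posedge", "negedge", "property", "endproperty",
                "disable", "iff", "not", "and", "or", "until", "s_until"}
    return idents - keywords <= allowed_signals

def supervise(candidates: list[str], allowed_signals: set[str]) -> list[str]:
    """De-duplicate, then keep only candidates passing the constraint checks."""
    seen, kept = set(), []
    for c in candidates:
        key = normalize(c)
        if key in seen:
            continue
        seen.add(key)
        if passes_constraints(c, allowed_signals):
            kept.append(c)
    return kept
```

Given four candidates where one is a whitespace duplicate, one references a signal outside the interface, and one lacks a clocking event, only the first well-formed assertion survives.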
If this is right
- Assertion writing in IC verification can shift from manual effort toward reliable LLM-assisted generation for properties of increasing structural depth.
- Separate measurement of syntax and semantic equivalence provides a clearer signal for iterating on domain-specific LLM pipelines than syntax alone.
- Depth-stratified benchmarks expose where general LLMs lose semantic fidelity, guiding targeted improvements beyond surface-level generation.
- The supervision pipeline's de-duplication and constraint steps can be reused to stabilize outputs in other hardware-description tasks.
Where Pith is reading between the lines
- The same constraint-injection pattern could be adapted to generate properties in other temporal-logic or assertion languages used in software and hardware.
- Integration with existing EDA simulation and formal tools would allow end-to-end testing of the generated SVAs on industrial designs.
- Extending the benchmark to include real-world industrial SVAs would test whether the observed gains hold outside the curated AST-depth tiers.
Load-bearing premise
Formal property equivalence checking via mutual implication under identical clocking and environment assumptions fully captures semantic correctness without overlooking subtle behavioral differences.
What would settle it
A set of generated and reference properties that pass the mutual-implication check yet produce different simulation traces or fail under altered environment assumptions.
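A witness of exactly this kind can be hunted for with a bounded trace enumeration. The sketch below is an assumption-laden toy model, not the paper's checker: `holds_p` and `holds_q` are hand-written predicates over finite `(req, ack, rst)` traces, `holds_q` adds a `disable iff (rst)`-style escape, and the environment assumption pins reset low. The two properties pass a bounded mutual-implication check under the assumption, yet diverge once it is lifted.

```python
from itertools import product

def holds_p(trace):
    """P: every req is followed one cycle later by ack."""
    return all(not req or (i + 1 < len(trace) and trace[i + 1][1])
               for i, (req, _, _) in enumerate(trace))

def holds_q(trace):
    """Q: like P, but ack may be skipped when reset is high on the next
    cycle (a toy analogue of a `disable iff (rst)` escape)."""
    return all(not req or (i + 1 < len(trace)
                           and (trace[i + 1][1] or trace[i + 1][2]))
               for i, (req, _, _) in enumerate(trace))

def env_ok(trace):
    """Environment assumption: reset is never asserted."""
    return all(not rst for (_, _, rst) in trace)

def equivalent_under(assume, depth=4):
    """Bounded mutual-implication check: do P and Q agree on every
    (req, ack, rst) trace of length <= depth that satisfies the
    assumption? Returns (verdict, counterexample trace or None)."""
    for n in range(1, depth + 1):
        for bits in product([False, True], repeat=3 * n):
            trace = [tuple(bits[3 * i:3 * i + 3]) for i in range(n)]
            if assume(trace) and holds_p(trace) != holds_q(trace):
                return False, trace
    return True, None
```

With `env_ok` as the assumption the checker reports equivalence; with the assumption dropped it returns a trace that raises reset after a request, which Q tolerates and P rejects.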
Original abstract
Functional verification remains a dominant cost in modern IC development, and SystemVerilog Assertions (SVAs) are critical for simulation-based monitoring and formal property checking. However, writing SVAs by hand is time-consuming and error-prone. Directly prompting general-purpose large language models (LLMs) is also unreliable: the generated properties are often syntactically invalid or semantically incorrect, and the problem is exacerbated by scarce, high-quality domain training data. We present SVA Generator, a data-centric framework that translates natural-language SVA Descriptions (SVADs) into executable SVAs. It uses AST-grounded constraint injection and an automated supervision pipeline that enforces structural consistency and reduces hallucinations via de-duplication and constraint checks. To enable rigorous evaluation, we introduce a benchmark suite stratified by AST depth and use formal property equivalence checking to quantify semantic correctness separately from syntax validity, by checking mutual implication between the generated and reference properties under the same clocking and environment assumptions. Across all difficulty tiers, SVA Generator achieves comparable Syntax Pass Rate (SPR) to strong general LLM baselines, while delivering substantially higher Semantic Equivalence Rate (SER) on deeper tiers: +24.5 pp on D2, +26.0 pp on D3, and +17.5 pp on D4 relative to the best-performing general LLM, corresponding to a +22.7 pp SER improvement on average over D2–D4. These results highlight that high-fidelity data construction and depth-stratified benchmarking are key to reliable, semantics-preserving SVA generation.
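The depth stratification the abstract relies on can be illustrated with a toy AST-depth computation. The nested-tuple encoding and the D1–D4 cut-offs below are illustrative assumptions; the paper does not specify its actual tier boundaries here.

```python
def ast_depth(node) -> int:
    """Depth of a property AST; leaves (signal names) count as depth 1.
    Internal nodes are (operator, *operands) tuples."""
    if isinstance(node, str):
        return 1
    return 1 + max(ast_depth(child) for child in node[1:])

def tier(node) -> str:
    """Bucket a property into a difficulty tier by AST depth.
    The cut-offs are illustrative, not the paper's actual D1-D4 bounds."""
    d = ast_depth(node)
    if d <= 2:
        return "D1"
    if d == 3:
        return "D2"
    if d == 4:
        return "D3"
    return "D4"

# "@(posedge clk) req |-> ##1 ack" encoded as a nested (operator, *operands) tuple
prop = ("clocked", ("posedge", "clk"), ("|->", "req", ("##1", "ack")))
```

Under this encoding the example property has depth 4 and lands in tier D3, while a bare signal like `"req"` sits in D1.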
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SVA Generator, a data-centric framework that translates natural-language SVA Descriptions (SVADs) into executable SystemVerilog Assertions using LLMs augmented with AST-grounded constraint injection and an automated supervision pipeline (de-duplication and constraint checks) to reduce hallucinations. It introduces an AST-depth-stratified benchmark suite and evaluates using Syntax Pass Rate (SPR) together with Semantic Equivalence Rate (SER) obtained via formal mutual-implication checking under fixed clocking and environment assumptions. The central empirical claim is that the approach matches strong general-purpose LLM baselines on SPR while delivering substantially higher SER on deeper tiers (+24.5 pp on D2, +26.0 pp on D3, +17.5 pp on D4; +22.7 pp average over D2–D4).
Significance. If the reported SER gains are robust, the work would offer a practical advance in reducing the dominant cost of functional verification in IC design by improving the reliability of automated SVA generation, especially for complex properties. The explicit separation of syntactic validity from semantic equivalence via formal checking, together with depth-stratified benchmarking, constitutes a methodological strength that is still uncommon in LLM-for-hardware papers and could serve as a template for future evaluations.
major comments (2)
- [Evaluation section] The headline SER improvement is obtained exclusively by checking mutual implication between generated and reference properties under identical clocking and environment assumptions. This metric can miss subtle SVA semantic divergences (reset handling, disable iff, overlapping assertions, or liveness vs. safety distinctions) that do not produce counter-examples under the supplied assumptions. The manuscript provides no validation of the assumptions against broader trace sets or alternative checkers, so the +22.7 pp average gain may be partly an artifact of the chosen equivalence definition rather than true semantic fidelity.
- [Benchmark construction section] The AST-depth stratification is load-bearing for the claim that gains are larger on deeper tiers. The paper does not report how the natural-language SVADs were authored, reviewed for realism, or sampled from industrial designs; without such evidence it remains unclear whether the D2–D4 subsets are representative of real-world SVA usage, limiting the generalizability of the reported improvements.
minor comments (2)
- [Abstract and §3] The supervision pipeline is described only at a high level ('de-duplication and constraint checks'); a concise enumeration of the concrete checks performed would improve reproducibility.
- [Throughout] Acronyms SPR, SER, SVAD, and AST are used before their first definitions; a single glossary or consistent first-use expansion would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We are pleased that the significance of the work and the methodological contributions are recognized. We provide point-by-point responses to the major comments below, indicating where revisions will be made.
Point-by-point responses
-
Referee: [Evaluation section] The headline SER improvement is obtained exclusively by checking mutual implication between generated and reference properties under identical clocking and environment assumptions. This metric can miss subtle SVA semantic divergences (reset handling, disable iff, overlapping assertions, or liveness vs. safety distinctions) that do not produce counter-examples under the supplied assumptions. The manuscript provides no validation of the assumptions against broader trace sets or alternative checkers, so the +22.7 pp average gain may be partly an artifact of the chosen equivalence definition rather than true semantic fidelity.
Authors: We agree that mutual implication checking under fixed assumptions provides a conservative but not exhaustive measure of semantic equivalence, as it may not detect divergences that only manifest under different reset conditions or liveness properties outside the assumed environment. This is a known limitation of bounded formal checking in general. However, by using the same assumptions for both generated and reference properties, the metric fairly assesses whether the LLM-generated assertion captures the intended semantics as expressed in the reference under standard verification setups. We will revise the evaluation section to explicitly discuss these limitations of the SER metric and suggest directions for more comprehensive validation, such as using broader trace sets or multiple checkers in future extensions. The reported gains still demonstrate improved fidelity to the reference properties within the evaluated scope. revision: partial
-
Referee: [Benchmark construction section] The AST-depth stratification is load-bearing for the claim that gains are larger on deeper tiers. The paper does not report how the natural-language SVADs were authored, reviewed for realism, or sampled from industrial designs; without such evidence it remains unclear whether the D2–D4 subsets are representative of real-world SVA usage, limiting the generalizability of the reported improvements.
Authors: The SVADs were manually authored by the authors to correspond to properties with controlled AST depths, based on common SVA patterns from verification textbooks and our experience with industrial designs. They were reviewed internally for syntactic correctness and semantic clarity. However, the manuscript does not detail this process or claim statistical sampling from a large industrial corpus. We acknowledge that this affects claims of broad representativeness. We will revise the benchmark construction section to describe the authoring and review process, and to clarify that the benchmark is intended as a controlled, depth-stratified suite rather than a comprehensive industrial dataset. This will help readers better interpret the generalizability of the results. revision: yes
Circularity Check
No circularity: empirical metrics defined independently of method internals
Full rationale
The paper presents an empirical framework (SVA Generator) with AST-grounded constraints and a supervision pipeline, evaluated via externally defined metrics (SPR for syntax, SER via formal mutual implication under fixed clocking/environment assumptions) against independent general LLM baselines. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or description. The benchmark stratification and equivalence checking are presented as separate evaluation tools, not derived from the method itself. The evaluation is therefore grounded in external baselines and benchmarks rather than in the method's own definitions.
Reference graph
Works this paper leans on
- [1] H. Foster, "The 2022 Wilson Research Group Functional Verification Study," https://blogs.sw.siemens.com/verificationhorizons/2023/01/23/epilogue-the-2022-wilson-research-group-functional-verification-study/, 2023, accessed 2025-07-14.
- [2] H. Witharana, Y. Lyu, S. Charles, and P. Mishra, "A survey on assertion-based hardware verification," ACM Computing Surveys (CSUR), vol. 54, no. 11s, pp. 1–33, 2022.
- [3] O. Hasan and S. Tahar, "Formal verification methods," in Encyclopedia of Information Science and Technology, Third Edition. IGI Global, 2015, pp. 7162–7170.
- [4] Q. Xu, L. Stok, R. Drechsler, X. Wang, G. L. Zhang, and I. L. Markov, "Revolution or hype? Seeking the limits of large models in hardware design," in 2025 IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2025, pp. 1–9.
- [5] X. Wang, G. W. Wan, S. Z. Wong, L. Zhang, T. Liu, Q. Tian, and J. Ye, "ChatCPU: An agile CPU design and verification platform with LLM," in Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC '24), New York, NY, USA: Association for Computing Machinery, 2024. [Online]. Available: https://doi.org/10.1145/3649329.3658493
- [7] R. Li, J. Xiong, and X. Wang, "iDSE: Navigating design space exploration in high-level synthesis using LLMs," 2025. [Online]. Available: https://arxiv.org/abs/2505.22086
- [8] J. Niu, X. Liu, D. Niu, X. Wang, Z. Jiang, and N. Guan, "ReChisel: Effective automatic Chisel code generation by LLM with reflection," 2025, to appear. [Online]. Available: https://arxiv.org/abs/2505.19734
- [9] B. Yao, N. Wang, J. Zhou, X. Wang, H. Gao, Z. Jiang, and N. Guan, "Location is key: Leveraging LLM for functional bug localization in Verilog design," in Proceedings of the 62nd Annual ACM/IEEE Design Automation Conference (DAC '25), IEEE Press, 2025.
- [10] S. Z. Wong, G.-W. Wan, D. Liu, and X. Wang, "VGV: Verilog generation using visual capabilities of multi-modal large language models," in 2024 IEEE LLM Aided Design Workshop (LAD), 2024, pp. 1–5.
- [11] G.-W. Wan, S. Wong, S. Su, C. Niu, N. Wang, X. Wan, Q. Chen, M. Xing, J. Zhang, J. Ye, Y. Wang, R. Song, T. Ni, Q. Xu, N. Guan, Z. Jiang, X. Wang, Y. Chen, and J. Yang, "FixMe: Towards end-to-end benchmarking of LLM-aided design verification," Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1087–1095, 2026.
- [12] G. W. Wan, Y. Wang, S. Z. Wong, J. Xiong, Q. Chen, J. Zhang, M. Zhang, T. Ni, M. Xing, Y. Hua, Z. Jiang, N. Guan, Y. Wang, N. Xu, Q. Xu, and X. Wang, "GenBen: A generative benchmark for LLM-aided design," 2025. [Online]. Available: https://openreview.net/forum?id=gtVo4xcpFI
- [13] Z. Yan, W. Fang, M. Li, M. Li, S. Liu, Z. Xie, and H. Zhang, "AssertLLM: Generating hardware verification assertions from design specifications via multi-LLMs," in Proceedings of the 30th Asia and South Pacific Design Automation Conference (ASPDAC '25), New York, NY, USA: Association for Computing Machinery, 2025, pp. 614–621.
- [14] L. T. Fu, J. Zhou, S. Ren, M. Zhang, J. Xiong, H. Jiang, N. Guan, X. Wang, and J. Yang, "ChatSVA: Bridging SVA generation for hardware verification via task-specific LLMs," in Proceedings of the 63rd ACM/IEEE Design Automation Conference (DAC '26), 2026, to appear. [Online]. Available: https://arxiv.org/abs/2604.02811
- [15] S. Germiniani and G. Pravadelli, "HARM: A hint-based assertion miner," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 11, pp. 4277–4288, 2022.
- [16] A. Danese, N. D. Riva, and G. Pravadelli, "A-TEAM: Automatic template-based assertion miner," in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.
- [17] S. Vasudevan, D. Sheridan, S. Patel, D. Tcheng, B. Tuohy, and D. Johnson, "GoldMine: Automatic assertion generation using data mining and static analysis," in 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), 2010, pp. 626–629.
- [18] C. B. Harris and I. G. Harris, "GLAsT: Learning formal grammars to translate natural language specifications into hardware assertions," in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016, pp. 966–971.
- [19] J. Zhao and I. G. Harris, "Automatic assertion generation from natural language specifications using subtree analysis," in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2019, pp. 598–601.
- [20] R. Krishnamurthy and M. S. Hsiao, "Controlled natural language framework for generating assertions from hardware specifications," in 2019 IEEE 13th International Conference on Semantic Computing (ICSC), 2019, pp. 367–370.
- [21] S. J. Frederiksen, J. Aromando, and M. S. Hsiao, "Automated assertion generation from natural language specifications," in 2020 IEEE International Test Conference (ITC), 2020, pp. 1–5.
- [22] O. Keszocze and I. G. Harris, "Chatbot-based assertion generation from natural language specifications," in 2019 Forum for Specification and Design Languages (FDL), 2019, pp. 1–6.
- [23] F. Aditi and M. S. Hsiao, "Hybrid rule-based and machine learning system for assertion generation from natural language specifications," in 2022 IEEE 31st Asian Test Symposium (ATS), 2022, pp. 126–131.
- [24] G. Parthasarathy, S. Nanda, P. Choudhary, and P. Patil, "SpecToSVA: Circuit specification document to SystemVerilog assertion translation," in 2021 Second Document Intelligence Workshop at KDD, 2021.
- [25] K. Xu, J. Sun, Y. Hu, X. Fang, W. Shan, X. Wang, and Z. Jiang, "MEIC: Re-thinking RTL debug automation using LLMs," New York, NY, USA: Association for Computing Machinery, 2025.
- [26] Y. Hu, J. Ye, K. Xu, J. Sun, S. Zhang, X. Jiao, D. Pan, J. Zhou, N. Wang, W. Shan, X. Fang, X. Wang, N. Guan, and Z. Jiang, "UVLLM: An automated universal RTL verification framework using LLMs," in Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC '25), 2025, to appear. [Online]. Available: https://arxiv.org/abs/2411.16238
- [27] T. Liu, Q. Tian, J. Ye, L. Fu, S. Su, J. Li, G.-W. Wan, L. Zhang, S.-Z. Wong, X. Wang et al., "ChatChisel: Enabling agile hardware design with large language models," in 2024 2nd International Symposium of Electronics Design Automation (ISEDA), IEEE, 2024, pp. 710–716.
- [28] M. Orenes-Vera, M. Martonosi, and D. Wentzlaff, "From RTL to SVA: LLM-assisted generation of formal verification testbenches," arXiv preprint arXiv:2309.09437, 2023.
- [29] R. Kande, H. Pearce, B. Tan, B. Dolan-Gavitt, S. Thakur, R. Karri, and J. Rajendran, "(Security) assertions by large language models," IEEE Transactions on Information Forensics and Security, vol. 19, pp. 4374–4389, 2024.
- [30] F. Aditi and M. S. Hsiao, "Validatable generation of SystemVerilog assertions from natural language specifications," in 2023 Fifth International Conference on Transdisciplinary AI (TransAI), 2023, pp. 102–109.
- [31] M. Shahidzadeh, B. Ghavami, S. Wilton, and L. Shannon, "Automatic high-quality Verilog assertion generation through subtask-focused fine-tuned LLMs and iterative prompting," arXiv, vol. abs/2411.15442, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:274235019
- [32] C. Sun, C. Hahn, and C. Trippel, "Towards improving verification productivity with circuit-aware translation of natural language to SystemVerilog assertions," in First International Workshop on Deep Learning-aided Verification, 2023. [Online]. Available: https://openreview.net/forum?id=FKH8qCuM44
- [33] M. Liu, M. Kang, G. B. Hamad, S. Suhaib, and H. Ren, "Domain-adapted LLMs for VLSI design and verification: A case study on formal verification," in 2024 IEEE 42nd VLSI Test Symposium (VTS), 2024, pp. 1–4.
- [34] Y. Bai, G. B. Hamad, S. Suhaib, and H. Ren, "AssertionForge: Enhancing formal verification assertion generation with structured representation of specifications and RTL," in 2025 IEEE International Conference on LLM-Aided Design (ICLAD), 2025, pp. 85–92.
- [35] B. Mali, K. Maddala, V. Gupta, S. Reddy, C. Karfa, and R. Karri, "ChIRAAG: ChatGPT informed rapid and automated assertion generation," in 2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2024, pp. 680–683.
- [36] C. Xing, S. Wong, X. Wan, Y. Lu, M. Zhang, Z. Ma, L. Qi, Z. Li, N. Guan, Z. Jiang, X. Wang, and J. Yang, "ChipMind: Retrieval-augmented reasoning for long-context circuit design specifications," in Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026, to appear. [Online]. Available: https://arxiv.org/abs/2512.05371
- [37] Chips Alliance, "Verible: SystemVerilog developer's toolkit," https://github.com/chipsalliance/verible, 2019, accessed 2026-01-29.