pith. machine review for the scientific record. sign in

arxiv: 2604.23100 · v1 · submitted 2026-04-25 · 💻 cs.CR · cs.AR· cs.LO

Recognition: unknown

From Language to Logic: Bridging LLMs & Formal Representations for RTL Assertion Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-08 08:14 UTC · model grok-4.3

classification 💻 cs.CR cs.ARcs.LO
keywords large language modelsformal verificationSystemVerilog assertionsRTL designretrieval-augmented generationReAct agenthardware verificationassertion generation
0
0 comments X

The pith

A tool-augmented agent generates SystemVerilog assertions from natural language hardware specifications with 82 percent functional correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large language models can produce reliable assertions for hardware verification when equipped with autonomous context gathering and iterative formal checking. It shows that an agent first retrieves design details through semantic search over code structures and tool queries, then creates assertions and corrects them across multiple rounds using solver feedback on proof failures. This combination addresses typical model errors such as incorrect references and incomplete timing rules. Evaluation on standard design benchmarks reports 93.7 percent syntax correctness and 82.0 percent functional correctness, with each added element contributing measurable gains. If the approach holds, teams could create verified assertions directly from text descriptions rather than requiring combined expertise in design and temporal logic.

Core claim

The paper claims that a two-phase agent first gathers design context autonomously through retrieval-augmented generation and structural tool queries, then generates and refines assertions using up to three rounds of formal solver feedback, reaching 93.7 percent syntax correctness and 82.0 percent functional correctness on the evaluated benchmarks while an ablation study shows that retrieval, tool integration, and the verification loop each add independent value.

What carries the argument

A two-phase tool-augmented ReAct agent that gathers design context via semantic search and structural queries before iteratively refining assertions through formal proof feedback.

If this is right

  • Assertions for hardware designs can be produced reliably from text specifications with reduced manual effort.
  • Retrieval of design context, formal tool queries, and iterative refinement each improve outcomes independently.
  • Formal verification of digital hardware becomes accessible with less combined expertise in design and temporal logic.
  • Fixed rounds of solver feedback can deliver substantial functional correctness on standard benchmark designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structure of context retrieval plus iterative solver feedback could transfer to generating other formal hardware specifications.
  • Performance on large industrial designs may require adaptive rather than fixed numbers of refinement rounds.
  • Embedding the agent inside existing electronic design automation flows could shorten overall verification timelines.
  • Further gains in functional correctness might appear if the loop incorporates additional sources of design information.

Load-bearing premise

The chosen benchmarks capture the variety of real-world register-transfer level designs and three fixed rounds of solver feedback suffice to reach the reported correctness without overfitting to the test cases.

What would settle it

A sharp drop in functional correctness when the same agent is applied to a fresh collection of register-transfer level designs not included in the original benchmark set.

Figures

Figures reproduced from arXiv: 2604.23100 by Hadi Kamali, Kimia Azar, Nowfel Mashnoor.

Figure 1
Figure 1. Figure 1: shows the ProofLoop framework. Given an RTL design and a natural-language assertion specification, the system operates in 3 stages: (1) AST indexing builds a semantic knowledge base from the RTL, (2) a ReAct agent gathers design context via tool calls in Phase A, and (3) the agent generates SVA, followed by iteratively refining it using JasperGold formal verification in Phase B (see Alg. 1). B. AST Indexin… view at source ↗
Figure 2
Figure 2. Figure 2: Functionality by design complexity. The pipeline’s advantage grows substantially with design size, as the baseline struggles with multi-module designs that exceed LLM context capacity. outperforms in both design categories. The improvement is particularly pronounced on pipeline designs, where multi￾module structural complexity makes tool-augmented con￾text gathering essential; the FVEval baselines rely on … view at source ↗
read the original abstract

SystemVerilog Assertions (SVA) are essential for formal verification of digital hardware, yet their manual creation demands significant expertise in both the design under verification and temporal logic. Recent studies have explored using large language models (LLMs) to automate SVA generation, but existing approaches suffer from incorrect signal references, missing timing constraints, and lack of formal correctness guarantees. This paper presents ProofLoop, a tool-augmented ReAct agent that generates SVA from natural-language specifications using a solver-in-the-loop approach. The agent operates in two phases: Phase A autonomously gathers design context by invoking EDA and formal tools, including semantic search over an AST-indexed vector database and JasperGold structural queries, while Phase B generates SVA and iteratively refines it using JasperGold formal proof feedback over up to fixed (here 3) verification rounds. We evaluate ProofLoop on FVEval Design2SVA design benchmarks and demonstrate that this framework can achieve 93.7% syntax correctness and 82.0% functional correctness. An ablation study confirms that each component, i.e., retrieval-augmented generation (RAG), JasperGold tools, and the verification loop contributes significantly (and orthogonally).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents ProofLoop, a ReAct-style LLM agent that generates SystemVerilog Assertions (SVA) from natural-language specifications. Phase A gathers design context via semantic search over an AST-indexed vector database and JasperGold structural queries; Phase B generates candidate assertions and refines them iteratively (up to a fixed 3 rounds) using JasperGold proof feedback. On the FVEval Design2SVA benchmark suite the framework reports 93.7% syntax correctness and 82.0% functional correctness; an ablation study claims that RAG, JasperGold tooling, and the verification loop each contribute significantly and orthogonally.

Significance. If the reported correctness figures prove robust under broader evaluation, the work would meaningfully advance automated formal verification by closing the loop between LLMs and industrial solvers. The explicit use of solver feedback for refinement and the ablation analysis are positive features that distinguish it from purely generative baselines.

major comments (3)
  1. [Evaluation] Evaluation section: the 82.0% functional-correctness figure is presented without any description of the judgment procedure (manual review, equivalence checking against reference assertions, or property coverage metrics), the ground-truth construction for the FVEval Design2SVA designs, or inter-rater / statistical significance tests. This directly undermines the headline performance claim.
  2. [Evaluation] Benchmark description and experimental setup: no information is given on the size, complexity, or stylistic diversity of the FVEval Design2SVA designs, nor on how the suite was selected. Combined with the fixed three-round JasperGold loop that re-uses the same solver for both refinement and scoring, this leaves open the possibility that the reported numbers reflect overfitting to a narrow set of patterns rather than general capability.
  3. [Ablation study] Ablation study: the claim that the three components contribute 'significantly (and orthogonally)' is not supported by any quantitative analysis of interaction effects, variance decomposition, or cross-validation; the ablation table alone does not establish orthogonality.
minor comments (2)
  1. [Abstract] The abstract states 'fixed (here 3)' verification rounds; the main text should explicitly state whether this limit is a hyper-parameter or a hard cap and report sensitivity to round count.
  2. [Method] Notation for the ReAct agent loop (state, action, observation) is introduced without a compact diagram or pseudocode, making the two-phase description harder to follow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the evaluation and ablation sections.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the 82.0% functional-correctness figure is presented without any description of the judgment procedure (manual review, equivalence checking against reference assertions, or property coverage metrics), the ground-truth construction for the FVEval Design2SVA designs, or inter-rater / statistical significance tests. This directly undermines the headline performance claim.

    Authors: We agree that the manuscript requires explicit details on the functional-correctness judgment procedure. In the revision we will add a dedicated paragraph stating that functional correctness is determined by running the generated SVA through JasperGold and checking whether the property is proved without counterexamples on the provided RTL. Ground-truth assertions are taken directly from the FVEval Design2SVA reference set, which encodes the natural-language intent. Because the metric is produced by an automated formal solver rather than subjective human judgment, inter-rater reliability testing was not performed; we will report the exact number of designs evaluated and any per-design variance. Statistical significance tests will be added where the data permit. revision: yes

  2. Referee: [Evaluation] Benchmark description and experimental setup: no information is given on the size, complexity, or stylistic diversity of the FVEval Design2SVA designs, nor on how the suite was selected. Combined with the fixed three-round JasperGold loop that re-uses the same solver for both refinement and scoring, this leaves open the possibility that the reported numbers reflect overfitting to a narrow set of patterns rather than general capability.

    Authors: We will expand the benchmark section with a table summarizing the FVEval Design2SVA suite: number of designs, average signal count, state-space complexity indicators, and stylistic variety (e.g., control vs. datapath). We will also state the selection criteria (designs drawn from the public FVEval release that contain natural-language specifications). On the reuse of JasperGold for both refinement and final scoring, we will clarify that the three-round cap is a fixed computational budget and that final correctness is measured on the converged assertion; however, we acknowledge the risk of pattern overfitting and will add an explicit limitations paragraph. No new experiments are required for this clarification. revision: partial

  3. Referee: [Ablation study] Ablation study: the claim that the three components contribute 'significantly (and orthogonally)' is not supported by any quantitative analysis of interaction effects, variance decomposition, or cross-validation; the ablation table alone does not establish orthogonality.

    Authors: We accept that the current ablation table alone does not rigorously demonstrate orthogonality. The table shows performance degradation when each component is removed, which we interpreted as independent contributions. In the revision we will either compute simple interaction terms from the existing ablation runs or revise the claim to 'each component contributes substantially, with largely additive effects' while explicitly noting the absence of cross-validation or variance decomposition. This change requires only textual and possibly tabular updates to the existing results. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external benchmarks with no derivations or self-referential reductions.

full rationale

The paper describes ProofLoop as a ReAct agent that uses RAG, EDA tools, and JasperGold feedback in fixed rounds to generate and refine SVA from natural language specs. All reported results (93.7% syntax correctness, 82.0% functional correctness, ablation contributions) are direct empirical measurements on the external FVEval Design2SVA benchmark suite using independent solver outputs. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain; the central claims rest on observable tool outputs and benchmark comparisons rather than any reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper with no mathematical free parameters, axioms, or invented theoretical entities. The contribution is an integrated agent architecture evaluated empirically on existing benchmarks and tools.

pith-pipeline@v0.9.0 · 5522 in / 1139 out tokens · 121656 ms · 2026-05-08T08:14:22.424865+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    A survey on assertion- based hardware verification,

    H. Witharana, Y . Lyu, S. Charles, and P. Mishra, “A survey on assertion- based hardware verification,”ACM Computing Surveys, vol. 54, no. 11s, pp. 1–33, 2022

  2. [2]

    Vijayaraghavan and M

    S. Vijayaraghavan and M. Ramanathan,A Practical Guide for Sys- temVerilog Assertions. Springer, 2005

  3. [3]

    E. M. Clarke, T. A. Henzinger, H. Veith, and R. Bloem,Handbook of Model Checking. Springer, 2018

  4. [4]

    Special session: CAD for hardware security—automation is key to adoption of solutions,

    S. Aftabjahani, R. Kastner, M. Tehranipoor, F. Farahmandi, J. Oberg, A. Nordstrom, N. Fern, and A. Althoff, “Special session: CAD for hardware security—automation is key to adoption of solutions,” in2021 IEEE 39th VLSI Test Symposium (VTS). IEEE, 2021, pp. 1–10

  5. [5]

    GoldMine: Automatic assertion generation using data mining and static analysis,

    S. Vasudevan, D. Sheridan, S. Patel, D. Tcheng, B. Tuohy, and D. John- son, “GoldMine: Automatic assertion generation using data mining and static analysis,” in2010 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2010, pp. 626–629

  6. [6]

    HARM: A hint-based assertion miner,

    S. Germiniani and G. Pravadelli, “HARM: A hint-based assertion miner,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 11, pp. 4277–4288, 2022

  7. [7]

    A-Team: Automatic template-based assertion miner,

    A. Danese, N. Dalla Riva, and G. Pravadelli, “A-Team: Automatic template-based assertion miner,” inProceedings of the 54th Annual Design Automation Conference (DAC). ACM, 2017, pp. 1–6

  8. [8]

    Au- toSV A: Democratizing formal verification of RTL module interactions,

    M. Orenes-Vera, A. Manocha, D. Wentzlaff, and M. Martonosi, “Au- toSV A: Democratizing formal verification of RTL module interactions,” inProceedings of the 58th ACM/IEEE Design Automation Conference (DAC). ACM/IEEE, 2021, pp. 535–540

  9. [9]

    SoC se- curity verification using property checking,

    N. Farzana, F. Rahman, M. Tehranipoor, and F. Farahmandi, “SoC se- curity verification using property checking,” in2019 IEEE International Test Conference (ITC). IEEE, 2019, pp. 1–10

  10. [10]

    (security) assertions by large language models,

    R. Kande, H. Pearce, B. Tan, B. Dolan-Gavitt, S. Thakur, R. Karri, and J. Rajendran, “(security) assertions by large language models,”IEEE Transactions on Information Forensics and Security, 2024

  11. [11]

    Using LLMs to facilitate formal verification of RTL,

    M. Orenes-Vera, M. Martonosi, and D. Wentzlaff, “Using LLMs to facilitate formal verification of RTL,”arXiv preprint arXiv:2309.09437, 2023

  12. [12]

    Towards improving verification productivity with circuit-aware translation of natural language to Sys- temVerilog assertions,

    C. Sun, C. Hahn, and C. Trippel, “Towards improving verification productivity with circuit-aware translation of natural language to Sys- temVerilog assertions,” inProceedings of the DAV Workshop, 2023

  13. [13]

    71 Shailja Thakur, Baleegh Ahmad, Hammond Fan, et al

    S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri, and S. Garg, “VeriGen: A large language model for Verilog code generation,” arXiv preprint arXiv:2308.00708, 2023

  14. [14]

    Decortl: A run-time decoding framework for rtl code generation with llms,

    M. Akyash, K. Azar, and H. Kamali, “Decortl: A run-time decoding framework for rtl code generation with llms,” in2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD). IEEE, 2025, pp. 1–9

  15. [15]

    Rtl- coder: Fully open-source and efficient llm-assisted rtl code generation technique,

    S. Liu, W. Fang, Y . Lu, J. Wang, Q. Zhang, H. Zhang, and Z. Xie, “Rtl- coder: Fully open-source and efficient llm-assisted rtl code generation technique,”IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 4, pp. 1448–1461, 2024

  16. [16]

    Rtl++: Graph-enhanced llm for rtl code generation,

    M. Akyash, K. Azar, and H. Kamali, “Rtl++: Graph-enhanced llm for rtl code generation,” in2025 IEEE International Conference on LLM-Aided Design (ICLAD). IEEE, 2025, pp. 44–50

  17. [17]

    Codev: Empowering llms with hdl generation through multi-level summarization,

    Y . Zhao, D. Huang, C. Li, P. Jin, M. Song, Y . Xu, Z. Nan, M. Gao, T. Ma, L. Qiet al., “Codev: Empowering llms with hdl generation through multi-level summarization,”IEEE Transactions on Computer- Aided Design of Integrated Circuits and Systems, 2025

  18. [18]

    Meltrtl: Multi- expert llms with inference-time intervention for rtl code generation,

    N. Mashnoor, M. Akyash, H. Kamali, and K. Azar, “Meltrtl: Multi- expert llms with inference-time intervention for rtl code generation,” arXiv preprint arXiv:2601.13015, 2026

  19. [19]

    Towards llm-powered verilog rtl assistant: Self-verification and self-correction,

    H. Huang, Z. Lin, Z. Wang, X. Chen, K. Ding, and J. Zhao, “Towards llm-powered verilog rtl assistant: Self-verification and self-correction,” arXiv preprint arXiv:2406.00115, 2024

  20. [20]

    Llm-ift: Llm- powered information flow tracking for secure hardware,

    N. Mashnoor, M. Akyash, H. Kamali, and K. Azar, “Llm-ift: Llm- powered information flow tracking for secure hardware,” in2025 IEEE 43rd VLSI Test Symposium (VTS). IEEE, 2025, pp. 1–5

  21. [21]

    AssertionBench: A benchmark to evaluate large-language models for assertion generation,

    V . Pulavarthi, D. Nandal, S. Dan, and D. Pal, “AssertionBench: A benchmark to evaluate large-language models for assertion generation,” inFindings of the Association for Computational Linguistics: NAACL 2025, 2025

  22. [22]

    Domain- adapted LLMs for VLSI design and verification: A case study on formal verification,

    M. Kang, M. Liu, G. Bany Hamad, S. Suhaib, and H. Ren, “Domain- adapted LLMs for VLSI design and verification: A case study on formal verification,” in2024 IEEE 42nd VLSI Test Symposium (VTS). IEEE, 2024

  23. [23]

    Retrieval- augmented generation for knowledge-intensive NLP tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive NLP tasks,” inAdvances in Neural Information Processing Systems (NeurIPS), 2020

  24. [24]

    DeepV: A model-agnostic retrieval- augmented framework for Verilog code generation with a high-quality knowledge base,

    Z. Ibnat, P. E. Calzada, R. M. Ihtemam, S. K. Saha, J. Zhou, F. Farah- mandi, and M. Tehranipoor, “DeepV: A model-agnostic retrieval- augmented framework for Verilog code generation with a high-quality knowledge base,”arXiv preprint arXiv:2510.05327, 2025

  25. [25]

    FVEval: Understanding language model capabilities in formal verification of digital hardware,

    M. Kang, M. Liu, G. Bany Hamad, S. Suhaib, and H. Ren, “FVEval: Understanding language model capabilities in formal verification of digital hardware,”arXiv preprint arXiv:2410.23299, 2024

  26. [26]

    ChI- RAAG: ChatGPT informed rapid and automated assertion generation,

    B. Mali, K. Maddala, V . Gupta, S. Reddy, C. Karfa, and R. Karri, “ChI- RAAG: ChatGPT informed rapid and automated assertion generation,” in2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 2024, pp. 680–683

  27. [27]

    Assertllm: Generating and evaluating hardware verification assertions from design specifications via multi-llms,

    W. Fang, M. Li, M. Li, Z. Yan, S. Liu, H. Zhang, and Z. Xie, “AssertLLM: Generating and evaluating hardware verification asser- tions from design specifications via multi-LLMs,”arXiv preprint arXiv:2402.00386, 2024

  28. [28]

    SANGAM: SystemVerilog as- sertion generation via Monte Carlo tree self-refine,

    A. Gupta, B. Mali, and C. Karfa, “SANGAM: SystemVerilog as- sertion generation via Monte Carlo tree self-refine,”arXiv preprint arXiv:2506.13983, 2025

  29. [29]

    Saarthi: The first AI formal verification engineer,

    A. Kumar, D. N. Gadde, K. K. Radhakrishna, and D. Lettnin, “Saarthi: The first AI formal verification engineer,”arXiv preprint arXiv:2502.16662, 2025

  30. [30]

    AssertionForge: Enhancing formal verification assertion generation with structured rep- resentation of specifications and RTL,

    Y . Bai, G. Bany Hamad, S. Suhaib, and H. Ren, “AssertionForge: Enhancing formal verification assertion generation with structured rep- resentation of specifications and RTL,” in2025 IEEE International Conference on LLM-Aided Design (ICLAD). IEEE, 2025

  31. [31]

    SV Agent: AI agent for hardware security verification assertion,

    R. Guo, A. Ayalasomayajula, H. Li, J. Zhou, S. K. Saha, and F. Farah- mandi, “SV Agent: AI agent for hardware security verification assertion,” arXiv preprint arXiv:2507.16203, 2025

  32. [32]

    ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

    L. T. Fu, J. Zhou, S. Ren, M. Zhang, J. Xiong, H. Jiang, N. Guan, X. Wang, and J. Yang, “ChatSV A: Bridging SV A genera- tion for hardware verification via task-specific LLMs,”arXiv preprint arXiv:2604.02811, 2026

  33. [33]

    Enhancing large language models for hardware ver- ification: A novel SystemVerilog assertion dataset,

    A. Menonet al., “Enhancing large language models for hardware ver- ification: A novel SystemVerilog assertion dataset,”ACM Transactions on Design Automation of Electronic Systems, 2025

  34. [34]

    Rtlfixer: Automatically fixing rtl syntax errors with large language models,

    Y .-D. Tsai, M. Liu, and H. Ren, “RTLFixer: Automatically fix- ing RTL syntax errors with large language models,”arXiv preprint arXiv:2311.16543, 2023

  35. [35]

    Automatically improving LLM-based Verilog generation using EDA tool feedback,

    J. Blocklove, S. Thakur, B. Tan, H. Pearce, S. Garg, and R. Karri, “Automatically improving LLM-based Verilog generation using EDA tool feedback,”ACM Transactions on Design Automation of Electronic Systems, 2025

  36. [36]

    LLM-guided formal verification coupled with mutation testing,

    M. Hassan, S. Ahmadi-Pour, K. Qayyum, C. K. Jha, and R. Drechsler, “LLM-guided formal verification coupled with mutation testing,” in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2024, pp. 1–6

  37. [37]

    Yosys—a free Verilog synthesis suite,

    C. Wolf, J. Glaser, and J. Kepler, “Yosys—a free Verilog synthesis suite,” inProceedings of the 21st Austrian Workshop on Microelectronics (Austrochip), vol. 97, 2013