pith. machine review for the scientific record.

arxiv: 2604.27643 · v1 · submitted 2026-04-30 · 💻 cs.AR · cs.AI


HAVEN: Hybrid Automated Verification ENgine for UVM Testbench Synthesis with LLMs


Pith reviewed 2026-05-07 05:27 UTC · model grok-4.3

classification 💻 cs.AR cs.AI
keywords UVM testbench · LLM-assisted verification · template engine · domain-specific language · hardware verification · IC design · coverage metrics · AXI4-Lite

The pith

HAVEN generates UVM testbenches and sequences by directing LLMs to plan architecture while using fixed protocol templates and a DSL for the actual code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HAVEN to automate UVM testbench creation for integrated circuit verification, where manual effort currently consumes nearly 70 percent of development time. LLMs analyze design specifications to build a high-level plan, but the system avoids direct HDL generation by feeding that plan into a template engine that fills in all UVM components from predefined, protocol-specific Jinja2 templates. For test sequences, HAVEN defines a Protocol-Aware DSL that breaks sequences into step types; predefined patterns deliver strong initial coverage, and LLM agents then read coverage gap reports to add targeted sequences. Experiments on 19 open-source IP designs across Direct, Wishbone, and AXI4-Lite protocols report 100 percent compilation success together with average code coverage of 90.6 percent and functional coverage of 87.9 percent.
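The split the paper describes, LLM plans the architecture, fixed templates emit the code, can be sketched in a few lines. The real system uses Jinja2 templates; the sketch below uses Python's stdlib `string.Template` to stay dependency-free, and every name in it (the plan fields, the template text, `drive_spi`) is illustrative, not taken from the paper.

```python
from string import Template

# Hypothetical, minimal stand-in for one of HAVEN's protocol-specific
# templates. Handshake timing lives in the fixed template text, so the
# LLM never writes HDL directly; it only supplies the plan fields.
DRIVER_TEMPLATE = Template("""\
class ${dut_name}_driver extends uvm_driver #(${txn_type});
  `uvm_component_utils(${dut_name}_driver)
  task run_phase(uvm_phase phase);
    forever begin
      seq_item_port.get_next_item(req);
      drive_${protocol}(req);   // protocol-specific timing, fixed by template
      seq_item_port.item_done();
    end
  endtask
endclass
""")

def render_driver(plan: dict) -> str:
    """Fill the fixed template with fields from the LLM's architectural plan."""
    return DRIVER_TEMPLATE.substitute(
        dut_name=plan["dut_name"],
        txn_type=plan["txn_type"],
        protocol=plan["protocol"],
    )

plan = {"dut_name": "spi_ctrl", "txn_type": "spi_txn", "protocol": "spi"}
print(render_driver(plan))
```

The design point this illustrates: an incorrect plan field yields a wrong but well-formed testbench, whereas a hallucinated token in LLM-written HDL can yield code that does not compile at all.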

Core claim

HAVEN is, per the paper, the first system to use pre-defined, protocol-specific Jinja2 templates to generate all UVM components, with UVM sequences produced through its proposed Protocol-Aware DSL and a rule-based code generator. LLM agents produce an architectural plan from specifications; the Template Engine then combines this plan with the templates to produce components that include correct bus-handshake timing. Sequences begin with predefined DSL patterns for high baseline coverage; LLM agents then iteratively analyze coverage gap reports and compose additional DSL sequences. On 19 open-source IP designs spanning three interface protocols, the system achieves 100 percent compilation success, 90.6 percent code coverage, and 87.9 percent functional coverage on average.

What carries the argument

HAVEN Template Engine paired with the Protocol-Aware Sequence Domain-Specific Language (DSL) and rule-based code generator that fills predefined Jinja2 templates instead of letting LLMs emit HDL directly.

If this is right

  • All UVM components for the three protocols are produced with correct bus-handshake timing without any direct HDL output from the LLM.
  • Predefined DSL patterns establish a high baseline coverage rate before any LLM involvement in sequence creation.
  • Iterative LLM analysis of coverage gap reports produces additional sequences that raise both code and functional coverage.
  • The hybrid pipeline yields 100 percent compilation success and average coverage of 90.6 percent code and 87.9 percent functional across 19 designs.
  • The method outperforms prior LLM-assisted UVM generation approaches on the same set of protocols and designs.
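The DSL-plus-rule-based-generator half of this pipeline can be sketched concretely. The paper does not spell out its step vocabulary in the text above, so the step types (`write`, `read`, `wait`), the predefined pattern, and the emitted SystemVerilog snippets below are all invented for illustration.

```python
from dataclasses import dataclass

# Illustrative DSL step: sequences decompose into fine-grained typed steps.
@dataclass
class Step:
    kind: str            # assumed vocabulary: "write" | "read" | "wait"
    addr: int = 0
    data: int = 0
    cycles: int = 0

# A predefined pattern of the kind that yields baseline coverage
# before any LLM involvement: a small write-then-read address sweep.
SWEEP_PATTERN = (
    [Step("write", addr=a, data=a ^ 0xFF) for a in range(0, 16, 4)]
    + [Step("read", addr=a) for a in range(0, 16, 4)]
)

def generate_body(steps: list[Step]) -> str:
    """Rule-based code generator: each DSL step maps to a fixed SV snippet."""
    lines = []
    for s in steps:
        if s.kind == "write":
            lines.append(f"do_write(32'h{s.addr:08x}, 32'h{s.data:08x});")
        elif s.kind == "read":
            lines.append(f"do_read(32'h{s.addr:08x});")
        elif s.kind == "wait":
            lines.append(f"repeat ({s.cycles}) @(posedge clk);")
        else:
            # Rule-based means undefined steps are rejected, never guessed at.
            raise ValueError(f"unknown DSL step: {s.kind}")
    return "\n".join(lines)

print(generate_body(SWEEP_PATTERN))
```

Because every step maps to a fixed snippet, any sequence the DSL accepts compiles by construction, which is the mechanism behind the 100 percent compilation claim.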

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adding support for a new protocol would require only new templates and DSL patterns, after which the same LLM planning and gap-analysis loop could be reused.
  • The separation between LLM planning and template-driven code generation could be applied to other syntax-heavy generation tasks where raw LLM output is unreliable.
  • Verification teams could shift effort from writing boilerplate UVM code to reviewing the architectural plans and coverage reports produced by the system.
  • The DSL itself could serve as a readable, human-editable intermediate representation even if the LLM layer were removed.

Load-bearing premise

Predefined protocol-specific templates and DSL patterns can be written and maintained to correctly produce every required UVM component and bus-handshake timing for the supported protocols, while LLM agents can reliably interpret coverage gap reports to add sequences without omissions or new errors.
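The second half of this premise, that the LLM step cannot smuggle errors past the rule-based generator, hinges on validating LLM proposals against the DSL before code generation. A hedged sketch of that loop, with the LLM stubbed out and all bin and step names invented:

```python
# Assumed DSL step vocabulary (illustrative; not from the paper).
ALLOWED_STEPS = {"write", "read", "wait"}

def find_gaps(coverage: dict[str, int]) -> list[str]:
    """Bins with zero hits are the gaps fed back to the LLM agent."""
    return [bin_name for bin_name, hits in coverage.items() if hits == 0]

def llm_propose(gaps: list[str]) -> list[dict]:
    """Stand-in for the LLM agent; a real system would prompt a model here."""
    return [{"kind": "write", "addr": 0x10}, {"kind": "wait", "cycles": 3}]

def close_gaps(coverage: dict[str, int]) -> list[dict]:
    gaps = find_gaps(coverage)
    if not gaps:
        return []
    proposal = llm_propose(gaps)
    # The load-bearing check: only steps the DSL defines are accepted,
    # so a hallucinated step is rejected instead of becoming broken HDL.
    bad = [s for s in proposal if s["kind"] not in ALLOWED_STEPS]
    if bad:
        raise ValueError(f"LLM proposed undefined steps: {bad}")
    return proposal

report = {"wr_lo_addr": 12, "wr_hi_addr": 0, "back_to_back": 0}
print(close_gaps(report))
```

Note what the check does and does not guarantee: it blocks malformed output, but a syntactically valid sequence that simply fails to hit the gap passes through, which is exactly the omission risk the premise names.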

What would settle it

Apply HAVEN to an open-source IP core that uses one of the three tested protocols but contains timing or functional features absent from the original 19 designs, then check whether compilation still succeeds at 100 percent and measured coverage remains above 85 percent.

Figures

Figures reproduced from arXiv: 2604.27643 by Chang-Chih Meng, Guan-Yu Lin, I-Chen Wu, Kai-Chiang Wu, Tsung Tai Yeh, Yu-Ren Lu.

Figure 1: Stage 1: Testbench Generation pipeline (driver path shown as a worked example for SPI). The Spec Processor and Architecture agents convert the …
Figure 2: Stage 2: Sequence Optimization pipeline. The DSL Generator agent reads the Structured Spec, UVM Blueprint, Protocol Flows, and the coverage gap …
Figure 3: Driver comparison. (a) LLM-generated code contains three common …
Figure 4: Sequence generation comparison. (a) LLM-generated SystemVerilog …
read the original abstract

Integrated Circuit (IC) verification consumes nearly 70% of the IC development cycle, and recent research leverages Large Language Models (LLMs) to automatically generate testbenches and reduce verification overhead. However, LLMs have difficulty generating testbenches correctly. Unlike high-level programming languages, Hardware Description Languages (HDLs) are extremely rare in LLMs training data, leading LLMs to produce incorrect code. To overcome challenges when using LLMs to generate Universal Verification Methodology (UVM) testbenches and sequences, wepropose HAVEN (Hybrid Automated Verification ENgine) to prevent LLMs from writing HDL directly. For UVM testbench generation, HAVEN utilizes LLM agents to analyze design specifications to produce a structured architectural plan. The HAVEN Template Engine then combines with predefined and protocol-specific templates to generate all UVM components with correct bus-handshake timing. For UVM sequence generation, HAVEN introduces a Protocol-Aware Sequence Domain-Specific Language (DSL) that decomposes sequences into fine-grained step types. A set of predefined DSL patterns first establishes sequences that achieve a high coverage rate without LLM involvement. HAVEN continues to improve the coverage rate by iteratively leveraging LLM agents to analyze coverage gap reports and compose additional targeted DSL sequences. Unlike previous works, HAVEN is the first system that utilizes pre-defined, protocol-specific Jinja2 templates to generate all UVM components and UVM sequences using our proposed Protocol-Aware DSL and rule-based code generator. Our experimental results on 19 open-source IP designs spanning three interface protocols (Direct, Wishbone, AXI4-Lite) show that HAVEN achieves 100% compilation success, 90.6% code coverage, and 87.9% functional coverage on average, and is SOTA among LLM-assisted testbench generation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces HAVEN, a hybrid system for automated UVM testbench and sequence generation that combines LLM agents for design analysis and coverage-gap closure with a Template Engine using pre-defined protocol-specific Jinja2 templates and a Protocol-Aware Sequence DSL plus rule-based generator. It claims this prevents direct HDL generation by LLMs, achieves 100% compilation success, 90.6% code coverage, and 87.9% functional coverage on average across 19 open-source IP designs for Direct, Wishbone, and AXI4-Lite protocols, and is the first such system to use these templates for all UVM components and sequences, making it SOTA among LLM-assisted approaches.

Significance. If the template correctness and experimental reproducibility hold, HAVEN could meaningfully lower the verification overhead (claimed at ~70% of IC development) by providing a more reliable hybrid path than pure LLM generation for UVM testbenches. The hybrid design with DSL decomposition and iterative coverage-driven sequence generation is a practical engineering contribution that could be adopted in industry flows if the templates prove maintainable and generalizable beyond the three protocols.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (HAVEN Template Engine): The central claim that HAVEN is the first system to use pre-defined protocol-specific Jinja2 templates to generate all UVM components with correct bus-handshake timing rests on the assumption that these manually authored templates fully encode every legal timing, reset, and phasing rule for Direct, Wishbone, and AXI4-Lite. No section describes validation of the templates against an independent reference UVM implementation or stress-testing on corner cases (e.g., back-to-back transactions with variable wait states). Without this, the reported 100% compilation success is not demonstrably reproducible or generalizable to new designs exercising untested rules.
  2. [§5] §5 (Experimental Results): The headline metrics (100% compilation, 90.6% code coverage, 87.9% functional coverage) are reported as averages over 19 designs without per-design tables, explicit selection criteria for the open-source IPs, or full baseline comparisons against prior LLM-assisted UVM systems. This makes it impossible to verify the SOTA claim or assess whether the hybrid pipeline's non-LLM half is the decisive factor versus the LLM coverage-closure loop.
  3. [§4] §4 (Protocol-Aware Sequence DSL and LLM agents): The iterative coverage-closure loop assumes LLM agents can reliably analyze gap reports and compose additional DSL sequences without introducing errors or missing cases. The paper provides no failure-mode analysis, error rates for the LLM step, or fallback mechanisms when the rule-based generator plus LLM composition fails to reach the reported functional coverage.
minor comments (3)
  1. [Abstract] Abstract contains a typo: 'wepropose' should be 'we propose'.
  2. [§3] The paper should clarify whether the protocol-specific Jinja2 templates and DSL patterns will be released as open source; without them, the reproducibility of the 100% compilation result is limited.
  3. [§4] Notation for the DSL step types and coverage metrics could be formalized in a table for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our paper. We have addressed each major comment point by point below, proposing revisions to the manuscript where appropriate to improve its clarity and completeness.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (HAVEN Template Engine): The central claim that HAVEN is the first system to use pre-defined protocol-specific Jinja2 templates to generate all UVM components with correct bus-handshake timing rests on the assumption that these manually authored templates fully encode every legal timing, reset, and phasing rule for Direct, Wishbone, and AXI4-Lite. No section describes validation of the templates against an independent reference UVM implementation or stress-testing on corner cases (e.g., back-to-back transactions with variable wait states). Without this, the reported 100% compilation success is not demonstrably reproducible or generalizable to new designs exercising untested rules.

    Authors: We appreciate this observation. The templates in HAVEN were carefully crafted to adhere to the official protocol specifications, which define the legal timing, reset, and phasing rules. However, the original manuscript lacks explicit details on their validation. We will revise Section 3 to include a description of the template validation methodology, including cross-verification with standard UVM reference implementations and targeted testing for corner cases like back-to-back transactions with varying wait states. This addition will better support the reproducibility of our 100% compilation success rate and the generalizability of the approach. revision: yes

  2. Referee: [§5] §5 (Experimental Results): The headline metrics (100% compilation, 90.6% code coverage, 87.9% functional coverage) are reported as averages over 19 designs without per-design tables, explicit selection criteria for the open-source IPs, or full baseline comparisons against prior LLM-assisted UVM systems. This makes it impossible to verify the SOTA claim or assess whether the hybrid pipeline's non-LLM half is the decisive factor versus the LLM coverage-closure loop.

    Authors: We agree that aggregate metrics alone are insufficient for thorough evaluation. In the revised version, we will include a detailed table presenting per-design results for compilation success, code coverage, and functional coverage across all 19 designs. Additionally, we will specify the selection criteria for the open-source IPs, which were selected from established repositories to represent a range of complexities for the Direct, Wishbone, and AXI4-Lite protocols. We will also enhance the baseline comparisons by providing more comprehensive analysis against prior LLM-assisted UVM systems, emphasizing the role of the non-LLM components (templates and DSL) in achieving the reported performance. This will allow readers to better verify the SOTA claim and evaluate the hybrid pipeline's contributions. revision: yes

  3. Referee: [§4] §4 (Protocol-Aware Sequence DSL and LLM agents): The iterative coverage-closure loop assumes LLM agents can reliably analyze gap reports and compose additional DSL sequences without introducing errors or missing cases. The paper provides no failure-mode analysis, error rates for the LLM step, or fallback mechanisms when the rule-based generator plus LLM composition fails to reach the reported functional coverage.

    Authors: This is a valid point regarding the robustness of the LLM agents. The original paper focuses on the overall success of the system but does not detail the internal performance of the iterative loop. We will add to Section 4 an analysis of failure modes observed during our experiments, such as cases where the LLM suggested incomplete sequences, along with approximate error rates derived from our experimental logs. Furthermore, we will describe the fallback mechanisms, including prompt engineering retries and reliance on the predefined DSL patterns when the LLM step does not improve coverage. These additions will provide a more complete picture of how the system maintains high functional coverage. revision: yes

Circularity Check

0 steps flagged

No circularity: applied system paper with empirical results only

full rationale

HAVEN is a hybrid engineering system paper whose central claims rest on the explicit construction of protocol-specific Jinja2 templates, a rule-based DSL generator, and LLM agents for coverage closure, followed by direct experimental measurement on 19 open-source designs. No equations, fitted parameters, predictions, or derivations appear anywhere in the provided text. The performance numbers (100% compilation success, 90.6% code coverage, 87.9% functional coverage) are reported as measured outcomes rather than derived quantities that reduce to prior results by construction. The novelty statement (“first system that utilizes pre-defined, protocol-specific Jinja2 templates”) is an empirical claim about the implemented pipeline, not a self-referential or self-citation-dependent theorem. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The derivation chain is therefore self-contained through implementation description and external benchmark evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on domain assumptions about LLM reliability in analysis tasks and the sufficiency of manually created templates and DSL patterns for the protocols tested. No new physical entities or free parameters are introduced; the system is built from existing LLM capabilities and template technology.

axioms (2)
  • domain assumption LLMs are capable of producing accurate architectural plans from design specifications and analyzing coverage gap reports to generate targeted sequences.
    This is invoked in the description of LLM agents for planning and iterative improvement.
  • domain assumption Predefined protocol-specific templates can correctly implement all UVM components with proper bus-handshake timing for the supported protocols.
    Central to the Template Engine's ability to generate correct code without LLM writing HDL.
invented entities (2)
  • Protocol-Aware Sequence Domain-Specific Language (DSL) no independent evidence
    purpose: To allow decomposition of sequences into fine-grained step types for rule-based generation and LLM-assisted extension.
    Newly proposed in the paper to structure sequence generation.
  • HAVEN Template Engine no independent evidence
    purpose: To combine LLM-generated plans with predefined Jinja2 templates for generating UVM components.
    Core innovation to prevent direct LLM HDL generation.

pith-pipeline@v0.9.0 · 5655 in / 1738 out tokens · 73612 ms · 2026-05-07T05:27:20.582056+00:00 · methodology

