Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes
RTL generation is more than code synthesis: designs must be syntactically valid, synthesizable, functionally correct, and hardware-efficient. State-of-the-art evaluations stop at functional correctness and do not measure synthesis and implementation quality. This paper evaluates 32 language models on 202 Verilog tasks from VerilogEval and RTLLM using the Hardware Quality Index (HQI), which combines post-synthesis area, delay, and warnings relative to expert references in a Nangate45 45\,nm flow. Three performance regimes emerge: 14 frontier models achieve HQI $>$ 66, led by Gemini-3-Pro at 87.5\% coverage and 85.1 HQI; 15 models cluster at 43--66 HQI; and 3 fall below 43. The gap between best-of-five capability and single-attempt quality spans 3.7--22.1 HQI points, which limits integration into agentic pipelines. A taxonomy of 195 synthesis failures reveals a systematic divergence: proprietary models fail late, through elaboration errors and synthesis timeouts, while open models fail early, often due to missing module wrappers and non-synthesizable constructs, a pattern consistent with training corpora skewed toward simulation-grade rather than synthesis-grade RTL.
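The gap between best-of-five capability and single-attempt quality mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the paper aggregates scores, so the aggregation below (per-task maximum versus per-task mean, each averaged over tasks) is an assumption, the function name `capability_reliability_gap` is hypothetical, and the scores are made up for illustration rather than taken from the paper.

```python
from statistics import mean


def capability_reliability_gap(scores_per_task):
    """Return (best_of_k, single_attempt, gap) for per-attempt quality scores.

    scores_per_task: list of lists; each inner list holds the quality
    scores (0-100) of the k attempts made on one task.
    Assumption: best-of-k takes the per-task maximum, single-attempt
    quality takes the per-task mean, and both are averaged over tasks.
    """
    best_of_k = mean(max(attempts) for attempts in scores_per_task)
    single_attempt = mean(mean(attempts) for attempts in scores_per_task)
    return best_of_k, single_attempt, best_of_k - single_attempt


# Made-up scores for three tasks with five attempts each (not paper data).
example = [
    [80.0, 0.0, 60.0, 80.0, 40.0],
    [100.0, 100.0, 100.0, 100.0, 100.0],
    [0.0, 0.0, 50.0, 0.0, 0.0],
]
best, single, gap = capability_reliability_gap(example)
print(f"best-of-5: {best:.1f}  single-attempt: {single:.1f}  gap: {gap:.1f}")
```

Under this assumed aggregation, a large gap means a model can produce a good design within five tries but does so unreliably on any single try, which is the property the abstract ties to agentic pipelines.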
Forward citations
Cited by 3 Pith papers
- Configuration Over Selection: Hyperparameter Sensitivity Exceeds Model Differences in Open-Source LLMs for RTL Generation
  Hyperparameter configuration in open-source LLMs for RTL generation produces up to 25.5% intra-model pass-rate variation on VerilogEval and RTLLM, exceeding inter-model spreads by 5x with near-zero correlation in opti...
- LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges
  A survey of LLM applications in secure hardware design covering EDA synthesis, vulnerability analysis, countermeasures, and educational uses.
- LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges
  LLMs enable RTL code generation and vulnerability analysis in hardware design but introduce data contamination and adversarial risks that require red-teaming and dynamic benchmarking.