PDAGENT-BENCH: Characterizing, Grounding, and Architecting LLM Agents for VLSI Physical Design

Qiufeng Li; Rongqian Chen; Quan Cheng; Chengxuan Wang; Sizhe Tang; Wuxi Li; Duo Ding; Chia-Tung Ho; Haoxing Ren; David Z. Pan; Tian Lan; Weidong Cao

arxiv: 2606.17253 · v1 · pith:PHDH2H4Ynew · submitted 2026-06-15 · 💻 cs.AR


Pith reviewed 2026-06-27 02:08 UTC · model grok-4.3

classification 💻 cs.AR

keywords: LLM agents · VLSI physical design · EDA benchmark · script generation · agentic workflows · physical design automation · Innovus · multi-stage reasoning


The pith

PDAGENT-BENCH shows that current LLMs handle VLSI conceptual questions well but reach only 42.2 percent on Innovus script generation and remain weak on long-horizon, multi-stage workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PDAGENT-BENCH to measure how effectively large language and vision-language models can serve as agents throughout the physical design stage of chip creation. It supplies 353 expert-validated problems that mix conceptual questions with actual industrial design artifacts and executable solutions across five dimensions from basic knowledge to complete flows. Tests on eleven leading models indicate solid results on understanding but clear shortfalls in tool use and extended reasoning chains. Human-skill-enhanced versions of the agent workflows raise end-to-end performance, indicating one workable route to better automation. This evaluation framework matters because physical design involves tight constraints and repeated tool calls where automation has lagged behind front-end design tasks.

Core claim

The central claim is that modern LLMs and VLMs perform competitively on conceptual VLSI physical design tasks yet remain substantially limited in tool-centric execution (for example, 42.2 percent on Innovus script generation) and in long-horizon, multi-stage reasoning. Human-skill-enhanced agentic workflows, by contrast, produce significant gains in complete physical design outcomes. Both findings are measured with the new PDAGENT-BENCH suite of 353 problems and its unified workflow framework.

What carries the argument

PDAGENT-BENCH, the benchmark that combines 353 curated problems across five capability dimensions with a unified human-aligned agentic physical design workflow framework for closed-loop evaluation inside realistic EDA tool environments.
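
The closed-loop idea can be pictured with a small sketch. This is not the paper's framework: `closed_loop_attempt`, `stub_generate`, and `stub_run_tool` are hypothetical names, and the EDA tool is replaced by a stub that only inspects a string. The sketch shows the general shape of generate, execute, and refine, in which the log of a failed run is fed back to the generator until the run passes or the round budget is spent.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, for illustration only.
Generator = Callable[[Optional[str]], str]      # feedback log -> script text
ToolRunner = Callable[[str], Tuple[bool, str]]  # script text -> (passed, log)


def closed_loop_attempt(generate: Generator, run_tool: ToolRunner,
                        max_rounds: int = 3) -> Tuple[bool, int]:
    """Return (passed, rounds_used) for one benchmark problem."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    for round_index in range(1, max_rounds + 1):
        script = generate(feedback)
        passed, log = run_tool(script)
        if passed:
            return True, round_index
        feedback = log  # the tool log drives the next revision
    return False, max_rounds


def stub_generate(feedback: Optional[str]) -> str:
    # Stand-in for a model call: revises only after seeing an error log.
    return "script_v1" if feedback is None else "script_v2"


def stub_run_tool(script: str) -> Tuple[bool, str]:
    # Stand-in for an EDA tool run: only the revised script is accepted.
    if script == "script_v2":
        return True, "ok"
    return False, "error: first draft rejected by stub"
```

With the stubs above, the first draft fails and the revision passes, so a budget of one round reports failure while the default budget reports success on the second round. A single-pass evaluation corresponds to `max_rounds=1`.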

If this is right

Models require targeted gains in EDA tool interaction and iterative refinement to close the gap between conceptual understanding and executable physical design.
Human-skill-enhanced agent workflows provide a measurable near-term route to higher end-to-end physical design quality.
Performance varies across the five dimensions, with tool-centric tasks such as script generation and long-horizon, multi-stage tasks showing the largest shortfalls among tested models.
The benchmark supplies a reproducible yardstick that lets future agents be compared directly on the same set of industrial artifacts.
Limitations in long-horizon reasoning suggest that single-pass or short-context approaches will continue to underperform on multi-stage design flows.
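
Comparing agents on the same problem set comes down to grouping pass/fail outcomes by capability dimension. The following is a minimal sketch under stated assumptions: the five dimension names come from the abstract, while the `Result` record, the identifier strings, and the pass/fail simplification are hypothetical and do not reflect the benchmark's actual data format or scoring.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# The five capability dimensions named in the abstract
# (identifier spellings are an assumption of this sketch).
DIMENSIONS = (
    "foundational_knowledge",
    "report_comprehension",
    "root_cause_analysis",
    "script_generation",
    "full_flow_implementation",
)


@dataclass(frozen=True)
class Result:
    """Hypothetical per-problem outcome record."""
    problem_id: str
    dimension: str
    passed: bool


def success_rate_by_dimension(results: Iterable[Result]) -> Dict[str, float]:
    """Fraction of passed problems per dimension; empty dimensions are omitted."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for r in results:
        if r.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {r.dimension}")
        totals[r.dimension] += 1
        passes[r.dimension] += int(r.passed)
    return {d: passes[d] / totals[d] for d in DIMENSIONS if totals[d]}


# Toy data, not benchmark results.
toy_results = [
    Result("p1", "script_generation", True),
    Result("p2", "script_generation", False),
    Result("p3", "foundational_knowledge", True),
]
toy_rates = success_rate_by_dimension(toy_results)
```

Reporting the rates per dimension rather than as one pooled number keeps the contrast between conceptual and tool-centric tasks visible.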

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same problem set could be reused to track whether newer models close the execution gap over successive releases.
Extending the workflow framework to include visual layout feedback loops might improve handling of geometry-related constraints.
The benchmark structure could be mirrored for adjacent EDA stages such as logic synthesis or timing closure to create comparable agent evaluations.
Teams might test whether fine-tuning on the 353 reference solutions lifts script-generation scores without changing the base model architecture.

Load-bearing premise

The 353 curated problems together with their expert-validated references and the unified workflow framework represent typical industrial VLSI physical design challenges, tool interactions, and constraints without large selection bias.

What would settle it

A new model that reaches above 70 percent success on the full-flow implementation tasks using only its own agentic workflow without added human skills would indicate the reported execution limits are not as general as claimed.

Figures

Figures reproduced from arXiv: 2606.17253 by Chengxuan Wang, Chia-Tung Ho, David Z. Pan, Duo Ding, Haoxing Ren, Qiufeng Li, Quan Cheng, Rongqian Chen, Sizhe Tang, Tian Lan, Weidong Cao, Wuxi Li.

**Figure 2.** (a) Overview of the PDAgent framework. Five specialized agents (Planner, Worker, …

**Figure 4.** Characteristics of the foundational-knowledge subset of …

**Figure 5.** Characteristics of the root-cause analysis subset of …

**Figure 6.** Characteristics of the report-comprehension subset of …

**Figure 7.** Characteristics of the script generation subset of …

**Figure 8.** System prompt used for Physical Design Understanding tasks.

**Figure 9.** System prompt used at inference time for the script generation tasks (Innovus, ICC2, ECO, …

**Figure 10.** LLM-Judge prompt used to score model responses on the Physical Design Understanding …

**Figure 11.** Example question and reference answer from the Basic benchmark (Physical Design …

**Figure 12.** Scoring rubric corresponding to Figure 11, used by the LLM-Judge to grade model …

Original abstract

Large Language Models and vision-language models have shown remarkable success in the front-end design of Very Large-Scale Integrated Circuits, yet their capabilities for VLSI physical design remain significantly underexplored. The primary cause is the lack of standardized benchmarks for evaluating agentic physical design workflows that require high-dimensional, multi-stage optimization under strict design constraints, coordinated interaction with diverse Electronic Design Automation tools, and iterative refinement. This work introduces PDAGENT-BENCH, a comprehensive and multi-dimensional benchmark for evaluating LLM/VLM-based agents across the physical design stack. PDAGENT-BENCH integrates both task-level assessment and workflow-level execution. The benchmark suite contains 353 curated problems that combine conceptual questions with real-world industrial artifacts, with expert-validated references and executable solutions. These tasks cover five key capability dimensions: foundational knowledge, report comprehension, root-cause analysis, script generation, and full-flow implementation. In addition, the benchmark provides a unified, human-aligned agentic physical design workflow framework that enables closed-loop evaluation of holistic physical design in realistic EDA environments. Experiments on 11 state-of-the-art models reveal that while modern LLMs/VLMs perform competitively on conceptual tasks, they remain substantially limited in tool-centric execution (e.g., 42.2% on Innovus script generation) and long-horizon, multi-stage reasoning. Our studies further show that human-skill-enhanced agentic workflows significantly improve end-to-end physical design performance. PDAGENT-BENCH establishes a standardized, reproducible, and realistic evaluation framework for advancing LLM/VLM-driven holistic physical design automation. We will open source the benchmark and framework soon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PDAGENT-BENCH is an early benchmark aimed at agentic LLM workflows for VLSI physical design, but the lack of detail on task curation makes the performance claims hard to evaluate.


The paper's main contribution is PDAGENT-BENCH, with 353 problems spanning conceptual questions and real industrial artifacts across five dimensions: foundational knowledge, report comprehension, root-cause analysis, script generation, and full-flow implementation. It also supplies a closed-loop agentic workflow framework for testing in actual EDA environments. The authors ran 11 models and report competitive results on conceptual tasks but only 42.2% on Innovus script generation, plus gains from human-skill-enhanced workflows.

This is new for the physical design side of VLSI, where prior LLM work has focused more on front-end design. The attempt to combine task-level and workflow-level evaluation with executable solutions is useful for a domain that needs standardized testing.

The soft spot is the validation. The abstract mentions expert-validated references but gives no numbers on inter-expert agreement, curation rules, or how the problem set matches real design distributions in cell counts, timing constraints, or tool patterns. Without that, the specific limitations reported could reflect selection choices rather than general model weaknesses.

The work targets researchers building or evaluating AI tools for chip physical design. Anyone tracking benchmarks for agentic systems in constrained engineering domains would find the setup worth examining.

It should go to peer review. The benchmark itself is a concrete step forward in an underexplored area, and the reported gaps in tool use and long-horizon reasoning are worth checking once the curation details are filled in.

Referee Report

2 major / 2 minor

Summary. The paper introduces PDAGENT-BENCH, a benchmark with 353 curated problems combining conceptual questions and real-world industrial artifacts for evaluating LLM/VLM agents on VLSI physical design. It defines five capability dimensions (foundational knowledge, report comprehension, root-cause analysis, script generation, full-flow implementation) and a unified agentic workflow framework for closed-loop EDA evaluation. Experiments on 11 models report competitive conceptual performance but limitations in tool-centric tasks (e.g., 42.2% on Innovus script generation) and long-horizon reasoning, with human-skill-enhanced workflows improving end-to-end results. The benchmark and framework are planned for open-sourcing.

Significance. If the benchmark's representativeness holds, the work would provide a valuable standardized, reproducible framework for assessing and advancing LLM-driven physical design automation, filling an underexplored gap in agentic EDA workflows and offering concrete metrics on current model limitations.

major comments (2)

[Benchmark construction and validation section] The central claims (e.g., 42.2% on Innovus script generation and benefits of human-enhanced workflows) rest on PDAGENT-BENCH accurately representing industrial VLSI challenges, yet the manuscript provides no quantitative evidence—such as distributions of cell counts, layer counts, timing constraints, or tool-call patterns—comparing the 353 problems to typical industrial flows. Expert validation alone does not address potential selection bias. (Benchmark construction and validation section; Abstract)
[Abstract and Experiments section] The abstract states 'expert-validated references' and reports specific scores, but supplies no details on curation criteria, inter-expert agreement, statistical significance testing, or exclusion rules. This undermines verifiability of the performance claims. (Abstract and Experiments section)
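
One way to supply the statistical context requested above is to report an interval around each success rate. The sketch below computes a Wilson score interval for a binomial proportion. The counts used in the example (19 passes out of 45 tasks) are hypothetical, chosen only because they round to 42.2 percent; the actual number of Innovus tasks is not stated here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 19 of 45 rounds to 42.2 percent.
low, high = wilson_interval(19, 45)
```

With these hypothetical counts the interval spans roughly 0.29 to 0.57, which illustrates why the size of each task subset matters when comparing models whose scores differ by only a few points.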

minor comments (2)

[Experiments section] Clarify whether the 11 models include both LLMs and VLMs and report per-model breakdowns for the five dimensions to strengthen the 'conceptual vs. tool-centric' distinction.
[Results and discussion] The claim of 'significantly improve end-to-end physical design performance' would benefit from explicit metrics (e.g., timing, power, area deltas) rather than qualitative description.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on benchmark validation and verifiability. We address the two major comments point by point below, indicating planned revisions where appropriate.


Referee: [Benchmark construction and validation section] The central claims (e.g., 42.2% on Innovus script generation and benefits of human-enhanced workflows) rest on PDAGENT-BENCH accurately representing industrial VLSI challenges, yet the manuscript provides no quantitative evidence—such as distributions of cell counts, layer counts, timing constraints, or tool-call patterns—comparing the 353 problems to typical industrial flows. Expert validation alone does not address potential selection bias. (Benchmark construction and validation section; Abstract)

Authors: We agree that quantitative comparisons would strengthen claims of representativeness. The 353 problems are derived from real-world industrial artifacts, with expert validation by VLSI physical design specialists to ensure relevance to industrial challenges. However, due to confidentiality constraints with industry partners, we cannot release full distributions of proprietary metrics such as exact cell counts or timing constraints across all problems. In the revised manuscript, we will add available non-confidential summary statistics on problem characteristics (e.g., ranges of design sizes and constraint types) and expand discussion of curation to mitigate selection bias concerns. revision: partial
Referee: [Abstract and Experiments section] The abstract states 'expert-validated references' and reports specific scores, but supplies no details on curation criteria, inter-expert agreement, statistical significance testing, or exclusion rules. This undermines verifiability of the performance claims. (Abstract and Experiments section)

Authors: We acknowledge that the current manuscript lacks sufficient detail on these aspects. The benchmark construction section describes expert validation, but we will revise both the abstract and the dedicated benchmark section to explicitly state curation criteria, report inter-expert agreement metrics (e.g., percentage agreement among validators), detail exclusion rules, and clarify that reported scores are empirical results on the fixed benchmark without additional statistical hypothesis testing, consistent with standard practice in benchmark papers. revision: yes
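
The agreement metrics mentioned in this response can be illustrated for the two-validator case. This is a sketch under stated assumptions, not the authors' procedure: the labels and validator data are invented, and Cohen's kappa is shown alongside raw percentage agreement because it corrects for agreement expected by chance.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Share of items on which two validators gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    observed = percent_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    chance = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if chance == 1.0:
        raise ValueError("kappa is undefined when both validators use one identical label")
    return (observed - chance) / (1.0 - chance)


# Invented labels for four problems, for illustration only.
validator_1 = ["accept", "accept", "reject", "accept"]
validator_2 = ["accept", "reject", "reject", "accept"]
agreement = percent_agreement(validator_1, validator_2)
kappa = cohens_kappa(validator_1, validator_2)
```

In this toy case the validators agree on three of four items, yet kappa is lower than the raw agreement because half of that agreement would be expected by chance given how often each validator uses each label.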

standing simulated objections not resolved

Full disclosure of quantitative distributions for all proprietary industrial metrics (e.g., exact cell counts, timing constraints) remains unavailable because of confidentiality agreements.

Circularity Check

0 steps flagged

No significant circularity


The paper introduces PDAGENT-BENCH as an external benchmark consisting of 353 curated problems with expert-validated references, then reports empirical performance of 11 external LLMs/VLMs on those tasks and on human-enhanced workflows. No derivation chain reduces a claimed result to a fitted parameter, self-defined quantity, or self-citation by construction; the evaluations are presented as measurements against independently defined tasks rather than tautological outputs of the benchmark's own construction rules. The central claims rest on observable model behaviors and workflow outcomes, not on any internal redefinition or renaming that would force the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the curated problems and expert-validated references form a realistic proxy for industrial EDA workflows; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The five capability dimensions comprehensively cover the requirements for LLM agents in physical design.
Stated when defining the benchmark suite contents.


[1] [1]

Abdelazeem

A. Abdelazeem. Systolic array implementation in RTL for TPU. https://github.com/ abdelazeem201/Systolic-array-implementation-in-RTL-for-TPU , 2021. Accessed: Apr. 2026

2021

[2] [2]

Introducing claude opus 4.7, 2026

Anthropic. Introducing claude opus 4.7, 2026. URL https://www.anthropic.com/news/ claude-opus-4-7

2026

[3] [3]

Introducing claude sonnet 4.6, 2026

Anthropic. Introducing claude sonnet 4.6, 2026. URL https://www.anthropic.com/news/ claude-sonnet-4-6

2026

[4] [4]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Innovus implementation system

Cadence. Innovus implementation system. URL https://www.cadence.com/en_US/home/ tools/digital-design-and-signoff/soc-implementation-and-floorplanning/ innovus-implementation-system.html

[6] [6]

C. Chen, X. Xiang, C. Liu, Y . Shang, R. Guo, D. Liu, Y . Lu, Z. Hao, J. Luo, Z. Chen, et al. Xuantie-910: A commercial multi-core 12-stage pipeline out-of-order 64-bit high performance risc-v processor with vector extension: Industrial product. In2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 52–64. IEEE, 2020

2020

[7] [7]

Cheng, L

Q. Cheng, L. Lin, M. Huang, Q. Li, Z. Yang, L. Dai, H. Yu, Y .-J. Chen, Y . Shi, and M. Hashimoto. A 13-34 tops/w edge-ai processor featuring booth-value-confined accelerator, near-memory 10 computing, and contiguity-aware mapping. In2024 IEEE Asian Solid-State Circuits Conference (A-SSCC), pages 1–3, 2024. doi: 10.1109/A-SSCC60305.2024.10849341

work page doi:10.1109/a-sscc60305.2024.10849341 2024

[8] [8]

Cheng, Q

Q. Cheng, Q. Li, W. Dong, M. Zhang, R. Zhang, M. Huang, H. Yu, Y . Shi, H. Awano, T. Sato, L. Lin, and M. Hashimoto. A 22nm resource-frugal hyper-heterogeneous multi-modal system- on-chip towards in-orbit computing. In2025 IEEE Custom Integrated Circuits Conference (CICC), pages 1–3, 2025. doi: 10.1109/CICC63670.2025.10983627

work page doi:10.1109/cicc63670.2025.10983627 2025

[9] [9]

First Direct Observation of Two Different Hydrogen- Related Processes Corresponding to the Negative VTH Shift Under PBTI Stress in IGZO Transistors by Pd Hydrogen Spillover,

Q. Cheng, Q. Li, Z. Yang, Z. Kong, G. Niu, Y . Liang, J. Li, J. H. Park, W. Liao, H. Awano, T. Sato, L. Lin, and M. Hashimoto. A radiation-hardened neuromorphic imager with self-healing spiking pixels and unified spiking neural network for space robotics. In2025 Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), pages 1–3, 2025. doi...

work page doi:10.23919/vlsitechnologyandcir65189.2025.11075180 2025

[10] [10]

Cheng, Z

Q. Cheng, Z. Yang, H. Li, Q. Li, Z. Kong, G. Niu, Y . Liang, J. Li, J. Yoo, M. Hashimoto, and L. Lin. A radiation-hardened self-healing cmos imager with online pixel/logic annealing and tile-adaptive compression for space applications. In2026 IEEE International Solid-State Circuits Conference (ISSCC), volume 69, pages 390–392, 2026. doi: 10.1109/ISSCC4966...

work page doi:10.1109/isscc49663 2026

[11] [11]

C.-T. Ho, H. Ren, and B. Khailany. Verilogcoder: Autonomous verilog coding agents with graph-based planning and abstract syntax tree (ast)-based waveform tracing tool. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 300–307, 2025

2025

[12] [12]

A. B. Kahng, J. Lienig, I. L. Markov, and J. Hu.VLSI physical design: from graph partitioning to timing closure, volume 312. Springer, 2011

2011

[13]

J. Kindér. AES: Verilog implementation of the advanced encryption standard (AES-256). https://github.com/secworks/aes, 2014. Accessed: Apr. 2026

[14]

L. Lavagno, L. Scheffer, and G. Martin. EDA for IC implementation, circuit design, and process technology. CRC Press, 2018

[15]

K. Liang. TinyRISC-V: A simple RISC-V core. https://github.com/liangkangnan/tinyriscv, 2020. Accessed: Apr. 2026

[16]

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[17]

M. Liu, T.-D. Ene, R. Kirby, C. Cheng, N. Pinckney, R. Liang, J. Alben, H. Anand, S. Banerjee, I. Bayraktaroglu, et al. Chipnemo: Domain-adapted llms for chip design. arXiv preprint arXiv:2311.00176, 2023

[18]

S. Liu, Y. Lu, W. Fang, M. Li, and Z. Xie. Openllm-rtl: Open dataset and benchmark for llm-aided design rtl generation, 2025. URL https://arxiv.org/abs/2503.15112

[19]

W.-H. Liu, S. Mantik, W.-K. Chow, Y. Ding, A. Farshidi, and G. Posser. Ispd 2019 initial detailed routing contest and benchmark with advanced routing rules. In Proceedings of the 2019 international symposium on physical design, pages 147–151, 2019

[20]

Y. Lu, S. Liu, Q. Zhang, and Z. Xie. Rtllm: An open-source benchmark for design rtl generation with large language model, 2023. URL https://arxiv.org/abs/2308.05345

[21]

NVIDIA. NVDLA: Open source deep learning accelerator. https://github.com/nvdla/hw,

[22]

OpenCores. Ethernet MAC 10/100 Mbps. https://opencores.org/projects/, 2001. Accessed: Apr. 2026

[23]

N. Pinckney, C. Batten, M. Liu, H. Ren, and B. Khailany. Revisiting verilogeval: Newer llms, in-context learning, and specification-to-rtl tasks. arXiv preprint arXiv:2408.11053, 2024

[24]

A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025.

[25]

Synopsys. Vc formal: Formal verification solution. URL https://www.synopsys.com/verification/static-and-formal-verification/vc-formal.html

[26]

Synopsys. Ic compiler ii: Place & route solution. URL https://www.synopsys.com/implementation-and-signoff/physical-implementation/ic-compiler.html

[27]

Synopsys. Primetime static timing analysis. URL https://www.synopsys.com/implementation-and-signoff/signoff/primetime.html

[28]

Ultra-Embedded. UART controller IP core. https://github.com/ultraembedded/cores,

[29]

M. Wang, Y. Wen, Y. Lu, F. Liu, Y. Zhao, B. Han, J. Mu, Y. Lin, R. Wang, B. Yu, et al. Circuitnet 3.0: A multi-modal dataset with task-oriented augmentation for ai-driven circuit design. In The Fourteenth International Conference on Learning Representations

[30]

Z. Wang, Z. Geng, Z. Tu, J. Wang, Y. Qian, Z. Xu, Z. Liu, S. Xu, Z. Tang, S. Kai, et al. Benchmarking end-to-end performance of ai-based chip placement algorithms. arXiv preprint arXiv:2407.15026, 2024

[31]

N. H. Weste and D. Harris. CMOS VLSI design: a circuits and systems perspective. Pearson Education India, 2015

[32]

H. Wu, Z. He, X. Zhang, X. Yao, S. Zheng, H. Zheng, and B. Yu. Chateda: A large language model powered autonomous agent for eda. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(10):3184–3197, 2024

[33]

N. Xu, Z. Zhang, S. Shu, L. Qi, J. Lv, W. Wang, T. Zhao, C. Zhang, Z. Yang, X. Li, et al. iscript: A domain-adapted large language model and benchmark for physical design tcl script generation. arXiv preprint arXiv:2603.04476, 2026

[34]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[35]

B. Yu. Machine learning in eda: When and how. In 2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD), pages 1–6. IEEE, 2023

[36]

G. Zhang, H. Geng, X. Yu, Z. Yin, Z. Zhang, Z. Tan, H. Zhou, Z. Li, X. Xue, Y. Li, et al. The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547, 2025

[37]

L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

[38]

Y. Zhu, D. Huang, H. Lyu, X. Zhang, C. Li, W. Shi, Y. Wu, J. Mu, J. Wang, Y. Zhao, et al. Qimeng-codev-r1: Reasoning-enhanced verilog generation. arXiv preprint arXiv:2505.24183, 2025.

[Figure: Difficulty Distribution (Easy 31% (28), Medium 39% (35), Hard 30% (27)) and number of questions per topic (Static Timing Analysis, Physical Design Floorplanning, PD Funda...)]

1. Technology file (.tf in Synopsys, .techlef in Cadence): describes units, drawing patterns, layers, design rules, vias, and parasitic R/C of the manufacturing process.

2. Physical libraries (.lef, .gds; or Synopsys .CEL, .FRAM): layout information and abstract models for placement and routing (pin accessibility, blockages, etc.).

3. Timing, logical, and power libraries (.lib, or LM view .db): timing and power information for all design elements.

4. TDF file (.tdf / .io): pad/pin arrangement (order and location); for full-chip flows also captures VDD/VSS pads and power-cut diodes not present in the Verilog netlist.

5. Constraints (.sdc): area, power, and timing constraints.

6. PDEF (optional): row and cell placement locations.

7. DEF (optional): row, cell, and pre-existing placement information.

Output dat...

Input Files: Technology & Libraries (3 pts). 1 pt each for: (a) technology file (.tf / .techlef) and its role in process rules and parasitics; (b) physical libraries (.lef / .gds / .cel / .fram) and their role in layout abstraction; (c) timing/power libraries (.lib / .db) and their role in timing/power characterization.

Input Files: Constraints & Optional Files (2 pts). 1 pt each for: (a) SDC constraints file (timing, area, power); (b) IO/TDF or DEF/PDEF for pad placement or pre-existing placement data.

Output Files: Timing & Parasitics (2 pts). 1 pt each for: (a) SDF file for post-layout timing delays; (b) SPEF or DSPF for extracted RC parasitics.

Output Files: Netlist, Layout & DEF (2 pts). 1 pt for the post-routed Verilog netlist (.v, flat or hierarchical); 0.5 pt for GDS (physical layout); 0.5 pt for DEF (final placement and routing data).

Clarity, Completeness & Technical Accuracy (1 pt). Answer is well-organized (inputs vs. outputs clearly separated), file extensions correctly associated with descriptions, and no significant factual errors or omissions of major file types.

Figure 12: Scoring rubric corresponding to Figure 11, used by the LLM-Judge to grade model responses.
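The point allocation in the Figure 12 rubric can be checked with a short sketch. The per-criterion caps below are taken from the rubric text. The dictionary keys, the `total_score` helper, and the rule that the final grade is a plain clamped sum are assumptions made for illustration, since this section does not state how the LLM-Judge aggregates points.

```python
# Hypothetical encoding of the Figure 12 rubric.
# Point caps come from the rubric text; key names and the aggregation
# rule (clamped sum) are illustrative assumptions.
RUBRIC_MAX_POINTS = {
    "input_technology_and_libraries": 3.0,
    "input_constraints_and_optional": 2.0,
    "output_timing_and_parasitics": 2.0,
    "output_netlist_layout_def": 2.0,
    "clarity_completeness_accuracy": 1.0,
}

# Maximum achievable score if every criterion is fully satisfied.
MAX_TOTAL = sum(RUBRIC_MAX_POINTS.values())


def total_score(awarded):
    """Sum awarded points, clamping each criterion to the range [0, cap].

    Criteria missing from `awarded` contribute zero points.
    """
    total = 0.0
    for criterion, cap in RUBRIC_MAX_POINTS.items():
        points = awarded.get(criterion, 0.0)
        total += min(max(points, 0.0), cap)
    return total


if __name__ == "__main__":
    # Example: all three library types explained, netlist plus GDS listed
    # (1 + 0.5 pt), nothing else credited.
    example = {
        "input_technology_and_libraries": 3.0,
        "output_netlist_layout_def": 1.5,
    }
    print(total_score(example), "out of", MAX_TOTAL)
```

Under these assumptions the five criteria add up to 10 points, and the half-point items for GDS and DEF are the only fractional credits in the rubric.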