Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

Bingsheng He; Feng Yu; Hongshi Tan; Qingyun Zou; WengFai Wong

arxiv: 2605.15226 · v1 · pith:3BSYH5MLnew · submitted 2026-05-13 · 💻 cs.AR · cs.AI · cs.SE

Qingyun Zou, Feng Yu, Hongshi Tan, Bingsheng He, WengFai Wong

Pith reviewed 2026-05-19 17:44 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.SE

keywords agentic AI · hardware engineering · Verilog · benchmark · LLM agents · bug localization · EDA verification · module hierarchy

The pith

Software-tuned AI agents struggle with hardware engineering because bugs propagate through signal flows across instantiated modules rather than along call graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether agentic AI systems built for software engineering can handle realistic hardware engineering tasks by introducing Phoenix-bench. The benchmark consists of 511 verified instances from 114 GitHub repositories, each with the developer patch, testbenches, and a Docker-pinned EDA environment. Across the evaluated agents, resolved rate falls by roughly 37 to 58 percentage points relative to SWE-bench Verified. The paper attributes the drop to hardware bugs propagating to parallel modules through signal connections, with agents failing to trace back through the module instantiation hierarchy. One round of test-case feedback helps far more than an oracle that identifies the affected files.

Core claim

Software and hardware are fundamentally different engineering tasks: the same agent loses 37% to 58% from SWE-bench Verified to Phoenix-bench because hardware bugs propagate across parallel instantiated modules through signal flow rather than along a software-style call graph, and software-tuned agents stop at the symptom file instead of tracing back through the instantiation chain.
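To make the hierarchy argument concrete, the following minimal sketch walks from a symptom module back up its instantiation chain. The three-module design, the file names, and the regex-based parsing are illustrative assumptions, not part of Phoenix-bench or of any agent the paper evaluates; a real flow would use a proper Verilog front end, and parameterized instances (`#(...)`) are not handled here.

```python
import re
from collections import defaultdict

# Hypothetical three-level design, used only for illustration.
VERILOG_SOURCES = {
    "top.v": """
module top(input clk, input cfg_en);
  cache u_cache (.clk(clk), .cfg_en(cfg_en));
endmodule
""",
    "cache.v": """
module cache(input clk, input cfg_en);
  plru u_plru (.clk(clk), .updt(cfg_en));
endmodule
""",
    "plru.v": """
module plru(input clk, input updt);
endmodule
""",
}

# Rough patterns: a module definition, and "<module_type> <instance_name> (".
MODULE_RE = re.compile(r"\bmodule\s+(\w+)(.*?)\bendmodule\b", re.S)
INSTANCE_RE = re.compile(r"^\s*(\w+)\s+(\w+)\s*\(", re.M)


def build_parent_map(sources):
    """Return (parents, defined_in): who instantiates each module, and its file."""
    defined_in, bodies = {}, {}
    for path, text in sources.items():
        for name, body in MODULE_RE.findall(text):
            defined_in[name] = path
            bodies[name] = body
    parents = defaultdict(set)
    for parent, body in bodies.items():
        for child_type, _instance in INSTANCE_RE.findall(body):
            # Only count identifiers that are known module types.
            if child_type in defined_in:
                parents[child_type].add(parent)
    return parents, defined_in


def files_up_the_chain(symptom_module, parents, defined_in):
    """Breadth-first walk from the symptom module to all of its ancestors."""
    order, seen, queue = [], {symptom_module}, [symptom_module]
    while queue:
        module = queue.pop(0)
        order.append(defined_in[module])
        for parent in sorted(parents.get(module, ())):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


parents, defined_in = build_parent_map(VERILOG_SOURCES)
print(files_up_the_chain("plru", parents, defined_in))
```

An agent that stops at the symptom file would inspect only the first entry of this list; the remaining entries are the files a port or signal change may also need to touch.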

What carries the argument

Phoenix-bench, a synchronized corpus of 511 Verilator instances from 114 GitHub repositories each shipped with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment.
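The page does not spell out the exact resolution rule, so the sketch below assumes the usual SWE-bench-style reading of the two testbench sets: an instance counts as resolved only if every fail-to-pass testbench now passes and every pass-to-pass testbench still passes. The `BenchRun` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class BenchRun:
    """Outcome of one testbench after the agent's patch is applied (hypothetical type)."""
    name: str
    passed: bool


def is_resolved(fail_to_pass: Sequence[BenchRun], pass_to_pass: Iterable[BenchRun]) -> bool:
    """Assumed criterion: all fail-to-pass benches pass and no pass-to-pass bench regresses."""
    if not fail_to_pass:
        # Without a fail-to-pass bench there is nothing that demonstrates the fix.
        return False
    return all(r.passed for r in fail_to_pass) and all(r.passed for r in pass_to_pass)


def resolved_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of instances resolved; 0.0 for an empty list."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


fixed = is_resolved([BenchRun("tb_fsm", True)], [BenchRun("tb_reset", True)])
regressed = is_resolved([BenchRun("tb_fsm", True)], [BenchRun("tb_reset", False)])
print(fixed, regressed, resolved_rate([fixed, regressed]))
```

Under this reading, a patch that fixes the target behavior but breaks a previously passing testbench is not counted as resolved.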

Load-bearing premise

The 511 instances drawn from 114 GitHub repositories, together with their developer patches and testbenches, form a representative sample of real-world hardware engineering work that requires repository navigation, hierarchy-aware localization, EDA verification, and multi-file patching.

What would settle it

An experiment in which agents equipped with explicit signal-flow tracing tools reach resolved rates on Phoenix-bench within 10 percent of their SWE-bench Verified scores would indicate that the performance gap is due to missing hierarchy awareness; a gap that persists despite such tools would point to other causes.

Figures

Figures reproduced from arXiv: 2605.15226 by Bingsheng He, Feng Yu, Hongshi Tan, Qingyun Zou, WengFai Wong.

**Figure 2.** Phoenix-bench construction pipeline, from GitHub crawl to verified Docker-based instances.

**Figure 3.** ASIC/FPGA design flow mapped to Phoenix-bench issue categories (left) and the 511-

**Figure 5.** Total token consumption of open-source agents across the 511 Phoenix-bench instances (broken x-axis). Plot text extracted alongside this caption gives resolved rate (%) on SWE-bench Verified → Phoenix-bench, in extraction order: OpenHands Qwen3-Coder-480B 69.6 → 32.3 (−37.3 pp); mini-SWE GPT-5.2 (high) 72.8 → 14.5 (−58.3 pp); mini-SWE Gemini-3-Pro 69.6 → 13.3 (−56.3 pp); mini-SWE DeepSeek-V3.2 60.0 → 8.0 (−52.0 pp).

**Figure 7.** Failure distribution on 511 cases, by issue category (pie) and three-stage taxonomy.

**Figure 8.** Per-category failure breakdown into fine-grained subcategories, for (a) Claude Code and (b)

**Figure 9.** (a) Resolved rate by patch-complexity tier (b) Resolved rate without and with file-level

**Figure 10.** Case study on hpdcache_pr33: a cross-module signal-propagation issue that requires wiring cfg_prefetch_updt_plru through the hierarchy. … new signal is not yet mentioned. Even oracle file localization (§5.4) is insufficient here because the agent must still construct the port chain rather than merely identify the file set. This case demonstrates why realistic hardware issue resolution requires understanding…
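The plot text attached to Figure 5 states the gap in percentage points (pp), while the abstract writes "37% to 58%". The short sketch below uses only the eight resolved-rate values from that plot text; pairing them with agent labels in extraction order is an assumption. It prints both the absolute gap and the relative drop so the two readings can be compared.

```python
# (label, SWE-bench Verified %, Phoenix-bench %), in the order the values
# appear in the extracted plot text. The label pairing is an assumption.
SCORES = [
    ("OpenHands Qwen3-Coder-480B", 69.6, 32.3),
    ("mini-SWE GPT-5.2 (high)", 72.8, 14.5),
    ("mini-SWE Gemini-3-Pro", 69.6, 13.3),
    ("mini-SWE DeepSeek-V3.2", 60.0, 8.0),
]


def drops(swe: float, phoenix: float) -> tuple:
    """Return (absolute gap in percentage points, relative drop in percent)."""
    gap_pp = swe - phoenix
    relative = 100.0 * gap_pp / swe
    return round(gap_pp, 1), round(relative, 1)


for label, swe, phoenix in SCORES:
    gap_pp, relative = drops(swe, phoenix)
    print(f"{label}: {gap_pp} pp absolute, {relative}% relative")
```

The absolute gaps reproduce the 37.3 to 58.3 pp range in the plot text, which suggests the abstract's "37% to 58%" refers to percentage points; the relative drops are larger because each is divided by a baseline below 100%.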

Original abstract

We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks but none jointly requires repository navigation, hierarchy-aware localization, Electronic Design Automation (EDA) executable verification, and maintenance-style patching. We introduce Phoenix-bench, a synchronized corpus of 511 verified Verilator instances from 114 GitHub repositories, each shipped with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment so resolved-rate differences reflect agent behavior rather than toolchain availability. Using Phoenix-bench we run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones, plus two diagnostic interventions (file-level oracle localization and one round of testbench-log feedback). Three findings emerge. (i) Software and hardware are fundamentally different engineering tasks: the same agent loses 37% to 58% from SWE-bench Verified to Phoenix-bench because hardware bugs propagate across parallel instantiated modules through signal flow rather than along a software-style call graph, and software-tuned agents stop at the symptom file instead of tracing back through the instantiation chain. (ii) Failures concentrate on design control-flow / finite state machine (FSM) bugs, verification testbench bugs, and hard cases that demand cross-hierarchy signal-flow tracking and coordinated multi-file edits. (iii) Localization granularity matters far more than localization itself: a perfect file-level oracle yields only +1.4% because the agent then breaks files that did not need editing, while a single round of test case feedback lifts resolved rate by 42% to 45% because the test case tells *where* the bug is and *what* the fix has to look like.
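The "one round of testbench-log feedback" intervention can be pictured as a small loop. This is a sketch under stated assumptions, not the paper's harness: `propose_patch` and `run_testbench` are hypothetical callables standing in for the agent and the pinned EDA environment, and the stub behavior at the bottom is invented purely to exercise the loop.

```python
from typing import Callable, Optional, Tuple

Patch = str


def evaluate_with_feedback(
    propose_patch: Callable[[str, Optional[str]], Patch],
    run_testbench: Callable[[Patch], Tuple[bool, str]],
    issue: str,
    feedback_rounds: int = 1,
) -> Tuple[bool, int]:
    """Return (resolved, attempts used).

    The first attempt sees only the issue text. Each feedback round hands the
    previous testbench log back to the agent before it proposes a new patch.
    """
    log: Optional[str] = None
    for attempt in range(feedback_rounds + 1):
        patch = propose_patch(issue, log)
        passed, log = run_testbench(patch)
        if passed:
            return True, attempt + 1
    return False, feedback_rounds + 1


# Stub agent: only finds the fix once it has seen a failing log.
def stub_agent(issue: str, log: Optional[str]) -> Patch:
    return "good-patch" if log else "bad-patch"


# Stub testbench: passes only for the good patch.
def stub_testbench(patch: Patch) -> Tuple[bool, str]:
    ok = patch == "good-patch"
    return ok, "PASS" if ok else "FAIL: state stuck in IDLE"


print(evaluate_with_feedback(stub_agent, stub_testbench, "fsm bug", feedback_rounds=0))
print(evaluate_with_feedback(stub_agent, stub_testbench, "fsm bug", feedback_rounds=1))
```

Setting `feedback_rounds=0` gives the no-feedback baseline, so the same function covers both arms of the comparison.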

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Phoenix-bench shows a measurable gap for software agents on hardware tasks with useful diagnostics on what helps, but the sample's representativeness is the key open question for the bigger claims.

Hi, the main thing here is that Phoenix-bench documents a 37-58% drop for the same agents moving from SWE-bench Verified to hardware instances, with failures clustering on FSM bugs and cross-hierarchy signal tracking, and one round of test feedback lifting resolved rates by 42-45% while a file oracle barely moves the needle.

The paper introduces a new corpus of 511 Verilator instances drawn from 114 GitHub repositories, each packaged with the original developer patch, fail-to-pass and pass-to-pass testbenches, design-flow labels, and a pinned Docker EDA environment. This setup forces agents to handle repository navigation, hierarchy-aware localization, executable verification, and multi-file patching in one go, which is more integrated than earlier hardware LLM benchmarks that split those pieces apart. They run a uniform sweep across four commercial agents and eight open-source structures on multiple LLM backbones, plus the two clean interventions, and the numbers on failure modes and differential impact are straightforward to read. The comparison stays grounded because it uses external real patches and an independent benchmark rather than fitted parameters.

The soft spot is exactly the one the stress-test flags: whether the 114 repositories and their 511 instances capture the range of scales, hierarchy depths, and dependency patterns that matter in broader hardware work. GitHub sourcing with developer patches is a reasonable starting point and avoids circularity, but without explicit coverage checks or selection criteria in the methods the performance gap and the signal-flow versus call-graph story could partly reflect how the corpus was assembled. The abstract promises details on prompts, categorizations, and exclusions that need to be fully visible for readers to judge.

This is aimed at people building or evaluating agentic systems for code and hardware design. Anyone who cares about benchmarks that combine multiple real engineering steps will get concrete data and intervention results from it.

The new corpus and the quantified diagnostics are solid enough to deserve a serious referee, even if the generalizability argument needs tightening. I would send it for peer review after they expand the corpus construction section and add any available checks on design characteristics.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Phoenix-bench, a corpus of 511 verified Verilator instances drawn from 114 GitHub repositories, each accompanied by developer patches, fail-to-pass and pass-to-pass testbenches, design-flow labels, and a Docker-pinned EDA environment. It evaluates four commercial agents and eight open-source agentic structures across multiple LLM backbones on tasks requiring repository navigation, hierarchy-aware localization, EDA verification, and multi-file patching. Key claims include a 37-58% performance drop relative to SWE-bench Verified due to hardware-specific signal-flow propagation across parallel modules versus software call graphs, concentration of failures on FSM/control-flow and testbench bugs, and the observation that a single round of testbench-log feedback yields a 42-45% lift while a perfect file-level oracle yields only +1.4%.

Significance. If the results hold, the work provides a valuable, reproducible benchmark that isolates agent behavior from toolchain variability through pinned environments and synchronized developer patches. It offers concrete evidence that software-tuned agents struggle with hardware-specific challenges such as tracing instantiation chains and coordinated multi-file edits, which could inform the design of hierarchy-aware agent architectures. The diagnostic interventions (oracle localization and test feedback) supply actionable insights into performance bottlenecks.

major comments (2)

[Abstract and §4 (Evaluation)] The central claims rest on reported resolved-rate drops of 37% to 58% and a 42-45% lift from test feedback, yet the manuscript does not provide sufficient detail on agent prompts, exact failure categorization criteria, or data exclusion rules. Without these, it is impossible to determine whether the performance gap and failure-mode concentrations reflect intrinsic task differences or post-hoc selection effects in the 511 instances.
[§3 (Benchmark Construction)] The claim that software and hardware are fundamentally different engineering tasks depends on Phoenix-bench being representative of real-world hardware work involving repository navigation, hierarchy-aware localization, and cross-module signal flow. The selection of 511 instances from 114 repositories lacks explicit statistics or justification regarding coverage of typical design scales, hierarchy depths, FSM prevalence, or cross-module dependency patterns, which risks the observed gap being a benchmark-construction artifact rather than a general property.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief table summarizing the four commercial and eight open-source agents evaluated, including their backbone LLMs, to improve immediate readability.
[Figures] Figure captions for performance comparison plots should explicitly state the number of runs or variance measures used to generate the resolved-rate bars.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and indicate revisions to strengthen transparency and justification of our claims.

Referee: [Abstract and §4 (Evaluation)] The central claims rest on reported resolved-rate drops of 37% to 58% and a 42-45% lift from test feedback, yet the manuscript does not provide sufficient detail on agent prompts, exact failure categorization criteria, or data exclusion rules. Without these, it is impossible to determine whether the performance gap and failure-mode concentrations reflect intrinsic task differences or post-hoc selection effects in the 511 instances.

Authors: We agree that expanded details will improve reproducibility. Section 4 describes the uniform evaluation protocol applied to all agents and backbones, with task instructions and environment access held constant. Prompts are summarized in the appendix but will be moved to the main text with full templates and variations. Failure categorization followed a taxonomy based on Verilator error logs and patch diffs: FSM/control-flow bugs (state transition errors), testbench bugs (assertion or stimulus issues), and cross-hierarchy signal-flow bugs (instantiation chain tracing failures). Data exclusion rules required each instance to have both a failing pre-patch testbench and a passing post-patch testbench, plus compatibility with the pinned Docker EDA flow; no instances were dropped post-evaluation. In revision we will add an explicit subsection with categorization examples and the full exclusion list. These additions will allow readers to evaluate whether gaps arise from task differences, which our diagnostic results (testbench feedback lift vs. minimal oracle gain) support as intrinsic to hardware signal propagation rather than selection artifacts.
revision: yes

Referee: [§3 (Benchmark Construction)] The claim that software and hardware are fundamentally different engineering tasks depends on Phoenix-bench being representative of real-world hardware work involving repository navigation, hierarchy-aware localization, and cross-module signal flow. The selection of 511 instances from 114 repositories lacks explicit statistics or justification regarding coverage of typical design scales, hierarchy depths, FSM prevalence, or cross-module dependency patterns, which risks the observed gap being a benchmark-construction artifact rather than a general property.

Authors: We acknowledge the value of additional statistics for demonstrating representativeness. Section 3 explains the collection from 114 GitHub repositories selected for active Verilator-based CI and availability of developer patches addressing real bugs. Table 1 reports aggregate metrics including average module count and file numbers per instance. In the revision we will add a new table and accompanying text with distributions: hierarchy depths (mean 4.2 levels, range 2-9), FSM prevalence (identified in 58% of instances via keyword and structural analysis), and cross-module signal dependencies (average fanout of 3.1 signals per module). Selection was justified by focusing on open-source hardware projects that require the same repository navigation and multi-file maintenance as industrial flows. While Phoenix-bench does not exhaustively sample every possible ASIC or FPGA design, the consistent 37-58% drop across diverse agents, coupled with failure modes centered on signal-flow tracing absent from software call graphs, indicates the performance difference is a property of the task rather than an artifact of instance selection. We will also add a limitations paragraph on coverage.
revision: partial
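The hierarchy-depth statistic mentioned in the second response could be computed along the following lines. This is a sketch only: the page does not say how depth is measured, so the convention here (a design with no sub-instances has depth 1) and the example module names are assumptions.

```python
from typing import Dict, FrozenSet, Sequence


def hierarchy_depth(children: Dict[str, Sequence[str]], top: str) -> int:
    """Number of levels on the longest instantiation path starting at `top`.

    `children` maps a module name to the module types it instantiates.
    Assumed convention: a module with no sub-instances has depth 1.
    """
    def depth(module: str, path: FrozenSet[str]) -> int:
        if module in path:
            raise ValueError(f"recursive instantiation through {module!r}")
        subs = children.get(module, ())
        return 1 + max((depth(c, path | {module}) for c in subs), default=0)

    return depth(top, frozenset())


# Hypothetical design: soc -> core -> alu -> adder is the longest chain.
example = {
    "soc": ["core", "uart"],
    "core": ["alu", "regfile"],
    "alu": ["adder"],
}
print(hierarchy_depth(example, "soc"))
```

Reporting the distribution of this value across instances, alongside the per-instance module counts, would give readers the coverage check the referee asks for.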

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper is an empirical benchmark study that directly measures agent resolved rates on Phoenix-bench (511 instances from 114 GitHub repositories with independent developer patches and testbenches) and compares them to the external SWE-bench Verified. The central claim of fundamental task differences is supported by these observed performance gaps and failure-mode analysis rather than any equations, fitted parameters, or self-referential definitions. No load-bearing steps reduce by construction to the paper's own inputs; the evaluation uses real external data and toolchain-pinned environments, making the reported differences falsifiable outside the benchmark construction itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that the chosen repositories and patches capture typical hardware engineering difficulty without artificial simplification, and that resolved-rate differences are caused by agent behavior rather than toolchain or testbench artifacts.

axioms (1)

domain assumption: The 511 Verilator instances from 114 GitHub repositories are representative of realistic hardware engineering tasks that require repository navigation, hierarchy-aware localization, EDA executable verification, and maintenance-style patching.
This premise is required to interpret the performance gaps as evidence that software agents do not transfer to hardware.

invented entities (1)

Phoenix-bench (no independent evidence)
purpose: A synchronized corpus of hardware design instances with patches, testbenches, and pinned EDA environments for agent evaluation.
The benchmark is newly constructed for this paper; no external independent verification of its representativeness is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: hardware bugs propagate across parallel instantiated modules through signal flow rather than along a software-style call graph

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Localization granularity matters far more than localization itself

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Benchmarking Large Language Models for Automated Verilog

Thakur, Shailja and Ahmad, Baleegh and Fan, Zhenxing and Pearce, Hammond and Tan, Benjamin and Karri, Ramesh and Dolan-Gavitt, Brendan and Garg, Siddharth , booktitle=. Benchmarking Large Language Models for Automated Verilog. 2023 , organization=

work page 2023

[2] [2]

2024 , publisher=

Thakur, Shailja and Ahmad, Baleegh and Pearce, Hammond and Tan, Benjamin and Dolan-Gavitt, Brendan and Karri, Ramesh and Garg, Siddharth , journal=. 2024 , publisher=

work page 2024

[3] [3]

2023 , organization=

Liu, Mingjie and Pinckney, Nathaniel and Khailany, Brucek and Ren, Haoxing , booktitle=. 2023 , organization=

work page 2023

[4] [4]

Location is Key: Leveraging

Yao, Bingkun and Wang, Ning and Zhou, Jie and Wang, Xi and Gao, Hong and Jiang, Zhe and Guan, Nan , booktitle=. Location is Key: Leveraging. 2025 , organization=

work page 2025

[5] [5]

Insights from rights and wrongs: A large language model for solving assertion failures in rtl design,

Insights from rights and wrongs: A large language model for solving assertion failures in rtl design , author=. arXiv preprint arXiv:2503.04057 , year=

work page arXiv

[6] [6]

2025 IEEE International Conference on LLM-Aided Design (ICLAD) , pages=

Large language model for verilog generation with code-structure-guided reinforcement learning , author=. 2025 IEEE International Conference on LLM-Aided Design (ICLAD) , pages=. 2025 , organization=

work page 2025

[7] [7]

Proceedings of the Great Lakes Symposium on VLSI 2025 , pages=

HWFixBench: Benchmarking Tools for Hardware Understanding and Fault Repair , author=. Proceedings of the Great Lakes Symposium on VLSI 2025 , pages=. 2025 , publisher=

work page 2025

[8] [8]

2023 , volume=

Jimenez, Carlos E and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal=. 2023 , volume=

work page 2023

[9] SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner. arXiv preprint arXiv:2506.09003.

[10] Swe-perf: Can language models optimize code performance on real-world repositories? arXiv preprint arXiv:2507.12415, 2025.

[11] Rtllm: An open-source benchmark for design rtl generation with large language model. 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), 2024.

[12] Pyhdl-eval: An llm evaluation framework for hardware design using python-embedded dsls. Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD, 2024.

[13] HDLEval benchmarking LLMs for multiple HDLs. 2024 IEEE LLM Aided Design Workshop (LAD), 2024.

[14] Mg-verilog: Multi-grained dataset towards enhanced llm-assisted verilog generation. 2024 IEEE LLM Aided Design Workshop (LAD), 2024.

[15] AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models. arXiv preprint arXiv:2506.11110.

[16] SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. Advances in Neural Information Processing Systems.

[17] Magis: Llm-based multi-agent framework for github issue resolution. Advances in Neural Information Processing Systems.

[18] OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv preprint arXiv:2407.16741.

[19] Agentless: Demystifying LLM-based Software Engineering Agents. arXiv preprint arXiv:2407.01489.

[20] Xie, Chengxing and Li, Bowen and Gao, Chang and Du, He and Lam, Wai and Zou, Difan and Chen, Kai. 2025.

[21] GPT 5.2 System Card. 2025.

[22] 2025, November.

[23] 2025.

[24] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. arXiv preprint arXiv:2512.02556.

[25] Kimi K2: Open Agentic Intelligence. arXiv preprint arXiv:2507.20534.

[26] Qwen3 Technical Report. 2025.

[27] Training Software Engineering Agents and Verifiers with SWE-Gym. arXiv preprint arXiv:2412.21139.

[28] Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186.

[29] Repofuse: Repository-level code completion with fused dual context. arXiv preprint arXiv:2402.14323.

[30] DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196.

[31] Graphcoder: Enhancing repository-level code completion via code context graph-based retrieval and language model. arXiv preprint arXiv:2406.07003.

[32] RepoChat: An LLM-Powered Chatbot for GitHub Repository Question-Answering. 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), 2025.

[33] Truong, Weixin Liang, Fan-Yun Sun, and Nick Haber. ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code. arXiv preprint arXiv:2506.02314.

[34] A unified debugging approach via llm-based multi-agent synergy. arXiv preprint arXiv:2404.17153.

[35] MHRC-Bench: A Multilingual Hardware Repository-Level Code Completion Benchmark. arXiv preprint arXiv:2601.03708.

[36] HLS-Eval: A Benchmark and Framework for Evaluating LLMs on High-Level Synthesis Design Tasks. arXiv preprint arXiv:2504.12268.

[37] Tsai, Yun-Da and Liu, Mingjie and Ren, Haoxing. 2024.

[38] Mu, Fangwen and Wang, Junjie and Shi, Lin and Wang, Song and Li, Shoubin and Wang, Qing. 2025.

[39] Aggarwal, Vaibhav and Kamal, Ojasv and Japesh, Abhinav and Jin, Zhijing and Sch. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL).

[40] mini-SWE-agent: The 100-Line. 2025.

[41] Enhancing repository-level software repair via repository-aware knowledge graphs. arXiv preprint arXiv:2503.21710.

[42] Lingxi: Open-Source Multi-Agent Framework for Repository-Level Issue Resolution. 2025.

[43] Gauthier, Paul. Aider:

[44] Hdldebugger: Streamlining hdl debugging with large language models. ACM Transactions on Design Automation of Electronic Systems, 2025.

[45] Fixing Hardware Security Bugs with Large Language Models. arXiv preprint arXiv:2302.01215.

[46] Dong Chen and Shaoxin Lin and Muhan Zeng and Daoguang Zan and Jian-Gang Wang and Anton Cheshkov and Jun Sun and Hao Yu and Guoliang Dong and Artem Aliev and Jie Wang and Xiao Cheng and Guangtai Liang and Yuchi Ma and Pan Bian and Tao Xie and Qianxiang Wang. 2024.

[47] Assertllm: Generating and evaluating hardware verification assertions from design specifications via multi-llms. arXiv preprint arXiv:2402.00386.

[48] 2026.

[49] Digital Integrated Circuits: A Design Perspective. 2002.

[50] Introducing. 2024.