pith. machine review for the scientific record.

arxiv: 2605.06936 · v1 · submitted 2026-05-07 · 💻 cs.AR · cs.AI · cs.MA

Recognition: 1 theorem link · Lean Theorem

Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.MA
keywords: LLM agents · Electronic Design Automation · DRC fixing · PPA optimization · Benchmark · Post-EDA tasks

The pith

LLM agents succeed on simple post-EDA tasks but reach only 36.66% and 20.00% success on practical DRC reasoning and multi-objective PPA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PostEDA-Bench to measure how well LLM-based agents can repair residual design rule violations and meet power-performance-area targets after standard EDA tools complete their runs. It organizes 145 tasks into four categories of increasing realism: essential DRC fixes, reasoning about DRC violations, single-objective PPA, and multi-objective PPA, all evaluated automatically through actual EDA toolchains. Tests across eight commercial and open-source LLMs under various agent setups show solid results on synthetic or single-goal cases but steep declines to a best success rate of 36.66% on DRC-Reasoning and 20.00% on PPA-Multi. Vision inputs improve DRC performance, while the main barrier in complex PPA turns out to be reasoning through trade-offs rather than memorizing parameter settings.

Core claim

PostEDA-Bench is a hierarchical benchmark with 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, supported by EDA toolchains that provide machine-checkable evaluation. Across eight LLMs and multiple agent scaffolds, agents handle synthetic DRC-Essential and single-objective PPA-Mono reasonably well but degrade sharply on the more practical DRC-Reasoning (best success rate 36.66%) and PPA-Multi (best success rate 20.00%). Vision augmentation consistently enhances DRC-Bench performance, and trade-off reasoning, rather than knob knowledge, is the dominant PPA-Multi bottleneck.

What carries the argument

PostEDA-Bench itself: a hierarchical benchmark of 145 tasks in four categories, with machine-checkable scoring via real EDA toolchains that directly measures agent success on DRC repair and PPA convergence.
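What machine-checkable scoring means in practice is easy to sketch. Below is a minimal, hypothetical evaluation driver in Python; the tool name, flags, and report grammar are illustrative stand-ins, not the paper's pinned harness:

```python
import re
import subprocess
from pathlib import Path

def count_drc_violations(layout: Path, report: Path) -> int:
    """Run a (hypothetical) batch DRC tool and count residual violations.

    Assumes the tool writes one "VIOLATION <rule> ..." line per finding;
    the real benchmark pins specific tool versions and report parsers.
    """
    subprocess.run(
        ["drc_tool", "--layout", str(layout), "--report", str(report)],
        check=True,
    )
    return len(re.findall(r"^VIOLATION\s+\S+", report.read_text(), re.MULTILINE))

def task_success(layout: Path, report: Path) -> bool:
    # Machine-checkable label: success iff zero violations remain.
    return count_drc_violations(layout, report) == 0
```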

If this is right

  • Agents must develop stronger contextual reasoning to identify and resolve DRC violations that require understanding design intent.
  • Multi-objective PPA work demands explicit mechanisms for weighing competing goals such as power versus timing (a minimal sketch of one such mechanism follows this list).
  • Vision inputs should be included by default in agent pipelines targeting DRC-related tasks.
  • Benchmarks that separate synthetic from practical subtasks can guide targeted improvements in agent design.
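One such mechanism, sketched minimally: weighted scalarization over target-normalized PPA metrics. The metric names, weights, and normalization are illustrative assumptions, not the paper's scoring protocol:

```python
def scalarize_ppa(metrics: dict[str, float],
                  targets: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Collapse competing PPA goals into one score an agent can compare.

    Dividing each metric by its target makes power, timing, and area
    dimensionless (lower is better); the weights make explicit the
    trade-off that agents must otherwise reason about implicitly.
    """
    return sum(
        weights[name] * (metrics[name] / targets[name])
        for name in targets
    )

# Example: a candidate that meets timing and area but overshoots power.
candidate = {"power_mw": 12.0, "period_ns": 0.9, "area_um2": 980.0}
targets = {"power_mw": 10.0, "period_ns": 1.0, "area_um2": 1000.0}
weights = {"power_mw": 0.5, "period_ns": 0.3, "area_um2": 0.2}
print(scalarize_ppa(candidate, targets, weights))
# ≈ 1.066; since the weights sum to 1, a score above 1.0 means the
# weighted aggregate is worse than target overall.
```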

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent architectures for hardware design could benefit from dedicated modules that simulate iterative trade-off analysis.
  • Extending the benchmark to earlier EDA stages such as placement or routing might expose similar reasoning bottlenecks.
  • Fully autonomous last-mile flows may still need hybrid setups that combine agents with classical optimization loops.
  • Comparable task hierarchies could be applied to other automated design domains to measure where current models fall short.

Load-bearing premise

The 145 tasks across the four categories in PostEDA-Bench are representative of real-world post-EDA challenges and provide a fair test of agent capabilities.

What would settle it

If a new agent method or model achieves a success rate above 70% on both the DRC-Reasoning and PPA-Multi task sets while using the same benchmark and evaluation toolchain, the claimed performance gaps would no longer hold.

Figures

Figures reproduced from arXiv: 2605.06936 by Caiwen Ding, Jinwei Tang, Nuo Xu, Pengju Liu, Yu Cao.

Figure 1. Overview of PostEDA-Bench composition.
Figure 2. Overview of PostEDA-Bench construction process.
Figure 3. Example prompts used to drive the agent in DRC-Bench and PPA-Bench.
Figure 4. Two-objective Pareto fronts for representative PPA-Multi designs. Gray dots denote non-Pareto solutions; blue stars indicate Pareto-optimal points. The combined panels summarize representative period–power and period–area trade-off fronts.
Figure 5. Effect of vision modality on DRC-Bench. SR and VRR are combined in each subfigure for four backbones under text-only and text+vision settings.
Figure 6. Effect of iteration cap and Reflexion on Gemma-4-31B-it. Left: DRC combines DRC-Essential and DRC-Reasoning. Right: PPA combines PPA-Mono and PPA-Multi. Colors encode metrics; solid lines use the left y-axis and dashed lines use the right y-axis.
Figure 7. Full PPA-Bench per-level performance. SR (top) and NIS (bottom) per construction-time level. Row 1 decomposes PPA-Mono into Performance / Power / Area; row 2 summarizes PPA-Mono per sub-dimension and reports PPA-Multi.
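The Pareto fronts in Figure 4 follow from a standard dominance test. A minimal sketch, assuming both objectives (e.g., clock period and power) are minimized:

```python
def pareto_front(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the Pareto-optimal subset when both objectives are minimized.

    A point is dominated if some other point is no worse on both
    objectives and differs from it (the gray dots in Figure 4);
    the survivors are the blue stars.
    """
    def dominated(p, q):
        return q[0] <= p[0] and q[1] <= p[1] and q != p

    return [p for p in points if not any(dominated(p, q) for q in points)]

# Example: (period_ns, power_mw) samples from an agent's exploration.
samples = [(1.0, 12.0), (0.9, 15.0), (1.1, 11.0), (1.0, 14.0)]
print(pareto_front(samples))  # [(1.0, 12.0), (0.9, 15.0), (1.1, 11.0)]
```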
read the original abstract

LLM-based agents are increasingly applied to the "last mile" of Electronic Design Automation (EDA): repairing residual sign-off Design Rule Check (DRC) violations and converging Power-Performance-Area (PPA) targets after tool runs. Existing EDA-LLM benchmarks, however, omit DRC fixing entirely and rely on flat hierarchies tied to a single toolchain. We introduce PostEDA-Bench, a hierarchical benchmark with 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, supported by EDA toolchains with machine-checkable evaluation. Across eight commercial and open-source LLMs under multiple agent scaffolds, we find that agents handle synthetic DRC-Essential and single-objective PPA-Mono reasonably well but degrade sharply on the more practical DRC-Reasoning, where the best success rate is 36.66%, and PPA-Multi, where the best success rate is 20.00%; vision augmentation consistently enhances DRC-Bench; and trade-off reasoning, rather than knob knowledge, is the dominant PPA-Multi bottleneck.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PostEDA-Bench, a hierarchical benchmark with 145 tasks across four categories (DRC-Essential, DRC-Reasoning, PPA-Mono, PPA-Multi) for evaluating LLM-based agents on post-EDA DRC fixing and PPA convergence. Supported by machine-checkable EDA toolchains, experiments across eight commercial and open-source LLMs and multiple agent scaffolds show agents handling synthetic/easy tasks reasonably well but degrading on practical ones, with best success rates of 36.66% on DRC-Reasoning and 20.00% on PPA-Multi; vision augmentation helps DRC tasks, and trade-off reasoning (not knob knowledge) is the main PPA-Multi bottleneck.

Significance. If the tasks prove representative, this benchmark fills a gap in EDA-LLM evaluation by focusing on the post-tool 'last mile' with objective metrics, highlighting current agent limitations and guiding targeted improvements in reasoning and multi-objective optimization for circuit design automation. The machine-checkable evaluation via EDA toolchains is a clear strength for reproducibility.

major comments (3)
  1. [§3] Benchmark Construction: The 145 tasks lack explicit details on sourcing (e.g., extraction from real tape-outs versus synthetic generation), diversity metrics across process nodes or design styles, and expert validation that violation patterns and trade-offs match industrial distributions. This is load-bearing for the central claims, as the reported sharp degradation (36.66% DRC-Reasoning, 20.00% PPA-Multi) and attribution to reasoning bottlenecks assume the tasks represent practical post-EDA challenges.
  2. [§4] Experiments and Results: Success rates are reported as single point estimates without statistical controls such as the number of independent runs per task, variance, confidence intervals, or significance testing. This undermines the reliability of the degradation observations and the conclusion that trade-off reasoning is the dominant bottleneck in PPA-Multi.
  3. [§4.2] Agent Scaffolds and Vision: The description of how vision augmentation is integrated and its consistent enhancement on DRC-Bench lacks ablation details on prompt engineering, image resolution, or failure modes, making it difficult to isolate whether gains stem from visual DRC pattern recognition or other factors.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction could more explicitly compare PostEDA-Bench to prior EDA-LLM benchmarks (e.g., in terms of hierarchy and machine-checkable metrics) to clarify novelty.
  2. [Figures and Tables] Figure captions and table headers should include more detail on the exact success-rate definitions (e.g., whether partial fixes count) to aid interpretation without referring to the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review of our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor while maintaining the integrity of our benchmark and experimental claims.

read point-by-point responses
  1. Referee: [§3] Benchmark Construction: The 145 tasks lack explicit details on sourcing (e.g., extraction from real tape-outs versus synthetic generation), diversity metrics across process nodes or design styles, and expert validation that violation patterns and trade-offs match industrial distributions. This is load-bearing for the central claims, as the reported sharp degradation (36.66% DRC-Reasoning, 20.00% PPA-Multi) and attribution to reasoning bottlenecks assume the tasks represent practical post-EDA challenges.

    Authors: We agree that additional transparency on benchmark construction would strengthen the paper. The tasks are synthetically generated using open-source PDKs and EDA toolchains to ensure full machine-checkability and reproducibility, with DRC-Essential tasks using basic single-violation patterns and DRC-Reasoning tasks incorporating multi-violation contextual patterns drawn from common post-EDA scenarios. In the revised manuscript, we will expand §3 with: explicit sourcing details (synthetic generation based on standard rule patterns rather than direct real tape-out extraction due to IP constraints); diversity metrics (coverage across process nodes via 7nm/14nm/45nm open PDK equivalents and design styles including combinational logic, sequential circuits, and arithmetic units); and validation notes (patterns curated to align with industrial distributions via internal EDA expert review). This supports our claims, as performance degradation tracks increasing task complexity rather than artificial simplicity. revision: yes

  2. Referee: [§4] Experiments and Results: Success rates are reported as single point estimates without statistical controls such as the number of independent runs per task, variance, confidence intervals, or significance testing. This undermines the reliability of the degradation observations and the conclusion that trade-off reasoning is the dominant bottleneck in PPA-Multi.

    Authors: The referee correctly notes the absence of statistical controls in the reported results. Each task-LLM-scaffold combination was evaluated once owing to the substantial compute required for LLM calls and full EDA tool flows. The degradation trends remain consistent across all eight models and scaffolds, and the trade-off reasoning bottleneck is evidenced by targeted ablations (separating knob selection from multi-objective optimization). In revision, we will add a limitations subsection discussing this and include variance estimates plus confidence intervals from three independent runs on a 20% random subset of tasks to quantify reliability without prohibitive cost (a minimal interval sketch follows this list). revision: partial

  3. Referee: [§4.2] Agent Scaffolds and Vision: The description of how vision augmentation is integrated and its consistent enhancement on DRC-Bench lacks ablation details on prompt engineering, image resolution, or failure modes, making it difficult to isolate whether gains stem from visual DRC pattern recognition or other factors.

    Authors: We will revise §4.2 to provide the requested details. Vision augmentation is integrated by rendering DRC violation maps as images and appending them to the agent prompt with a fixed template instructing the model to analyze spatial patterns. In the update, we will include ablations on prompt variants, image resolutions (tested at 256×256 and 512×512), and explicit failure-mode analysis (e.g., vision aids detection of dense local violations but offers limited help on global routing conflicts). These additions will clarify that gains primarily arise from improved visual pattern recognition rather than ancillary prompt effects. revision: yes
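The interval estimates proposed in response 2 are standard to compute. A minimal sketch using the Wilson score interval for a binomial success rate; the run counts below are illustrative, not the paper's:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a success rate over n task runs.

    Preferable to the normal approximation at the small n and low
    rates reported here (e.g., 20.00% on PPA-Multi).
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Example: 6 successes over 30 task runs, a 20.00% point estimate.
print(wilson_interval(6, 30))  # ≈ (0.095, 0.373)
```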

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct experimental results

full rationale

The paper creates PostEDA-Bench (145 tasks across four categories) and reports LLM/agent success rates from direct runs on commercial/open-source toolchains with machine-checkable evaluation. No equations, derivations, fitted parameters, predictions, or first-principles claims exist. Performance numbers (e.g., 36.66% on DRC-Reasoning, 20% on PPA-Multi) are measured outcomes, not reductions of prior definitions or self-citations. Representativeness of tasks is an external validity concern but does not create circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper's central contribution is the new benchmark and its empirical use; the main unverified premise is task representativeness. No free parameters or mathematical derivations are involved.

axioms (1)
  • domain assumption: The tasks defined in PostEDA-Bench are representative of real-world post-EDA DRC and PPA challenges.
    This assumption underpins the claim that the observed performance gaps reflect practical agent limitations.
invented entities (1)
  • PostEDA-Bench hierarchical benchmark · no independent evidence
    purpose: To provide standardized, machine-checkable tasks for evaluating LLM agents on DRC fixing and PPA convergence
    Newly introduced contribution with no independent evidence or external validation mentioned outside this work.

pith-pipeline@v0.9.0 · 5505 in / 1617 out tokens · 118556 ms · 2026-05-11T00:57:42.284462+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
