ChipLingo: A Systematic Training Framework for Large Language Models in EDA
Pith reviewed 2026-05-07 09:36 UTC · model grok-4.3
The pith
A three-stage training pipeline adapts large language models to electronic design automation by curating domain data, performing adaptive pretraining, and aligning instructions to retrieval-augmented generation scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ChipLingo demonstrates that a systematic three-stage pipeline—domain corpus construction via multi-source curation and QA augmentation, domain-adaptive pretraining with comparisons of parameter training strategies, and instruction alignment that includes RAG scenario training under varied retrieval conditions—produces LLMs with substantially improved performance on representative EDA tool tasks, as measured by 59.7% accuracy for the 8B variant and 70.02% for the 32B variant on the curated EDA-Bench.
What carries the argument
The ChipLingo three-stage pipeline, which couples multi-source data curation with QA augmentation, partial-parameter domain-adaptive pretraining, and explicit retrieval-augmented generation scenario instruction tuning.
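The three stages compose sequentially, each consuming the previous stage's output. A minimal sketch of that flow, in which every function name, field, and data shape is hypothetical rather than the authors' implementation:

```python
# Hypothetical sketch of the ChipLingo three-stage pipeline described in the
# paper. Names and structures are illustrative, not the authors' API.

def build_corpus(raw_sources):
    """Stage 1: multi-source curation plus QA augmentation."""
    curated = [doc.strip() for doc in raw_sources if doc.strip()]
    qa_pairs = [{"q": f"What does this cover? {d[:30]}", "a": d} for d in curated]
    return curated + [f"Q: {p['q']} A: {p['a']}" for p in qa_pairs]

def adaptive_pretrain(model, corpus, strategy="partial_ft"):
    """Stage 2: domain-adaptive pretraining; `strategy` selects which
    parameters are updated (e.g. partial fine-tuning vs. LoRA)."""
    model["seen_tokens"] = model.get("seen_tokens", 0) + sum(len(d.split()) for d in corpus)
    model["strategy"] = strategy
    return model

def instruction_align(model, rag_conditions):
    """Stage 3: instruction alignment with RAG scenarios covering
    diverse retrieval conditions (relevant, noisy, empty contexts)."""
    model["rag_conditions"] = sorted(rag_conditions)
    return model

model = {"name": "base-8b"}
corpus = build_corpus(["Tool X timing report syntax.", "  ", "Tool Y placement flow."])
model = adaptive_pretrain(model, corpus, strategy="partial_ft")
model = instruction_align(model, ["relevant", "noisy", "empty"])
```

The point of the sketch is the ordering constraint the paper emphasizes: corpus construction precedes pretraining, and RAG conditioning happens only at the alignment stage.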
If this is right
- QA augmentation during corpus construction measurably improves domain task performance.
- Partial fine-tuning achieves a better balance between domain adaptation and retention of general capabilities than LoRA.
- Explicit training on diverse RAG retrieval conditions prevents the drop in retrieval utilization that otherwise follows domain pretraining.
- The resulting models provide a practical foundation for future EDA agents and external-knowledge-driven systems.
- Systematic domain training delivers concrete value on knowledge-intensive EDA tasks.
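The Partial FT vs. LoRA contrast in the bullets above comes down to which parameters receive gradient updates. A toy sketch of the two selection rules, where the layer naming and the last-8-layers cutoff are hypothetical choices, not taken from the paper:

```python
# Illustrative contrast between partial fine-tuning and LoRA-style adaptation
# over a toy parameter list. Layer names and the cutoff are hypothetical.

def trainable_params(param_names, mode, n_layers=32, partial_last=8):
    if mode == "partial_ft":
        # Update only the last `partial_last` transformer blocks in place;
        # everything earlier stays frozen at its pretrained values.
        cutoff = n_layers - partial_last
        return {p for p in param_names
                if p.startswith("layers.") and int(p.split(".")[1]) >= cutoff}
    if mode == "lora":
        # Freeze every base weight; train only injected low-rank adapters.
        return ({p + ".lora_A" for p in param_names}
                | {p + ".lora_B" for p in param_names})
    raise ValueError(mode)

names = [f"layers.{i}.attn.q_proj" for i in range(32)]
partial = trainable_params(names, "partial_ft")
lora = trainable_params(names, "lora")
```

Partial FT changes a subset of the original weights directly, while LoRA leaves them all intact and learns additive low-rank corrections, which is why the two trade off adaptation depth against retention differently.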
Where Pith is reading between the lines
- Releasing EDA-Bench publicly would enable independent replication and comparison against other domain-adaptation methods on the same tasks.
- The same staged pipeline could be tested on adjacent engineering domains that rely on tool documentation and retrieval, such as verification or physical design.
- Further scaling the model size within the same pipeline might narrow the remaining gap to leading closed-source models without additional data changes.
- Ablation studies that isolate each stage on the public benchmark would clarify which component contributes most to the observed accuracy lift.
Load-bearing premise
Performance gains arise specifically from the ordered three-stage pipeline rather than from data volume, model scale, or unstated implementation choices, and the internal EDA-Bench accurately captures real-world EDA tool scenarios.
What would settle it
Re-evaluating the trained models on an independent, publicly released set of EDA tool interaction traces or open EDA datasets and observing no gain over the base models of the same size would falsify the claim that the pipeline itself supplies the domain improvement.
Original abstract
With the rapid advancement of semiconductor technology, Electronic Design Automation (EDA) has become an increasingly knowledge-intensive and document-driven engineering domain. Although large language models (LLMs) have shown strong general capabilities, applying them directly to EDA remains challenging due to limited domain expertise, cross-tool knowledge confusion, and degraded retrieval-augmented generation (RAG) performance after domain training. To address these issues, this paper presents ChipLingo, a systematic training pipeline for domain-adapted LLMs tailored to EDA scenarios. ChipLingo consists of three stages: domain corpus construction with multi-source data curation and QA augmentation, domain-adaptive pretraining with comparisons of different parameter training strategies, and instruction alignment with RAG scenario training under diverse retrieval conditions. We also curate an internal benchmark, EDA-Bench, covering representative EDA tool scenarios, with plans for public release. Experiments show that ChipLingo-8B achieves 59.7% accuracy on EDA-Bench, outperforming the same-scale base model and some larger general-purpose models. ChipLingo-32B reaches 70.02%, approaching leading closed-source commercial models. Further analysis shows that QA augmentation improves domain performance, Partial FT offers a better balance between adaptation and general capability retention than LoRA, and explicit RAG scenario training mitigates the decline in retrieval utilization after domain training. These results demonstrate the practical value of systematic domain training for knowledge-intensive EDA tasks and provide a foundation for future EDA agents and external-knowledge-driven systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ChipLingo, a three-stage training pipeline for domain-adapted LLMs in Electronic Design Automation (EDA). Stage 1 constructs a domain corpus via multi-source curation and QA augmentation; Stage 2 performs domain-adaptive pretraining with comparisons of parameter-efficient strategies; Stage 3 conducts instruction alignment including explicit RAG scenario training under varied retrieval conditions. The authors curate an internal EDA-Bench covering representative EDA tool scenarios (with public release planned) and report that ChipLingo-8B reaches 59.7% accuracy while ChipLingo-32B reaches 70.02%, outperforming same-scale base models and some larger general LLMs. Ablations indicate benefits from QA augmentation, Partial FT over LoRA for balancing adaptation and retention, and RAG training to preserve retrieval utilization.
Significance. If the empirical results hold under rigorous scrutiny, the work supplies a concrete, reproducible pipeline for adapting LLMs to knowledge-intensive engineering domains such as EDA. The explicit isolation of contributions via ablations on QA augmentation, training strategy, and RAG conditioning, together with the planned public EDA-Bench release, would constitute a useful foundation for subsequent EDA agents and retrieval-augmented systems.
major comments (3)
- [Experiments] Experiments section: The central accuracy claims (59.7% for the 8B model and 70.02% for the 32B model) are presented without error bars, number of evaluation runs, statistical significance tests, or explicit description of the EDA-Bench test-set size and construction. These omissions are load-bearing because the paper's primary contribution is the measured performance lift over baselines.
- [Benchmark Construction] EDA-Bench description: The benchmark is described only at high level as covering 'representative EDA tool scenarios.' Without details on scenario taxonomy, question generation process, or explicit checks for overlap with the training corpus, it is impossible to verify that the reported gains reflect genuine domain adaptation rather than benchmark contamination or narrow coverage.
- [Ablation Studies] Ablation analysis: The claims that QA augmentation, Partial FT (versus LoRA), and RAG scenario training each contribute incremental value rest on comparisons whose controls for total data volume, training steps, and hyper-parameters are not quantified. This weakens attribution of the observed improvements specifically to the three-stage pipeline.
minor comments (2)
- [Abstract] The abbreviation 'Partial FT' appears without an explicit definition or reference to the precise layers or parameters being updated; a short clarification would improve readability.
- [Results] The manuscript should include a table summarizing all compared models (base, ChipLingo variants, larger general LLMs, closed-source) with their parameter counts, training regimes, and exact EDA-Bench scores for direct comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor, benchmark transparency, and ablation controls that will improve the manuscript. We address each point below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
-
Referee: [Experiments] Experiments section: The central accuracy claims (59.7% for the 8B model and 70.02% for the 32B model) are presented without error bars, number of evaluation runs, statistical significance tests, or explicit description of the EDA-Bench test-set size and construction. These omissions are load-bearing because the paper's primary contribution is the measured performance lift over baselines.
Authors: We agree that these details are necessary to substantiate the performance claims. In the revised manuscript we will report results with error bars computed over multiple independent evaluation runs, explicitly state the number of runs and any statistical significance tests performed, and provide a full description of the EDA-Bench test-set size together with its construction methodology. These additions will be placed in the Experiments section to allow readers to assess the reliability of the reported lifts. revision: yes
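The error bars the authors commit to amount to a mean and standard error over independent evaluation runs. A minimal sketch, with made-up per-run accuracies standing in for real results:

```python
# Sketch of the error-bar reporting promised in the rebuttal: mean accuracy
# and standard error over independent evaluation runs. The run scores below
# are fabricated for illustration only.
from statistics import mean, stdev

def summarize(run_accuracies):
    m = mean(run_accuracies)
    se = stdev(run_accuracies) / len(run_accuracies) ** 0.5
    return m, se

runs = [59.1, 60.2, 59.8, 59.7]  # hypothetical per-run accuracies (%)
m, se = summarize(runs)
```

Reporting the pair (mean, standard error) alongside the number of runs is the minimum needed for readers to judge whether the 8B-vs-baseline lift exceeds run-to-run noise.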
-
Referee: [Benchmark Construction] EDA-Bench description: The benchmark is described only at high level as covering 'representative EDA tool scenarios.' Without details on scenario taxonomy, question generation process, or explicit checks for overlap with the training corpus, it is impossible to verify that the reported gains reflect genuine domain adaptation rather than benchmark contamination or narrow coverage.
Authors: We acknowledge that the current high-level description is insufficient. The revised version will expand the EDA-Bench subsection to include (1) the scenario taxonomy used to ensure representative coverage, (2) the question generation process (including sourcing and curation steps), and (3) the procedures employed to verify absence of overlap with the training corpus. Because public release of EDA-Bench is already planned, these details will also be documented in the accompanying release materials. revision: yes
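The train/test overlap check promised here is commonly implemented as long-n-gram matching between benchmark items and training documents. A sketch with an illustrative 8-gram threshold and whitespace tokenization (both are assumptions, not the paper's procedure):

```python
# Common contamination check: flag benchmark items that share any long
# n-gram with the training corpus. Threshold and tokenization are
# illustrative choices, not taken from the paper.

def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(bench_item, corpus_docs, n=8):
    item_grams = ngrams(bench_item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus_docs)

corpus = ["the report_timing command prints the worst slack path "
          "for each clock group in the design"]
clean_q = "how do I constrain a generated clock in the synthesis flow please"
leaked_q = "the report_timing command prints the worst slack path for each clock group here"
```

A benchmark item sharing even one 8-gram with the corpus is flagged for manual review; documenting the threshold and hit rate in the release materials would make the overlap claim auditable.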
-
Referee: [Ablation Studies] Ablation analysis: The claims that QA augmentation, Partial FT (versus LoRA), and RAG scenario training each contribute incremental value rest on comparisons whose controls for total data volume, training steps, and hyper-parameters are not quantified. This weakens attribution of the observed improvements specifically to the three-stage pipeline.
Authors: We agree that stronger controls are required for credible attribution. In the revised ablation studies we will explicitly report the total data volume, number of training steps, and hyper-parameter configurations used for each compared variant (QA augmentation, Partial FT vs. LoRA, RAG conditioning). This will enable direct, fair comparison and clearer isolation of the contribution of each stage in the pipeline. revision: yes
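The budget-matching control promised here can be checked mechanically before any ablation comparison is reported. A sketch with hypothetical config fields:

```python
# Sketch of a pre-comparison control check: ablation variants must share
# the same training budget before their scores are compared. Field names
# and values are hypothetical.

def matched(configs, keys=("total_tokens", "steps", "lr")):
    """True iff every config agrees with the first on all budget keys."""
    ref = configs[0]
    return all(all(c[k] == ref[k] for k in keys) for c in configs[1:])

runs = [
    {"name": "qa_aug",    "total_tokens": 10_000_000, "steps": 2000, "lr": 2e-5},
    {"name": "no_qa_aug", "total_tokens": 10_000_000, "steps": 2000, "lr": 2e-5},
]
```

Publishing such a table per ablation (variant, tokens, steps, hyper-parameters, score) is exactly the artifact that would let readers attribute gains to the pipeline stage rather than to budget differences.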
Circularity Check
No significant circularity
full rationale
The paper is an empirical ML systems contribution that describes a three-stage training pipeline (domain corpus construction with QA augmentation, domain-adaptive pretraining, and instruction alignment with RAG scenario training) and reports measured accuracies on a newly curated internal benchmark (EDA-Bench). No equations, derivations, fitted parameters presented as predictions, or self-referential definitions appear in the provided text. Ablations isolate the effects of individual components (QA augmentation, Partial FT vs LoRA, RAG training), and comparisons are made to external base models and larger LLMs. The central claims rest on experimental outcomes rather than any reduction to inputs by construction, satisfying the criteria for a self-contained empirical result.
Axiom & Free-Parameter Ledger
free parameters (1)
- Training hyperparameters (learning rate, epochs, etc.)
axioms (2)
- domain assumption Domain-specific corpus construction plus QA augmentation improves LLM accuracy on EDA tasks
- domain assumption Explicit RAG scenario training prevents degradation of retrieval utilization after domain adaptation
Reference graph
Works this paper leans on
-
[1]
ChatEDA: A Large Language Model Powered Autonomous Agent for EDA
He Z, Wu H, Zhang X, et al. ChatEDA: A Large Language Model Powered Autonomous Agent for EDA. arXiv:2308.10204, 2023
-
[2]
ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model
Xu N, Zhang Z, Qi L, et al. ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model. arXiv:2408.00804, 2024
-
[3]
ChipNeMo: Domain-Adapted LLMs for Chip Design
Liu M, Ene T D, Kirby R, et al. ChipNeMo: Domain-Adapted LLMs for Chip Design. arXiv:2311.00176, 2023
-
[4]
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis P, Perez E, Piktus A, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401, 2020
-
[5]
Retrieval-Augmented Generation for Large Language Models: A Survey
Gao Y, Xiong Y, Gao X, et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997, 2023
-
[6]
Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) for Enterprise Knowledge Management and Document Automation: A Systematic Literature Review
Karakurt E, Akbulut A. Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) for Enterprise Knowledge Management and Document Automation: A Systematic Literature Review. Applied Sciences, 16(1):368, 2026. doi:10.3390/app16010368
-
[7]
SaulLM-7B: A pioneering Large Language Model for Law
Colombo P, Pessoa Pires T, Boudiaf M, et al. SaulLM-7B: A pioneering Large Language Model for Law. arXiv:2403.03883, 2024
-
[8]
Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA
Pu Y, He Z, Qiu T, et al. Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA. arXiv:2407.15353, 2024
-
[9]
VerilogEval: Evaluating Large Language Models for Verilog Code Generation
Liu M, Pinckney N, Khailany B, et al. VerilogEval: Evaluating Large Language Models for Verilog Code Generation. arXiv:2309.07544, 2023
-
[10]
RAFT: Adapting Language Model to Domain Specific RAG
Zhang T, Patil S G, Jain N, et al. RAFT: Adapting Language Model to Domain Specific RAG. arXiv:2403.10131, 2024
-
[11]
Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
Soudani H, Kanoulas E, Hasibi F. Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge. arXiv:2403.01432, 2024
-
[12]
iScript: A Domain-Adapted Large Language Model and Benchmark for Physical Design Tcl Script Generation
Xu N, Zhang Z, Shu S, et al. iScript: A Domain-Adapted Large Language Model and Benchmark for Physical Design Tcl Script Generation. arXiv:2603.04476, 2026
-
[13]
Parameter-Efficient Transfer Learning for NLP
Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research 97:2790–2799, 2019
-
[14]
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Li X L, Liang P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190, 2021
-
[15]
LoRA: Low-Rank Adaptation of Large Language Models
Hu E J, Shen Y, Wallis P, et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, 2021
-
[16]
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Zhang Q, Chen M, Bukharin A, et al. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. arXiv:2303.10512, 2023
-
[17]
QLoRA: Efficient Finetuning of Quantized LLMs
Dettmers T, Pagnoni A, Holtzman A, et al. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314, 2023
-
[18]
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Pletenev S, Marina M, Moskovskiy D, et al. How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4309–4322, 2025. doi:10.18653/v1/2025.findings-naacl.243
-
[19]
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Asai A, Wu Z, Wang Y, et al. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511, 2023
-
[20]
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Edge D, Trinh H, Cheng N, et al. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130, 2024
-
[21]
HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction
Sarmah B, Hall B, Rao R, et al. HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction. arXiv:2408.04948, 2024
-
[22]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Zheng L, Chiang W L, Sheng Y, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems 36 (Datasets and Benchmarks Track), 2023. arXiv:2306.05685
-
[23]
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Liu Y, Iter D, Xu Y, Wang S, Xu R, Zhu C. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2511–2522, 2023. doi:10.18653/v1/2023.emnlp-main.153
-
[24]
Instruction-Following Evaluation for Large Language Models
Zhou J, Lu T, Mishra S, et al. Instruction-Following Evaluation for Large Language Models. arXiv:2311.07911, 2023
-
[25]
Measuring Short-Form Factuality in Large Language Models
OpenAI. Measuring Short-Form Factuality in Large Language Models. 2024. Available at: https://cdn.openai.com/papers/simpleqa.pdf
-
[26]
Evaluating Large Language Models Trained on Code
Chen M, Tworek J, Jun H, et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021
-
[27]
Qwen3 Technical Report
Yang A, Li A, Yang B, et al. Qwen3 Technical Report. arXiv:2505.09388, 2025