ChipLingo: A Systematic Training Framework for Large Language Models in EDA
Pith reviewed 2026-05-07 09:36 UTC · model grok-4.3
The pith
A three-stage training pipeline adapts large language models to electronic design automation by curating domain data, performing adaptive pretraining, and aligning instructions to retrieval-augmented generation scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ChipLingo demonstrates that a systematic three-stage pipeline—domain corpus construction via multi-source curation and QA augmentation, domain-adaptive pretraining with comparisons of parameter training strategies, and instruction alignment that includes RAG scenario training under varied retrieval conditions—produces LLMs with substantially improved performance on representative EDA tool tasks, as measured by 59.7% accuracy for the 8B variant and 70.02% for the 32B variant on the curated EDA-Bench.
What carries the argument
The ChipLingo three-stage pipeline, which couples multi-source data curation with QA augmentation, partial-parameter domain-adaptive pretraining, and explicit retrieval-augmented generation scenario instruction tuning.
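The three stages compose sequentially, each consuming the previous stage's output. A minimal sketch of that flow, in which every function name, field, and data shape is hypothetical rather than the authors' implementation:

```python
# Hypothetical sketch of the ChipLingo three-stage pipeline described in the
# paper. Names and structures are illustrative, not the authors' API.

def build_corpus(raw_sources):
    """Stage 1: multi-source curation plus QA augmentation."""
    curated = [doc.strip() for doc in raw_sources if doc.strip()]
    qa_pairs = [{"q": f"What does this cover? {d[:30]}", "a": d} for d in curated]
    return curated + [f"Q: {p['q']} A: {p['a']}" for p in qa_pairs]

def adaptive_pretrain(model, corpus, strategy="partial_ft"):
    """Stage 2: domain-adaptive pretraining; `strategy` selects which
    parameters are updated (e.g. partial fine-tuning vs. LoRA)."""
    model["seen_tokens"] = model.get("seen_tokens", 0) + sum(len(d.split()) for d in corpus)
    model["strategy"] = strategy
    return model

def instruction_align(model, rag_conditions):
    """Stage 3: instruction alignment with RAG scenarios covering
    diverse retrieval conditions (relevant, noisy, empty contexts)."""
    model["rag_conditions"] = sorted(rag_conditions)
    return model

model = {"name": "base-8b"}
corpus = build_corpus(["Tool X timing report syntax.", "  ", "Tool Y placement flow."])
model = adaptive_pretrain(model, corpus, strategy="partial_ft")
model = instruction_align(model, ["relevant", "noisy", "empty"])
```

The point of the sketch is the ordering constraint the paper emphasizes: corpus construction precedes pretraining, and RAG conditioning happens only at the alignment stage.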
If this is right
- QA augmentation during corpus construction measurably improves domain task performance.
- Partial fine-tuning achieves a better balance between domain adaptation and retention of general capabilities than LoRA.
- Explicit training on diverse RAG retrieval conditions prevents the drop in retrieval utilization that otherwise follows domain pretraining.
- The resulting models provide a practical foundation for future EDA agents and external-knowledge-driven systems.
- Systematic domain training delivers concrete value on knowledge-intensive EDA tasks.
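The Partial FT vs. LoRA contrast in the bullets above comes down to which parameters receive gradient updates. A toy sketch of the two selection rules, where the layer naming and the last-8-layers cutoff are hypothetical choices, not taken from the paper:

```python
# Illustrative contrast between partial fine-tuning and LoRA-style adaptation
# over a toy parameter list. Layer names and the cutoff are hypothetical.

def trainable_params(param_names, mode, n_layers=32, partial_last=8):
    if mode == "partial_ft":
        # Update only the last `partial_last` transformer blocks in place;
        # everything earlier stays frozen at its pretrained values.
        cutoff = n_layers - partial_last
        return {p for p in param_names
                if p.startswith("layers.") and int(p.split(".")[1]) >= cutoff}
    if mode == "lora":
        # Freeze every base weight; train only injected low-rank adapters.
        return ({p + ".lora_A" for p in param_names}
                | {p + ".lora_B" for p in param_names})
    raise ValueError(mode)

names = [f"layers.{i}.attn.q_proj" for i in range(32)]
partial = trainable_params(names, "partial_ft")
lora = trainable_params(names, "lora")
```

Partial FT changes a subset of the original weights directly, while LoRA leaves them all intact and learns additive low-rank corrections, which is why the two trade off adaptation depth against retention differently.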
Where Pith is reading between the lines
- Releasing EDA-Bench publicly would enable independent replication and comparison against other domain-adaptation methods on the same tasks.
- The same staged pipeline could be tested on adjacent engineering domains that rely on tool documentation and retrieval, such as verification or physical design.
- Further scaling the model size within the same pipeline might narrow the remaining gap to leading closed-source models without additional data changes.
- Ablation studies that isolate each stage on the public benchmark would clarify which component contributes most to the observed accuracy lift.
Load-bearing premise
Performance gains arise specifically from the ordered three-stage pipeline rather than from data volume, model scale, or unstated implementation choices, and the internal EDA-Bench accurately captures real-world EDA tool scenarios.
What would settle it
Re-evaluating the trained models on an independent, publicly released set of EDA tool interaction traces or open EDA datasets and observing no gain over the base models of the same size would falsify the claim that the pipeline itself supplies the domain improvement.
Original abstract
With the rapid advancement of semiconductor technology, Electronic Design Automation (EDA) has become an increasingly knowledge-intensive and document-driven engineering domain. Although large language models (LLMs) have shown strong general capabilities, applying them directly to EDA remains challenging due to limited domain expertise, cross-tool knowledge confusion, and degraded retrieval-augmented generation (RAG) performance after domain training. To address these issues, this paper presents ChipLingo, a systematic training pipeline for domain-adapted LLMs tailored to EDA scenarios. ChipLingo consists of three stages: domain corpus construction with multi-source data curation and QA augmentation, domain-adaptive pretraining with comparisons of different parameter training strategies, and instruction alignment with RAG scenario training under diverse retrieval conditions. We also curate an internal benchmark, EDA-Bench, covering representative EDA tool scenarios, with plans for public release. Experiments show that ChipLingo-8B achieves 59.7% accuracy on EDA-Bench, outperforming the same-scale base model and some larger general-purpose models. ChipLingo-32B reaches 70.02%, approaching leading closed-source commercial models. Further analysis shows that QA augmentation improves domain performance, Partial FT offers a better balance between adaptation and general capability retention than LoRA, and explicit RAG scenario training mitigates the decline in retrieval utilization after domain training. These results demonstrate the practical value of systematic domain training for knowledge-intensive EDA tasks and provide a foundation for future EDA agents and external-knowledge-driven systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ChipLingo, a three-stage training pipeline for domain-adapted LLMs in Electronic Design Automation (EDA). Stage 1 constructs a domain corpus via multi-source curation and QA augmentation; Stage 2 performs domain-adaptive pretraining with comparisons of parameter-efficient strategies; Stage 3 conducts instruction alignment including explicit RAG scenario training under varied retrieval conditions. The authors curate an internal EDA-Bench covering representative EDA tool scenarios (with public release planned) and report that ChipLingo-8B reaches 59.7% accuracy while ChipLingo-32B reaches 70.02%, outperforming same-scale base models and some larger general LLMs. Ablations indicate benefits from QA augmentation, Partial FT over LoRA for balancing adaptation and retention, and RAG training to preserve retrieval utilization.
Significance. If the empirical results hold under rigorous scrutiny, the work supplies a concrete, reproducible pipeline for adapting LLMs to knowledge-intensive engineering domains such as EDA. The explicit isolation of contributions via ablations on QA augmentation, training strategy, and RAG conditioning, together with the planned public EDA-Bench release, would constitute a useful foundation for subsequent EDA agents and retrieval-augmented systems.
major comments (3)
- [Experiments] Experiments section: The central accuracy claims (59.7% for the 8B model and 70.02% for the 32B model) are presented without error bars, number of evaluation runs, statistical significance tests, or explicit description of the EDA-Bench test-set size and construction. These omissions are load-bearing because the paper's primary contribution is the measured performance lift over baselines.
- [Benchmark Construction] EDA-Bench description: The benchmark is described only at high level as covering 'representative EDA tool scenarios.' Without details on scenario taxonomy, question generation process, or explicit checks for overlap with the training corpus, it is impossible to verify that the reported gains reflect genuine domain adaptation rather than benchmark contamination or narrow coverage.
- [Ablation Studies] Ablation analysis: The claims that QA augmentation, Partial FT (versus LoRA), and RAG scenario training each contribute incremental value rest on comparisons whose controls for total data volume, training steps, and hyper-parameters are not quantified. This weakens attribution of the observed improvements specifically to the three-stage pipeline.
minor comments (2)
- [Abstract] The abbreviation 'Partial FT' appears without an explicit definition or reference to the precise layers or parameters being updated; a short clarification would improve readability.
- [Results] The manuscript should include a table summarizing all compared models (base, ChipLingo variants, larger general LLMs, closed-source) with their parameter counts, training regimes, and exact EDA-Bench scores for direct comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor, benchmark transparency, and ablation controls that will improve the manuscript. We address each point below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
-
Referee: [Experiments] Experiments section: The central accuracy claims (59.7% for the 8B model and 70.02% for the 32B model) are presented without error bars, number of evaluation runs, statistical significance tests, or explicit description of the EDA-Bench test-set size and construction. These omissions are load-bearing because the paper's primary contribution is the measured performance lift over baselines.
Authors: We agree that these details are necessary to substantiate the performance claims. In the revised manuscript we will report results with error bars computed over multiple independent evaluation runs, explicitly state the number of runs and any statistical significance tests performed, and provide a full description of the EDA-Bench test-set size together with its construction methodology. These additions will be placed in the Experiments section to allow readers to assess the reliability of the reported lifts. revision: yes
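The error bars the authors commit to amount to a mean and standard error over independent evaluation runs. A minimal sketch, with made-up per-run accuracies standing in for real results:

```python
# Sketch of the error-bar reporting promised in the rebuttal: mean accuracy
# and standard error over independent evaluation runs. The run scores below
# are fabricated for illustration only.
from statistics import mean, stdev

def summarize(run_accuracies):
    m = mean(run_accuracies)
    se = stdev(run_accuracies) / len(run_accuracies) ** 0.5
    return m, se

runs = [59.1, 60.2, 59.8, 59.7]  # hypothetical per-run accuracies (%)
m, se = summarize(runs)
```

Reporting the pair (mean, standard error) alongside the number of runs is the minimum needed for readers to judge whether the 8B-vs-baseline lift exceeds run-to-run noise.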
-
Referee: [Benchmark Construction] EDA-Bench description: The benchmark is described only at high level as covering 'representative EDA tool scenarios.' Without details on scenario taxonomy, question generation process, or explicit checks for overlap with the training corpus, it is impossible to verify that the reported gains reflect genuine domain adaptation rather than benchmark contamination or narrow coverage.
Authors: We acknowledge that the current high-level description is insufficient. The revised version will expand the EDA-Bench subsection to include (1) the scenario taxonomy used to ensure representative coverage, (2) the question generation process (including sourcing and curation steps), and (3) the procedures employed to verify absence of overlap with the training corpus. Because public release of EDA-Bench is already planned, these details will also be documented in the accompanying release materials. revision: yes
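The train/test overlap check promised here is commonly implemented as long-n-gram matching between benchmark items and training documents. A sketch with an illustrative 8-gram threshold and whitespace tokenization (both are assumptions, not the paper's procedure):

```python
# Common contamination check: flag benchmark items that share any long
# n-gram with the training corpus. Threshold and tokenization are
# illustrative choices, not taken from the paper.

def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(bench_item, corpus_docs, n=8):
    item_grams = ngrams(bench_item, n)
    return any(item_grams & ngrams(doc, n) for doc in corpus_docs)

corpus = ["the report_timing command prints the worst slack path "
          "for each clock group in the design"]
clean_q = "how do I constrain a generated clock in the synthesis flow please"
leaked_q = "the report_timing command prints the worst slack path for each clock group here"
```

A benchmark item sharing even one 8-gram with the corpus is flagged for manual review; documenting the threshold and hit rate in the release materials would make the overlap claim auditable.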
-
Referee: [Ablation Studies] Ablation analysis: The claims that QA augmentation, Partial FT (versus LoRA), and RAG scenario training each contribute incremental value rest on comparisons whose controls for total data volume, training steps, and hyper-parameters are not quantified. This weakens attribution of the observed improvements specifically to the three-stage pipeline.
Authors: We agree that stronger controls are required for credible attribution. In the revised ablation studies we will explicitly report the total data volume, number of training steps, and hyper-parameter configurations used for each compared variant (QA augmentation, Partial FT vs. LoRA, RAG conditioning). This will enable direct, fair comparison and clearer isolation of the contribution of each stage in the pipeline. revision: yes
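The budget-matching control promised here can be checked mechanically before any ablation comparison is reported. A sketch with hypothetical config fields:

```python
# Sketch of a pre-comparison control check: ablation variants must share
# the same training budget before their scores are compared. Field names
# and values are hypothetical.

def matched(configs, keys=("total_tokens", "steps", "lr")):
    """True iff every config agrees with the first on all budget keys."""
    ref = configs[0]
    return all(all(c[k] == ref[k] for k in keys) for c in configs[1:])

runs = [
    {"name": "qa_aug",    "total_tokens": 10_000_000, "steps": 2000, "lr": 2e-5},
    {"name": "no_qa_aug", "total_tokens": 10_000_000, "steps": 2000, "lr": 2e-5},
]
```

Publishing such a table per ablation (variant, tokens, steps, hyper-parameters, score) is exactly the artifact that would let readers attribute gains to the pipeline stage rather than to budget differences.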
Circularity Check
No significant circularity
full rationale
The paper is an empirical ML systems contribution that describes a three-stage training pipeline (domain corpus construction with QA augmentation, domain-adaptive pretraining, and instruction alignment with RAG scenario training) and reports measured accuracies on a newly curated internal benchmark (EDA-Bench). No equations, derivations, fitted parameters presented as predictions, or self-referential definitions appear in the provided text. Ablations isolate the effects of individual components (QA augmentation, Partial FT vs LoRA, RAG training), and comparisons are made to external base models and larger LLMs. The central claims rest on experimental outcomes rather than any reduction to inputs by construction, satisfying the criteria for a self-contained empirical result.
Axiom & Free-Parameter Ledger
free parameters (1)
- Training hyperparameters (learning rate, epochs, etc.)
axioms (2)
- domain assumption Domain-specific corpus construction plus QA augmentation improves LLM accuracy on EDA tasks
- domain assumption Explicit RAG scenario training prevents degradation of retrieval utilization after domain adaptation
Reference graph
Works this paper leans on
-
[1]
ChatEDA: A Large Language Model Powered Autonomous Agent for EDA
He Z, Wu H, Zhang X, et al. ChatEDA: A Large Language Model Powered Autonomous Agent for EDA. arXiv:2308.10204, 2023
-
[2]
ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model
Xu N, Zhang Z, Qi L, et al. ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model. arXiv:2408.00804, 2024
-
[3]
ChipNeMo: Domain-Adapted LLMs for Chip Design
Liu M, Ene T D, Kirby R, et al. ChipNeMo: Domain-Adapted LLMs for Chip Design. arXiv:2311.00176, 2023
-
[4]
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Lewis P, Perez E, Piktus A, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401, 2020
-
[5]
Retrieval-Augmented Generation for Large Language Models: A Survey
Gao Y, Xiong Y, Gao X, et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997, 2023
-
[6]
Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) for Enterprise Knowledge Management and Document Automation: A Systematic Literature Review
Karakurt E, Akbulut A. Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) for Enterprise Knowledge Management and Document Automation: A Systematic Literature Review. Applied Sciences, 16(1):368, 2026. doi:10.3390/app16010368
-
[7]
SaulLM-7B: A pioneering Large Language Model for Law
Colombo P, Pessoa Pires T, Boudiaf M, et al. SaulLM-7B: A pioneering Large Language Model for Law. arXiv:2403.03883, 2024
-
[8]
Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA
Pu Y, He Z, Qiu T, et al. Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA. arXiv:2407.15353, 2024
-
[9]
VerilogEval: Evaluating Large Language Models for Verilog Code Generation
Liu M, Pinckney N, Khailany B, et al. VerilogEval: Evaluating Large Language Models for Verilog Code Generation. arXiv:2309.07544, 2023
-
[10]
RAFT: Adapting Language Model to Domain Specific RAG
Zhang T, Patil S G, Jain N, et al. RAFT: Adapting Language Model to Domain Specific RAG. arXiv:2403.10131, 2024
-
[11]
Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
Soudani H, Kanoulas E, Hasibi F. Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge. arXiv:2403.01432, 2024
-
[12]
iScript: A Domain-Adapted Large Language Model and Benchmark for Physical Design Tcl Script Generation
Xu N, Zhang Z, Shu S, et al. iScript: A Domain-Adapted Large Language Model and Benchmark for Physical Design Tcl Script Generation. arXiv:2603.04476, 2026
-
[13]
Parameter-Efficient Transfer Learning for NLP
Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research 97:2790–2799, 2019
-
[14]
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Li X L, Liang P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190, 2021
-
[15]
LoRA: Low-Rank Adaptation of Large Language Models
Hu E J, Shen Y, Wallis P, et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, 2021
-
[16]
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Zhang Q, Chen M, Bukharin A, et al. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. arXiv:2303.10512, 2023
-
[17]
QLoRA: Efficient Finetuning of Quantized LLMs
Dettmers T, Pagnoni A, Holtzman A, et al. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314, 2023
-
[18]
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Pletenev S, Marina M, Moskovskiy D, et al. How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4309–4322, 2025. doi:10.18653/v1/2025.findings-naacl.243
-
[19]
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Asai A, Wu Z, Wang Y, et al. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511, 2023
-
[20]
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Edge D, Trinh H, Cheng N, et al. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130, 2024
-
[21]
HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction
Sarmah B, Hall B, Rao R, et al. HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction. arXiv:2408.04948, 2024
-
[22]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Zheng L, Chiang W L, Sheng Y, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems 36 (Datasets and Benchmarks Track), 2023. arXiv:2306.05685
-
[23]
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Liu Y, Iter D, Xu Y, Wang S, Xu R, Zhu C. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2511–2522, 2023. doi:10.18653/v1/2023.emnlp-main.153
-
[24]
Instruction-Following Evaluation for Large Language Models
Zhou J, Lu T, Mishra S, et al. Instruction-Following Evaluation for Large Language Models. arXiv:2311.07911, 2023
-
[25]
Measuring Short-Form Factuality in Large Language Models
OpenAI. Measuring Short-Form Factuality in Large Language Models. 2024. Available at: https://cdn.openai.com/papers/simpleqa.pdf
-
[26]
Evaluating Large Language Models Trained on Code
Chen M, Tworek J, Jun H, et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021
-
[27]
Qwen3 Technical Report
Yang A, Li A, Yang B, et al. Qwen3 Technical Report. arXiv:2505.09388, 2025