pith. machine review for the scientific record

arxiv: 2604.27415 · v1 · submitted 2026-04-30 · 💻 cs.LG


ChipLingo: A Systematic Training Framework for Large Language Models in EDA

Jianguo Ni, Jian Zhao, Jieqiong Zhang, Junxuan Zhu, Lei Li, Xingwen Yu, Zhi Liu


Pith reviewed 2026-05-07 09:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords Large language models · Electronic design automation · Domain adaptation · Pretraining · Instruction tuning · Retrieval-augmented generation · EDA benchmark · Parameter-efficient fine-tuning

The pith

A three-stage training pipeline adapts large language models to electronic design automation by curating domain data, performing domain-adaptive pretraining, and aligning instruction tuning to retrieval-augmented generation scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that general-purpose LLMs can acquire usable expertise in the document-heavy, tool-specific domain of EDA through a repeatable sequence of data preparation, pretraining, and targeted instruction tuning. A sympathetic reader would care because direct application of base models fails on cross-tool knowledge, and retrieval performance drops after domain exposure, limiting automation in semiconductor design. The authors show that QA-augmented corpus construction, partial-parameter fine-tuning, and explicit RAG-scenario training together raise accuracy on their internal EDA-Bench to 59.7 percent for an 8B model and 70.02 percent for a 32B model. These figures exceed those of the corresponding base models, surpass some larger general-purpose LLMs, and approach closed-source commercial systems. The work therefore supplies both an empirical recipe and a benchmark for building reliable domain-adapted models in knowledge-intensive engineering fields.
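
To make the first stage concrete, here is a minimal sketch of QA augmentation over documentation chunks. This is an editorial illustration, not the authors' tooling: llm_generate, the prompt wording, and the filtering step are all assumed placeholders, since the abstract does not specify them.

```python
import json

QA_PROMPT = """You are an expert on EDA tool documentation. From the manual
excerpt below, write {n} question-answer pairs that test practical tool
knowledge. Return a JSON list of objects with "question" and "answer" keys.

Excerpt:
{chunk}
"""

def llm_generate(prompt: str) -> str:
    """Hypothetical hook into whichever LLM performs the augmentation."""
    raise NotImplementedError("wire this to a real model endpoint")

def augment_chunk(chunk: str, n: int = 3) -> list[dict]:
    """Turn one corpus chunk into up to n QA pairs for the training mix."""
    pairs = json.loads(llm_generate(QA_PROMPT.format(n=n, chunk=chunk)))
    # Keep only well-formed pairs; a real pipeline would add dedup and quality filters.
    return [p for p in pairs if p.get("question") and p.get("answer")]
```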

Core claim

ChipLingo demonstrates that a systematic three-stage pipeline (domain corpus construction via multi-source curation and QA augmentation, domain-adaptive pretraining with comparisons of parameter training strategies, and instruction alignment that includes RAG scenario training under varied retrieval conditions) produces LLMs with substantially improved performance on representative EDA tool tasks, as measured by 59.7 percent accuracy for the 8B variant and 70.02 percent for the 32B variant on the curated EDA-Bench.

What carries the argument

The ChipLingo three-stage pipeline that couples multi-source data curation plus QA augmentation, partial-parameter domain-adaptive pretraining, and explicit retrieval-augmented generation scenario instruction tuning.
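
The third stage is the most distinctive link in this chain. As a hedged sketch of what "RAG scenario training under diverse retrieval conditions" could look like, the sample builder below varies whether the gold passage is present, mixed with distractors, or absent, in the spirit of RAFT-style training; the condition names, mixture, and field layout are assumptions, not the authors' recipe.

```python
import random

def make_rag_sample(question: str, answer: str, gold: str,
                    distractors: list[str], rng: random.Random) -> dict:
    """Build one instruction-tuning sample under a randomly drawn retrieval condition."""
    condition = rng.choice(["gold_only", "gold_plus_noise", "noise_only"])
    if condition == "gold_only":
        context = [gold]
    elif condition == "gold_plus_noise":
        context = [gold] + rng.sample(distractors, k=min(3, len(distractors)))
        rng.shuffle(context)
    else:
        # Nothing useful retrieved: the model must answer from parametric knowledge
        # or abstain, a condition that plain domain tuning never exposes it to.
        context = rng.sample(distractors, k=min(4, len(distractors)))
    prompt = "Context:\n" + "\n---\n".join(context) + f"\n\nQuestion: {question}"
    return {"condition": condition, "prompt": prompt, "target": answer}
```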

If this is right

  • QA augmentation during corpus construction measurably improves domain task performance.
  • Partial fine-tuning achieves a better balance between domain adaptation and retention of general capabilities than LoRA (see the sketch after this list).
  • Explicit training on diverse RAG retrieval conditions prevents the drop in retrieval utilization that otherwise follows domain pretraining.
  • The resulting models provide a practical foundation for future EDA agents and external-knowledge-driven systems.
  • Systematic domain training delivers concrete value on knowledge-intensive EDA tasks.
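
To pin down the second bullet, the sketch below contrasts the two strategies using Hugging Face transformers and peft. The base checkpoint and the choice to unfreeze the top third of decoder layers are illustrative assumptions; the paper's exact Partial FT recipe is not given in the abstract.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

def partial_ft(model, top_fraction: float = 1 / 3):
    """Partial FT: freeze everything, then unfreeze only the top decoder layers."""
    for p in model.parameters():
        p.requires_grad = False
    layers = model.model.layers  # decoder block list in Llama/Qwen-style causal LMs
    cutoff = int(len(layers) * (1 - top_fraction))
    for layer in layers[cutoff:]:
        for p in layer.parameters():
            p.requires_grad = True
    return model

def lora_ft(model):
    """LoRA: keep all base weights frozen and train low-rank adapters instead."""
    cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                     target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    return get_peft_model(model, cfg)

# The checkpoint is an assumed stand-in; the abstract does not name the base model.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
tuned = partial_ft(base)  # or: tuned = lora_ft(base)
```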

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Releasing EDA-Bench publicly would enable independent replication and comparison against other domain-adaptation methods on the same tasks.
  • The same staged pipeline could be tested on adjacent engineering domains that rely on tool documentation and retrieval, such as verification or physical design.
  • Further scaling the model size within the same pipeline might narrow the remaining gap to leading closed-source models without additional data changes.
  • Ablation studies that isolate each stage on the public benchmark would clarify which component contributes most to the observed accuracy lift.

Load-bearing premise

Performance gains arise specifically from the ordered three-stage pipeline rather than from data volume, model scale, or unstated implementation choices, and the internal EDA-Bench accurately captures real-world EDA tool scenarios.

What would settle it

Re-evaluating the trained models on an independent, publicly released set of EDA tool interaction traces or open EDA datasets and observing no gain over the base models of the same size would falsify the claim that the pipeline itself supplies the domain improvement.

Original abstract

With the rapid advancement of semiconductor technology, Electronic Design Automation (EDA) has become an increasingly knowledge-intensive and document-driven engineering domain. Although large language models (LLMs) have shown strong general capabilities, applying them directly to EDA remains challenging due to limited domain expertise, cross-tool knowledge confusion, and degraded retrieval-augmented generation (RAG) performance after domain training. To address these issues, this paper presents ChipLingo, a systematic training pipeline for domain-adapted LLMs tailored to EDA scenarios. ChipLingo consists of three stages: domain corpus construction with multi-source data curation and QA augmentation, domain-adaptive pretraining with comparisons of different parameter training strategies, and instruction alignment with RAG scenario training under diverse retrieval conditions. We also curate an internal benchmark, EDA-Bench, covering representative EDA tool scenarios, with plans for public release. Experiments show that ChipLingo-8B achieves 59.7% accuracy on EDA-Bench, outperforming the same-scale base model and some larger general-purpose models. ChipLingo-32B reaches 70.02%, approaching leading closed-source commercial models. Further analysis shows that QA augmentation improves domain performance, Partial FT offers a better balance between adaptation and general capability retention than LoRA, and explicit RAG scenario training mitigates the decline in retrieval utilization after domain training. These results demonstrate the practical value of systematic domain training for knowledge-intensive EDA tasks and provide a foundation for future EDA agents and external-knowledge-driven systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ChipLingo, a three-stage training pipeline for domain-adapted LLMs in Electronic Design Automation (EDA). Stage 1 constructs a domain corpus via multi-source curation and QA augmentation; Stage 2 performs domain-adaptive pretraining with comparisons of parameter training strategies; Stage 3 conducts instruction alignment including explicit RAG scenario training under varied retrieval conditions. The authors curate an internal EDA-Bench covering representative EDA tool scenarios (with public release planned) and report that ChipLingo-8B reaches 59.7% accuracy while ChipLingo-32B reaches 70.02%, outperforming same-scale base models and some larger general LLMs. Ablations indicate benefits from QA augmentation, Partial FT over LoRA for balancing adaptation and retention, and RAG training to preserve retrieval utilization.

Significance. If the empirical results hold under rigorous scrutiny, the work supplies a concrete, reproducible pipeline for adapting LLMs to knowledge-intensive engineering domains such as EDA. The explicit isolation of contributions via ablations on QA augmentation, training strategy, and RAG conditioning, together with the planned public EDA-Bench release, would constitute a useful foundation for subsequent EDA agents and retrieval-augmented systems.

major comments (3)
  1. [Experiments] Experiments section: The central accuracy claims (59.7% for the 8B model and 70.02% for the 32B model) are presented without error bars, number of evaluation runs, statistical significance tests, or an explicit description of the EDA-Bench test-set size and construction. These omissions are load-bearing because the paper's primary contribution is the measured performance lift over baselines. A paired-bootstrap recipe for such error bars is sketched after this report.
  2. [Benchmark Construction] EDA-Bench description: The benchmark is described only at high level as covering 'representative EDA tool scenarios.' Without details on scenario taxonomy, question generation process, or explicit checks for overlap with the training corpus, it is impossible to verify that the reported gains reflect genuine domain adaptation rather than benchmark contamination or narrow coverage.
  3. [Ablation Studies] Ablation analysis: The claims that QA augmentation, Partial FT (versus LoRA), and RAG scenario training each contribute incremental value rest on comparisons whose controls for total data volume, training steps, and hyper-parameters are not quantified. This weakens attribution of the observed improvements specifically to the three-stage pipeline.
minor comments (2)
  1. [Abstract] The abbreviation 'Partial FT' appears without an explicit definition or reference to the precise layers or parameters being updated; a short clarification would improve readability.
  2. [Results] The manuscript should include a table summarizing all compared models (base, ChipLingo variants, larger general LLMs, closed-source) with their parameter counts, training regimes, and exact EDA-Bench scores for direct comparison.
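
A straightforward way to satisfy the first major comment is a paired bootstrap over benchmark items. This is a standard recipe sketched here for illustration, not anything the manuscript reports; per-item 0/1 scores for both models are assumed available.

```python
import numpy as np

def paired_bootstrap(base_correct: np.ndarray, tuned_correct: np.ndarray,
                     n_boot: int = 10_000, seed: int = 0):
    """95% CI for the accuracy lift of a tuned model over its base model,
    resampling benchmark items with replacement (both models scored per item)."""
    rng = np.random.default_rng(seed)
    n = len(base_correct)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs[b] = tuned_correct[idx].mean() - base_correct[idx].mean()
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), (lo, hi)
```

If the 95% interval on the lift excludes zero, the headline gap over the base model survives item-level resampling; reporting the interval alongside 59.7% and 70.02% would address the comment directly.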

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor, benchmark transparency, and ablation controls that will improve the manuscript. We address each point below and commit to revisions that directly respond to the concerns raised.

point-by-point responses
  1. Referee: [Experiments] Experiments section: The central accuracy claims (59.7% for the 8B model and 70.02% for the 32B model) are presented without error bars, number of evaluation runs, statistical significance tests, or explicit description of the EDA-Bench test-set size and construction. These omissions are load-bearing because the paper's primary contribution is the measured performance lift over baselines.

    Authors: We agree that these details are necessary to substantiate the performance claims. In the revised manuscript we will report results with error bars computed over multiple independent evaluation runs, explicitly state the number of runs and any statistical significance tests performed, and provide a full description of the EDA-Bench test-set size together with its construction methodology. These additions will be placed in the Experiments section to allow readers to assess the reliability of the reported lifts. revision: yes

  2. Referee: [Benchmark Construction] EDA-Bench description: The benchmark is described only at high level as covering 'representative EDA tool scenarios.' Without details on scenario taxonomy, question generation process, or explicit checks for overlap with the training corpus, it is impossible to verify that the reported gains reflect genuine domain adaptation rather than benchmark contamination or narrow coverage.

    Authors: We acknowledge that the current high-level description is insufficient. The revised version will expand the EDA-Bench subsection to include (1) the scenario taxonomy used to ensure representative coverage, (2) the question generation process (including sourcing and curation steps), and (3) the procedures employed to verify absence of overlap with the training corpus (an illustrative overlap check is sketched after these responses). Because public release of EDA-Bench is already planned, these details will also be documented in the accompanying release materials. revision: yes

  3. Referee: [Ablation Studies] Ablation analysis: The claims that QA augmentation, Partial FT (versus LoRA), and RAG scenario training each contribute incremental value rest on comparisons whose controls for total data volume, training steps, and hyper-parameters are not quantified. This weakens attribution of the observed improvements specifically to the three-stage pipeline.

    Authors: We agree that stronger controls are required for credible attribution. In the revised ablation studies we will explicitly report the total data volume, number of training steps, and hyper-parameter configurations used for each compared variant (QA augmentation, Partial FT vs. LoRA, RAG conditioning). This will enable direct, fair comparison and clearer isolation of the contribution of each stage in the pipeline (an illustrative control grid is sketched after these responses). revision: yes
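
Two editorial sketches of the fixes promised above, neither drawn from the manuscript. First, the overlap check in response 2 is commonly realized as verbatim n-gram matching between benchmark questions and training documents; the 13-gram window below follows common LLM decontamination practice and is an assumed knob, not the authors' stated procedure.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word-level n-grams in a text, lowercased for verbatim matching."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(question: str, corpus_docs: list[str], n: int = 13) -> bool:
    """Flag a benchmark question whose n-grams appear verbatim in any training doc."""
    q = ngrams(question, n)
    return bool(q) and any(q & ngrams(doc, n) for doc in corpus_docs)
```

Second, the controls in response 3 amount to bookkeeping: only the three pipeline choices should vary across ablation arms while data and step budgets stay fixed. The grid below illustrates that layout with invented budget values.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Arm:
    qa_augmentation: bool
    strategy: str                      # "partial_ft" or "lora"
    rag_training: bool
    train_tokens: int = 2_000_000_000  # held constant across arms (invented value)
    steps: int = 10_000                # held constant across arms (invented value)

arms = [Arm(qa, s, rag)
        for qa, s, rag in product([False, True], ["partial_ft", "lora"], [False, True])]
```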

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical ML systems contribution that describes a three-stage training pipeline (domain corpus construction with QA augmentation, domain-adaptive pretraining, and instruction alignment with RAG scenario training) and reports measured accuracies on a newly curated internal benchmark (EDA-Bench). No equations, derivations, fitted parameters presented as predictions, or self-referential definitions appear in the provided text. Ablations isolate the effects of individual components (QA augmentation, Partial FT vs LoRA, RAG training), and comparisons are made to external base models and larger LLMs. The central claims rest on experimental outcomes rather than any reduction to inputs by construction, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The work rests on standard machine-learning assumptions that curated domain data and targeted instruction tuning improve task performance; no new entities are postulated and no explicit free parameters beyond typical training choices are named.

free parameters (1)
  • Training hyperparameters (learning rate, epochs, etc.)
    Typical ML training choices required to run the three stages but not enumerated in the abstract.
axioms (2)
  • Domain assumption: Domain-specific corpus construction plus QA augmentation improves LLM accuracy on EDA tasks
    Invoked in the first and second stages of the pipeline.
  • Domain assumption: Explicit RAG scenario training prevents degradation of retrieval utilization after domain adaptation
    Stated as a finding from the third stage.

pith-pipeline@v0.9.0 · 5588 in / 1596 out tokens · 61916 ms · 2026-05-07T09:36:27.575085+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 25 canonical work pages · 12 internal anchors

  1. [1]

    ChatEDA: A Large Language Model Powered Autonomous Agent for EDA

    He Z, Wu H, Zhang X, et al. ChatEDA: A Large Language Model Powered Autonomous Agent for EDA. arXiv:2308.10204, 2023

  2. [2]

    ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model

    Xu N, Zhang Z, Qi L, et al. ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model. arXiv:2408.00804, 2024

  3. [3]

    ChipNeMo: Domain-Adapted LLMs for Chip Design

    Liu M, Ene T D, Kirby R, et al. ChipNeMo: Domain-Adapted LLMs for Chip Design. arXiv:2311.00176, 2023

  4. [4]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Lewis P, Perez E, Piktus A, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401, 2020

  5. [5]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Gao Y, Xiong Y, Gao X, et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997, 2023

  6. [6]

    Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) for Enterprise Knowledge Management and Document Automation: A Systematic Literature Review

    Karakurt E, Akbulut A. Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) for Enterprise Knowledge Management and Document Automation: A Systematic Literature Review. Applied Sciences, 16(1):368, 2026. doi:10.3390/app16010368

  7. [7]

    SaulLM-7B: A pioneering Large Language Model for Law

    Colombo P, Pessoa Pires T, Boudiaf M, et al. SaulLM-7B: A pioneering Large Language Model for Law. arXiv:2403.03883, 2024

  8. [8]

    Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA

    Pu Y, He Z, Qiu T, et al. Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA. arXiv:2407.15353, 2024

  9. [9]

    VerilogEval: Evaluating Large Language Models for Verilog Code Generation

    Liu M, Pinckney N, Khailany B, et al. VerilogEval: Evaluating Large Language Models for Verilog Code Generation. arXiv:2309.07544, 2023

  10. [10]

    RAFT: Adapting Language Model to Domain Specific RAG

    Zhang T, Patil S G, Jain N, et al. RAFT: Adapting Language Model to Domain Specific RAG. arXiv:2403.10131, 2024

  11. [11]

    Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge

    Soudani H, Kanoulas E, Hasibi F. Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge. arXiv:2403.01432, 2024

  12. [12]

    iScript: A Domain-Adapted Large Language Model and Benchmark for Physical Design Tcl Script Generation

    Xu N, Zhang Z, Shu S, et al. iScript: A Domain-Adapted Large Language Model and Benchmark for Physical Design Tcl Script Generation. arXiv:2603.04476, 2026

  13. [13]

    Parameter-Efficient Transfer Learning for NLP

    Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research 97:2790–2799, 2019

  14. [14]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Li X L, Liang P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190, 2021

  15. [15]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu E J, Shen Y, Wallis P, et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, 2021

  16. [16]

    AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

    Zhang Q, Chen M, Bukharin A, et al. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. arXiv:2303.10512, 2023

  17. [17]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Dettmers T, Pagnoni A, Holtzman A, et al. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314, 2023

  18. [18]

    How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?

    Pletenev S, Marina M, Moskovskiy D, et al. How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4309–4322, 2025. doi:10.18653/v1/2025.findings-naacl.243

  19. [19]

    Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    Asai A, Wu Z, Wang Y, et al. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511, 2023

  20. [20]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Edge D, Trinh H, Cheng N, et al. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130, 2024

  21. [21]

    HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction

    Sarmah B, Hall B, Rao R, et al. HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction. arXiv:2408.04948, 2024

  22. [22]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Zheng L, Chiang W L, Sheng Y, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems 36 (Datasets and Benchmarks Track), 2023. arXiv:2306.05685

  23. [23]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Liu Y, Iter D, Xu Y, Wang S, Xu R, Zhu C. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2511–2522, 2023. doi:10.18653/v1/2023.emnlp-main.153

  24. [24]

    Instruction-Following Evaluation for Large Language Models

    Zhou J, Lu T, Mishra S, et al. Instruction-Following Evaluation for Large Language Models. arXiv:2311.07911, 2023

  25. [25]

    Measuring Short-Form Factuality in Large Language Models

    OpenAI. Measuring Short-Form Factuality in Large Language Models. 2024. Available at: https://cdn.openai.com/papers/simpleqa.pdf

  26. [26]

    Evaluating Large Language Models Trained on Code

    Chen M, Tworek J, Jun H, et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021

  27. [27]

    Qwen3 Technical Report

    Yang A, Li A, Yang B, et al. Qwen3 Technical Report. arXiv:2505.09388, 2025