pith. machine review for the scientific record.

arxiv: 2605.05773 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:25 UTC · model grok-4.3

classification 💻 cs.AI
keywords analog circuit design · netlist generation · language model · circuit tokenizer · topology synthesis · EDA automation · transformer model

The pith

A compact transformer model generates correct analog circuit netlists directly from natural language prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that analog circuit design can be automated by training a language model on pairs of natural language descriptions and corresponding netlists. The key step is a tokenizer that represents circuits by identifying common substructures, keeping the vocabulary size fixed no matter how large or complex the circuit becomes. This setup lets a relatively small model produce valid designs across many analog circuit classes. A reader would care because manual analog design is slow and expert-dependent, so reliable text-to-netlist generation could speed up electronics development. If the claim holds, it shows that domain-specific changes to how models process structured data can overcome the limits of general-purpose approaches.

Core claim

CircuitFormer is an encoder-decoder transformer with 511 million parameters, trained on 31,341 netlist-description pairs. It uses the Circuit Tokenizer (CKT), which mines frequent subcircuits to encode netlist connectivity with a vocabulary whose size stays constant rather than growing with circuit size. The model reaches 100 percent syntactic correctness and an 83 percent functional success rate across major analog circuit categories, outperforming larger open-source models while using 240 times fewer parameters.

What carries the argument

Circuit Tokenizer (CKT): a tokenization method that encodes netlist connectivity by explicitly mining frequent subcircuits, producing a fixed vocabulary of 512 tokens and shorter sequences than standard byte-pair encoding.
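
As a rough illustration of the mechanism this bullet describes, the sketch below combines two ingredients attributed to CKT: instance names are dropped in favor of device-type symbols with canonicalized net indices, and symbol patterns that recur across the corpus are merged into composite "subcircuit" tokens. The toy parser, the greedy adjacent-pair merge, and all function names are assumptions for illustration only; the paper's actual mining and serialization algorithm is not described in this review.

```python
# Illustrative sketch of CKT-style tokenization on a simplified SPICE-like format.
from collections import Counter

def netlist_to_symbols(netlist: str) -> list[str]:
    """Map netlist lines to device-type symbols with canonical net indices.
    Instance names (R1, M3, ...) are dropped and nets are renumbered in order
    of first appearance, so the base vocabulary does not grow with circuit size."""
    net_ids: dict[str, int] = {}
    symbols = []
    for line in netlist.strip().splitlines():
        parts = line.split()
        if not parts or parts[0][0] in "*.":
            continue  # skip comments and directives in this toy parser
        dev = parts[0][0].upper()                     # 'M', 'R', 'C', ...
        nets = parts[1:1 + (4 if dev == "M" else 2)]  # MOSFETs: 4 terminals, others: 2
        ids = [net_ids.setdefault(n, len(net_ids)) for n in nets]
        symbols.append(f"<{dev}>" + "-".join(map(str, ids)))
    return symbols

def mine_frequent_pairs(corpus, min_count=2):
    """Count adjacent symbol pairs across the corpus; frequent pairs become
    composite 'subcircuit' tokens (a crude stand-in for subgraph mining)."""
    counts = Counter()
    for seq in corpus:
        counts.update(zip(seq, seq[1:]))
    return {pair for pair, c in counts.items() if c >= min_count}

def encode(seq, merges):
    """Greedily merge frequent adjacent pairs into single composite tokens."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) in merges:
            out.append(seq[i] + "&" + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Toy corpus: two netlists that share the same local structure.
netlists = [
    "M1 out in 0 0 NMOS\nM2 out bias vdd vdd PMOS\nR1 out 0 10k",
    "M3 y x 0 0 NMOS\nM4 y bias vdd vdd PMOS\nR2 y 0 10k",
]
corpus = [netlist_to_symbols(n) for n in netlists]
merges = mine_frequent_pairs(corpus)
print(encode(corpus[0], merges))  # the repeated two-transistor pattern collapses into one token
```

On the two toy netlists, the shared two-transistor pattern collapses into a single composite token; scaled up to mined subcircuits capped at a 512-entry vocabulary, that is the effect the review credits for shorter sequences than BPE produces.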

If this is right

  • Analog circuit topologies can be produced directly from high-level text specifications without manual schematic entry.
  • Domain-adapted tokenization allows smaller models to handle graph-structured data like circuits more efficiently than general tokenizers.
  • Vocabulary size stays constant as circuit complexity grows, removing a scaling barrier for larger designs (a toy illustration follows this list).
  • The public dataset supports further training or fine-tuning for additional circuit classes.
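
A toy comparison of the scaling behavior mentioned in the third bullet above, assuming a prior serialization that keeps every instance name as its own token versus a CKT-style scheme that abstracts instances to device types; the functions are hypothetical and only mirror the O(n_max)-versus-O(1) argument, not any specific baseline.

```python
# Hypothetical comparison: vocabulary needed by an instance-name serialization
# versus a device-type abstraction, as circuit size grows.
def vocab_with_instance_names(num_components: int) -> int:
    # every instance name (R1..Rn, M1..Mn) becomes its own token
    tokens = {f"R{i}" for i in range(1, num_components + 1)}
    tokens |= {f"M{i}" for i in range(1, num_components + 1)}
    return len(tokens)

def vocab_with_type_abstraction(num_components: int) -> int:
    # all resistors map to <R> and all MOSFETs to <M>, whatever the circuit size
    return len({"<R>", "<M>"})

for n in (10, 100, 1000):
    print(n, vocab_with_instance_names(n), vocab_with_type_abstraction(n))
# the middle count grows linearly with n; the last one stays fixed
```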

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding the model in design tools could let engineers iterate by refining prompts based on quick simulation results.
  • Subcircuit-based tokenization may transfer to other graph domains such as molecular structures or network topologies.
  • The released dataset invites community extensions that add performance targets or layout constraints to the generation process.

Load-bearing premise

Automated simulation checks accurately determine whether a generated netlist will exhibit the intended analog behavior in practice.
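
The paper's verification protocol is not described in this review (the referee report below flags exactly that gap), so the following is only a guess at what an automated functional check might look like: run a generated netlist through ngspice in batch mode (assumed to be installed) and fail on any reported error. The example netlist, file names, and the error heuristic are illustrative; a protocol strong enough to support this premise would also have to compare measured gain, bandwidth, power, or stability against the prompt's specification rather than merely confirm that the simulation runs.

```python
# A hedged sketch of a simulation-based check, not the authors' method.
import pathlib
import subprocess
import tempfile

NETLIST = """* toy generated CMOS inverter (illustrative, not from the paper)
.model NMOS NMOS
.model PMOS PMOS
Vdd vdd 0 1.8
Vin in 0 0.9
M1 out in 0 0 NMOS W=1u L=0.18u
M2 out in vdd vdd PMOS W=2u L=0.18u
.op
.end
"""

def passes_basic_check(netlist: str) -> bool:
    """Run ngspice in batch mode and report whether it finishes without an
    error message. This only establishes that the netlist simulates, which is
    a much weaker property than meeting the behavioral spec in the prompt."""
    with tempfile.TemporaryDirectory() as tmp:
        cir = pathlib.Path(tmp) / "generated.cir"
        log = pathlib.Path(tmp) / "generated.log"
        cir.write_text(netlist)
        proc = subprocess.run(
            ["ngspice", "-b", "-o", str(log), str(cir)],
            capture_output=True, text=True,
        )
        text = log.read_text() if log.exists() else proc.stdout
        return proc.returncode == 0 and "Error" not in text

if __name__ == "__main__":
    print(passes_basic_check(NETLIST))
```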

What would settle it

Independent full analog simulations or expert review of a broad sample of generated netlists showing functional success well below 83 percent, especially in categories involving noise, frequency response, or stability.

Figures

Figures reproduced from arXiv: 2605.05773 by Farimah Farahmandi, Mark Tehranipoor, Md Touhidul Islam, Sujan Kumar Saha.

Figure 1. Subcircuit mining and tokenization process.
Figure 2. The fully automated dataset making pipeline.
Figure 3. The overall Transformer architecture used.
Figure 4. Increasing the vocabulary size via subgraph mining directly correlates with higher compression.
Figure 5. Impact of vocabulary size on generation quality. While validity remains robust at 100%, success rate peaks at |V| = 512, highlighting the trade-off between sequence compression and token sparsity.
Figure 6. CircuitFormer generating a ring oscillator from a natural language prompt. (a) The input user prompt. (b) The corresponding valid CMOS topology generated by CircuitFormer.
Figure 7. CircuitFormer generating a differential amplifier from a natural language prompt. (a) The input user prompt. (b) The corresponding valid topology generated by CircuitFormer.
Figure 8. The system prompt used for the VLM to extract netlists from schematic images.
Figure 9. How the dataset construction pipeline generated a correct netlist and description (a differential amplifier with current mirror biasing) from a given circuit schematic image. (a) Circuit schematic. (b) Source page for context.
Figure 10. Distribution of circuit elements across the full dataset.
Figure 11. Histogram of netlist sizes in the dataset. Frequency axis is logarithmic.
Original abstract

Automating analog circuit design remains a longstanding challenge in Electronic Design Automation (EDA). While Transformer-based Large Language Models (LLMs) have revolutionized software code generation, their application to analog hardware design is hindered by two critical limitations: (i) the scarcity of analog design datasets containing natural language description of a design and its corresponding netlist, and (ii) the inefficiency of general-purpose tokenizers (e.g., Byte Pair Encoding (BPE)) in capturing the inherent graph structure of circuits. To bridge this gap, first, we curate the largest annotated dataset of analog circuit netlists to date, comprising 31,341 netlist-natural language description pairs across all major circuit classes. Furthermore, we propose Circuit Tokenizer (CKT), a novel circuit graph tokenizer designed to encode netlist connectivity by explicitly mining frequent subcircuits. In terms of scalability, CKT overcomes the bottleneck of prior circuit graph serialization methods where vocabulary size scales linearly with maximum number of components in the dataset, n_max, (O(n_max)); instead, CKT decouples vocabulary growth from circuit complexity, achieving a constant O(1) complexity. Empirically, CKT outperforms standard BPE on circuit topology representation, reducing sequence length by 57% and achieving a 2.3x superior compression ratio using a compact, fixed vocabulary of size 512. Leveraging this optimized tokenization, we train a circuit-specific language model, CircuitFormer, a 511M parameter encoder-decoder transformer. Our model achieves 100% syntactic correctness and an 83% functional success rate across all major analog circuit categories, outperforming state-of-the-art open-source LLMs by 10% and 14%, respectively, while requiring 240x fewer parameters. The dataset is publicly available at https://huggingface.co/datasets/touhid314/cktformer-dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper curates a dataset of 31,341 analog netlist-natural language description pairs, introduces the Circuit Tokenizer (CKT) that mines frequent subcircuits to achieve fixed vocabulary size 512 and O(1) complexity (vs. O(n_max) for prior methods), and trains CircuitFormer, a 511M-parameter encoder-decoder Transformer. It reports that the model attains 100% syntactic correctness and 83% functional success rate across major analog circuit categories, outperforming state-of-the-art open-source LLMs by 10% and 14% respectively while using 240x fewer parameters. The dataset is released publicly.

Significance. If the functional success metric is shown to correspond to verified analog performance via simulation, the work offers a parameter-efficient route to language-driven analog topology generation and supplies a useful public resource. The tokenizer's decoupling of vocabulary growth from circuit size is a clear technical contribution that could generalize to other graph-structured domains.

major comments (2)
  1. [Abstract / Evaluation] Abstract and Evaluation sections: The headline 83% functional success rate is load-bearing for the central claim yet the manuscript provides no description of how functional correctness is assessed (e.g., whether SPICE simulations confirm target specifications such as gain, bandwidth, stability margins, or power, versus connectivity or heuristic rule checks). Without this protocol the reported rate cannot be interpreted as evidence that usable analog circuits are produced.
  2. [Results] Results / Comparison: The 240x parameter reduction and 10%/14% outperformance claims require explicit identification of the compared open-source LLMs, their exact parameter counts, and confirmation that all models were evaluated on the identical held-out test set and task definition.
minor comments (1)
  1. [Dataset] Dataset section: Clarify the process used to generate or annotate the natural-language descriptions to allow readers to assess potential biases in prompt style or circuit complexity distribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below and will revise the manuscript to improve clarity and completeness on the requested aspects.

point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation sections: The headline 83% functional success rate is load-bearing for the central claim yet the manuscript provides no description of how functional correctness is assessed (e.g., whether SPICE simulations confirm target specifications such as gain, bandwidth, stability margins, or power, versus connectivity or heuristic rule checks). Without this protocol the reported rate cannot be interpreted as evidence that usable analog circuits are produced.

    Authors: We agree that a precise description of the functional success assessment protocol is necessary to support interpretation of the 83% rate as evidence of usable analog circuits. We will revise the Evaluation section to explicitly detail the protocol, including the criteria used for functional verification (such as any combination of SPICE-based simulation checks for specifications like gain, bandwidth, stability margins, and power, versus connectivity or heuristic rule-based validation). This addition will ensure the metric is fully transparent and interpretable. revision: yes

  2. Referee: [Results] Results / Comparison: The 240x parameter reduction and 10%/14% outperformance claims require explicit identification of the compared open-source LLMs, their exact parameter counts, and confirmation that all models were evaluated on the identical held-out test set and task definition.

    Authors: We acknowledge that the comparison claims would benefit from greater explicitness. We will revise the Results section to name the specific open-source LLMs used in the benchmarks, report their exact parameter counts, and add an explicit statement confirming that all models (including CircuitFormer) were evaluated on the identical held-out test set using the same task definition. This will fully substantiate the reported outperformance and parameter-efficiency figures. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML training and evaluation on held-out data

full rationale

The paper curates a dataset of 31,341 netlist-description pairs, introduces a frequency-based circuit tokenizer with fixed vocabulary, trains a 511M-parameter encoder-decoder model, and reports syntactic/functional success rates plus compression metrics on (presumably held-out) test circuits. All quantitative claims are direct empirical measurements from training and inference; no equations, predictions, or uniqueness theorems reduce to self-defined quantities, fitted inputs renamed as outputs, or load-bearing self-citations. The derivation chain consists of standard data curation, tokenization, supervised training, and benchmark evaluation against external LLMs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that frequent subcircuits form a sufficient basis for representing arbitrary analog topologies and that functional correctness can be assessed automatically without exhaustive simulation.

free parameters (1)
  • vocabulary size = 512
    Fixed at 512 as a compact constant independent of maximum circuit size.
axioms (1)
  • domain assumption: Netlist connectivity can be losslessly encoded by mining and tokenizing frequent subcircuits.
    Invoked to justify the tokenizer design and its claimed O(1) complexity.
invented entities (1)
  • Circuit Tokenizer (CKT) no independent evidence
    purpose: To encode circuit netlists by explicit frequent subcircuit mining.
    New component introduced to overcome limitations of BPE on graph-structured data.

pith-pipeline@v0.9.0 · 5652 in / 1438 out tokens · 50188 ms · 2026-05-08T11:25:59.968538+00:00 · methodology

discussion (0)

