pith. machine review for the scientific record.

arxiv: 2605.10865 · v2 · submitted 2026-05-11 · 💻 cs.AI · cs.CV · cs.SE

Recognition: no theorem link

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 02:47 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.SE
keywords CAD code generation · parametric design · industrial CAD · multimodal evaluation · program synthesis · 3D modeling benchmark · CadQuery programs · MLLM reasoning

The pith

BenchCAD shows that current multimodal models recover only coarse outer shapes in industrial CAD parts and fail to generate accurate parametric code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BenchCAD, a large benchmark of execution-verified CadQuery programs spanning 106 industrial part families such as gears and springs. It tests models on tasks requiring perception of 3D structure, inference of engineering parameters, and selection of appropriate CAD operations from images or instructions. Evaluation across more than 10 frontier models reveals they often simplify complex features like sweeps and lofts into basic extrusions while missing fine details. This matters because programmatic CAD enables parametric reuse and precise manufacturing automation. The benchmark supports fine-grained diagnosis of where models succeed or fail in turning inputs into executable programs.

Core claim

BenchCAD demonstrates that while models can approximate the visible outer geometry of industrial parts from visual or textual inputs, they consistently fail to produce executable parametric CAD programs that capture the full 3D structure, correct engineering parameters, and the specific sequence of design operations such as sweeps, lofts, and twist-extrudes.

What carries the argument

BenchCAD, a unified benchmark of 17,900 execution-verified CadQuery programs across 106 industrial part families, evaluated through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing.
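The four task categories imply a per-item schema pairing each part family with one evaluation mode. A hypothetical sketch of such a record; the field and task names here are illustrative, not the paper's actual schema:

```python
# Hypothetical BenchCAD-style record. Field names (part_family, task,
# program, prompt) are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass

# The four evaluation modes described in the benchmark.
TASKS = ("vision_qa", "code_qa", "image_to_code", "code_edit")

@dataclass
class BenchItem:
    part_family: str   # one of the 106 industrial families, e.g. "bevel_gear"
    task: str          # which of the four evaluation tasks this item serves
    program: str       # execution-verified CadQuery source (ground truth)
    prompt: str = ""   # image path, question text, or edit instruction

    def __post_init__(self):
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")

item = BenchItem("compression_spring", "image_to_code",
                 program="result = cq.Workplane('XY').cylinder(10, 2)")
```

A schema like this is what makes the "fine-grained diagnosis" claim operational: failures can be sliced by family and by task mode independently.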

If this is right

  • Models recover coarse outer geometry but miss fine 3D structure in industrial parts.
  • Essential operations such as sweeps, lofts, and twist-extrudes are replaced by simpler sketch-and-extrude patterns.
  • Industrial design parameters are frequently misinterpreted.
  • Fine-tuning and reinforcement learning improve performance on seen part families.
  • Generalization to unseen part families remains limited.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could guide creation of models that better handle parametric reuse in manufacturing workflows.
  • Observed gaps in operation selection point to a need for training that emphasizes engineering intent over surface appearance.
  • Limited generalization suggests that expanding the set of part families with more parameter variations would be a direct next test.

Load-bearing premise

The 106 selected industrial part families and their programs are representative of the full diversity and complexity of real-world industrial CAD tasks.

What would settle it

A model that produces correct, executable parametric programs matching the ground-truth CadQuery code for a diverse set of previously unseen part families from the benchmark, without replacing complex operations with simpler ones.
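Geometric agreement in BenchCAD is reported as IoU (see Figure 7). A minimal sketch of such a check, assuming both the predicted and ground-truth solids have already been voxelised into sets of occupied (i, j, k) cells; the voxelisation step itself is elided:

```python
# Minimal voxel-set IoU, assuming solids are pre-voxelised into sets of
# occupied integer cells. 1.0 means the geometries agree exactly.
def voxel_iou(pred: set, target: set) -> float:
    if not pred and not target:
        return 1.0
    return len(pred & target) / len(pred | target)

# Toy example: a 2x2x2 cube vs. the same cube missing one corner voxel.
cube = {(i, j, k) for i in range(2) for j in range(2) for k in range(2)}
chipped = cube - {(1, 1, 1)}
print(voxel_iou(cube, chipped))  # → 0.875
```

Note that a high IoU alone would not settle the question above: the criterion also requires the emitted program to use the correct operations, which a purely geometric score cannot see.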

Figures

Figures reproduced from arXiv: 2605.10865 by Cheng Peng, Hanjie Chen, Haozhe Zhang, Kaichen Liu, Lei Li, Miaomiao Chen, Shaojie Yang.

Figure 1
Figure 1. BenchCAD overview. BenchCAD is a unified, capability-decomposed evaluation framework for industrial CAD reasoning, consisting of 17,900 expert-verified parametric CadQuery parts (left) drawn from 106 industrial families spanning fasteners, transmission components, structural elements, fluid fittings, panels, hardware, and enclosures. The 7-category functional taxonomy (right) covers 49% of families anchor… view at source ↗
Figure 2
Figure 2. BenchCAD generation pipeline and task suite. Top: parts originate from industry-standard engineering designs (e.g., DIN 338 twist-drill cross-section), are realised as parameterised 3D geometry that respects standard parameter relations and physical priors, and are emitted as executable CadQuery code with verified geometry. Bottom: the four BenchCAD evaluation task categories operationalised on these parts… view at source ↗
Figure 3
Figure 3. BenchCAD-QA capability hierarchy. Four-level capabilities (L1 Holistic Visual Recognition → L4 Spatial/Code Reasoning) with paired VISION QA / CODE QA examples per level. Programs that fail to compile, exceed a 30 s runtime budget, or produce degenerate (zero or inverted) volume are quarantined (full failure-mode taxonomy in Appendix E). Every surviving render is then routed past a domain expert for visual sign-off, an… view at source ↗
Figure 4
Figure 4. Per-model performance across the four BenchCAD task categories (frontier proprietary subset; open-source baselines are reported in … view at source ↗
Figure 5
Figure 5. BenchCAD-Edit by task type. view at source ↗
Figure 6
Figure 6. Examples of failures in codegen. Zoom-in for more details. Operation understanding. A twisted bracket (Figure 6C) is generated as two mutually perpendicular brackets without twisted connection; the requisite twist-extrusion is absent from the emitted program entirely. The model recognizes the holistic spatial (L1) but fails to map the visible torsion to the corresponding CAD operation, exposing an Op Voca… view at source ↗
Figure 7
Figure 7. Figure 7: The generalisation gap on BenchCAD. Qwen3-VL-2B trained on three data mixtures, evaluated on the BenchCAD validation set throughout training. (a) OOD IoU vs. training step. The IID-trained run (green) climbs highest; the OOD run (red, trained without the held-out family slice) plateaus mid-range; the baseline (grey, no BenchCAD) stays near the floor. (b) OOD essential-op pass rate. IID reaches the highest … view at source ↗
Figure 8
Figure 8. Per-model failure-mode distribution on BenchCAD-Edit. Each horizontal bar disaggregates one model’s predictions into ok, exec_fail, and the eight semantic failure modes F01–F08 defined in … view at source ↗
Figure 9
Figure 9. Per-model failure-layer distribution on BenchCAD-Edit. Each pie shows one model’s failures aggregated by capability layer L1–L4 (collapsing the eight F-codes via the mapping in … view at source ↗
Figure 10
Figure 10. BenchCAD-Edit under three input protocols, by task type. Three protocols on the same 100-pair subset, four OpenAI models. text: original code + NL instruction (main bench, EDIT_CODE_SYSTEM_PROMPT). ablation: original code + NL instruction + a four-view render of the original part (EDIT_IMG_SYSTEM_PROMPT, App. L.4). image-only: original code + four-view render of the target solid, no NL instruction (EDIT_I… view at source ↗
Figure 11
Figure 11. Case 167: original vs. ground-truth target. The intended edit pierces every layer of the build chain. Ground truth (IoU = 1.0): result = (cq.Workplane("XY").cylinder(19.9, 69.6) # base flange .faces(">Z").workplane().hole(32.9) # original shaft hole .faces(">Z").workplane().circle(35.6).extrude(10.4) # upper boss (sits on flange) .edges(">Z").chamfer(1.1).edges("<Z").fille… view at source ↗
Figure 11. Idea: build the whole part first, then .cut a r=20 cylinder of height 100 after the build chain closes. The oversized height (100 ≫ 30) guarantees the cut pierces both the 19.9 mm flange and the 10.4 mm boss in one call. GPT-5.3 (IoU = 0.961): result = (cq.Workplane("XY").cylinder(19.9, 69.6).faces… view at source ↗
read the original abstract

Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of Multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces BenchCAD, a benchmark of 17,900 execution-verified CadQuery programs across 106 industrial part families (e.g., bevel gears, compression springs, twist drills). It supports four task types—visual question answering, code question answering, image-to-code generation, and instruction-guided editing—to assess multimodal models on perception, parametric abstraction, and executable program synthesis. Evaluation of 10+ frontier models shows they recover coarse outer geometry but fail to produce faithful parametric programs, with common errors including missing fine 3D structure, misinterpreting industrial parameters, and replacing sweeps/lofts/twist-extrudes by sketch-and-extrude patterns. Fine-tuning and RL improve in-distribution results, but generalization to unseen families is limited.

Significance. If the benchmark construction and evaluation protocols are validated, BenchCAD would constitute a substantial contribution by supplying the first large-scale, execution-verified industrial CAD benchmark that directly measures parametric program fidelity rather than surface geometry alone. The explicit cataloguing of failure modes (operation substitution, parameter misinterpretation) supplies concrete, falsifiable targets for future CAD automation work, and the provision of reproducible CadQuery programs is a clear methodological strength.

major comments (3)
  1. [§3] §3 (Dataset Construction): the claim that the 106 part families constitute an 'industry-standard' benchmark rests on unstated selection criteria and lacks any quantitative comparison (operation-type histograms, parameter-complexity distributions, or coverage statistics) against external manufacturing corpora; without this, the reported systematic failures could be artifacts of the chosen subset rather than general industrial behavior.
  2. [§4] §4 (Evaluation Protocol): the abstract and results sections report performance gaps and failure-mode statistics but supply no information on train/test splits, inter-annotator or execution-verification procedures, statistical significance tests, or post-hoc derivation of the listed failure categories; these omissions make it impossible to assess whether the central claim of 'limited generalization' is robust.
  3. [§5] §5 (Generalization Experiments): the statement that 'generalization to unseen part families remains limited' is load-bearing for the paper's positioning of BenchCAD, yet no quantitative breakdown is given of how 'unseen' families differ in operation distribution or complexity from the training families, weakening the evidential basis for the generalization conclusion.
minor comments (2)
  1. [Abstract] The abstract and introduction repeatedly use 'industry-standard' without a supporting definition or external reference; a brief clarification of the term would improve precision.
  2. [Figures/Tables] Figure captions and table headers should explicitly state the number of samples per family and the exact CadQuery version used for verification to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript on BenchCAD. The feedback highlights important areas for improving the clarity and rigor of our dataset construction, evaluation protocols, and generalization analysis. We address each major comment below and commit to making the necessary revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): the claim that the 106 part families constitute an 'industry-standard' benchmark rests on unstated selection criteria and lacks any quantitative comparison (operation-type histograms, parameter-complexity distributions, or coverage statistics) against external manufacturing corpora; without this, the reported systematic failures could be artifacts of the chosen subset rather than general industrial behavior.

    Authors: We agree that explicit selection criteria and quantitative comparisons would better support the 'industry-standard' positioning. The families were selected to represent a broad spectrum of industrial components drawn from engineering design handbooks and common manufacturing practices, ensuring coverage of diverse CAD operations and parametric variations. In the revised manuscript, we will expand §3 to include a clear statement of the selection criteria (e.g., inclusion of key operations like extrude, sweep, loft, revolve; range of part complexities; representation across industries such as mechanical, automotive). We will also add operation-type histograms, parameter-complexity distributions, and coverage statistics. While comprehensive external manufacturing corpora are not publicly available for direct comparison, we will reference and contrast with existing open CAD datasets to contextualize our benchmark. These changes will help demonstrate that the observed failures are not artifacts of our selection. revision: yes

  2. Referee: [§4] §4 (Evaluation Protocol): the abstract and results sections report performance gaps and failure-mode statistics but supply no information on train/test splits, inter-annotator or execution-verification procedures, statistical significance tests, or post-hoc derivation of the listed failure categories; these omissions make it impossible to assess whether the central claim of 'limited generalization' is robust.

    Authors: We acknowledge that the evaluation protocol details were insufficiently described. The train/test splits were designed at the part-family level, with 80% of families for training/fine-tuning and 20% held out for testing generalization, ensuring no family overlap. All programs were execution-verified by running them in the CadQuery environment and confirming successful 3D model generation without runtime errors. Failure categories were identified through systematic manual review of model-generated programs by the research team, focusing on discrepancies in operations, parameters, and structure. In the revision, we will add a new subsection in §4 detailing the splits, verification procedures, any inter-annotator processes (for failure categorization), statistical tests (e.g., paired t-tests or bootstrap confidence intervals for performance differences), and the post-hoc analysis method for deriving failure modes. This will allow readers to better evaluate the robustness of our findings on limited generalization. revision: yes

  3. Referee: [§5] §5 (Generalization Experiments): the statement that 'generalization to unseen part families remains limited' is load-bearing for the paper's positioning of BenchCAD, yet no quantitative breakdown is given of how 'unseen' families differ in operation distribution or complexity from the training families, weakening the evidential basis for the generalization conclusion.

    Authors: We agree that providing a quantitative comparison between seen and unseen families is essential to substantiate the generalization results. Currently, the unseen families were chosen to include variations in operation types and complexities not fully represented in the training set. In the revised §5, we will include a quantitative breakdown, such as tables showing the distribution of CAD operations (e.g., percentage of programs using sweeps vs. extrudes), average parameter counts, program lengths, and other complexity metrics for both training and unseen families. This analysis will highlight the distributional shift and support our conclusion that generalization remains limited despite in-distribution improvements from fine-tuning and RL. revision: yes
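The seen-vs-unseen comparison the authors promise in the response above could start from per-family operation histograms. A hedged sketch using total variation distance between normalised operation frequencies; the substring matching, operation list, and toy programs are illustrative only, not the paper's method:

```python
# Illustrative distribution-shift check between two sets of CAD programs.
# Operation names follow CadQuery's vocabulary, but the substring-based
# counting is a simplification (a real pipeline would parse the AST).
from collections import Counter

OPS = ("extrude", "revolve", "sweep", "loft", "twistExtrude")

def op_histogram(programs):
    """Normalised frequency of each operation across a list of programs."""
    counts = Counter(op for p in programs for op in OPS if op in p)
    total = sum(counts.values()) or 1
    return {op: counts[op] / total for op in OPS}

def tv_distance(h1, h2):
    """Total variation distance between two histograms over the same keys."""
    return 0.5 * sum(abs(h1[k] - h2[k]) for k in h1)

# Toy split: seen families lean on extrude/revolve, unseen on sweep/loft.
seen   = ["result.extrude(5)", "result.extrude(2)", "result.revolve()"]
unseen = ["result.sweep(path)", "result.loft()", "result.twistExtrude(10, 90)"]
print(tv_distance(op_histogram(seen), op_histogram(unseen)))
```

A large distance between the seen and unseen histograms would support the authors' claim of a genuine distributional shift; a small one would suggest the generalization gap lies elsewhere (e.g., in parameter ranges rather than operation mix).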

Circularity Check

0 steps flagged

No circularity: benchmark and empirical evaluations are self-contained

full rationale

The paper introduces a new dataset of 17,900 execution-verified CadQuery programs across 106 part families and reports direct model evaluations on VQA, code QA, image-to-code, and editing tasks. No equations, fitted parameters, predictions, or derivations are present; results are empirical observations on the newly constructed benchmark rather than reductions to prior inputs or self-citations. The central claims about model failure modes follow straightforwardly from running the models on the provided data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on the domain assumption that CadQuery programs can be automatically executed to verify correctness and that the chosen part families capture industrial practice; no free parameters or new invented entities are introduced.

axioms (2)
  • domain assumption CAD programs written in CadQuery can be executed to confirm they produce valid geometry
    The benchmark is built on execution-verified programs.
  • domain assumption The 106 part families are representative of industrial CAD usage
    The abstract positions the dataset as covering reusable engineering designs.
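The first axiom is mechanically checkable: run each program in a fresh interpreter under a runtime budget and treat any exception or timeout as a verification failure. A minimal sketch; the 30 s budget echoes the Figure 3 caption, and the geometry-validity checks (non-degenerate volume etc.), which would require CadQuery itself, are elided:

```python
# Minimal execution-verification harness: a program "verifies" if it runs
# to completion in a fresh interpreter within the budget. Geometry checks
# (valid, non-degenerate volume) would need CadQuery and are omitted here.
import subprocess
import sys

def executes_cleanly(program: str, budget_s: float = 30.0) -> bool:
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=budget_s,
        )
    except subprocess.TimeoutExpired:
        return False  # exceeded the runtime budget → quarantine
    return proc.returncode == 0  # nonzero exit → runtime error → quarantine

print(executes_cleanly("x = 1 + 1"))        # → True
print(executes_cleanly("raise ValueError")) # → False
```

Running in a subprocess rather than via exec() isolates the candidate program's failures (and runtime) from the harness itself, which matters when verifying thousands of model-generated programs.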

pith-pipeline@v0.9.0 · 5583 in / 1358 out tokens · 78191 ms · 2026-05-13T02:47:16.650114+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 4 internal anchors
