IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing
Pith reviewed 2026-06-27 06:33 UTC · model grok-4.3
The pith
IterCAD frames CAD generation as closed-loop multi-turn agent interaction with an executable sandbox to enable iterative refinement from drawings or text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IterCAD is a unified multimodal agent framework for closed-loop interactive CAD generation and editing formulated as multi-turn interaction between the agent and an executable CAD sandbox. The approach rests on a data synthesis pipeline that creates standard-compliant multi-view drawings, complex code-editing tasks, and high-fidelity trajectories, followed by progressive supervised fine-tuning and geometry-aware reinforcement learning with viable-prefix masking. Evaluation on the introduced IterCAD-Bench uses the Chamfer Distance Tolerance-Recall curve and its AUC-TR metric to show higher code executability, geometric precision, and iterative refinement ability than existing approaches.
What carries the argument
the closed-loop multi-turn interaction between a multimodal agent and an executable CAD sandbox, trained first by progressive SFT then by geometry-aware RL with viable-prefix masking
If this is right
- Code generated by the agent runs successfully at higher rates on CAD interpreters.
- Final shapes lie closer to target geometry across tolerance levels tracked by the CD-TR curve.
- Multi-turn editing sessions converge faster and with fewer invalid steps than open-loop baselines.
- The AUC-TR metric supplies a single scalar that jointly scores validity and precision without discarding invalid outputs.
Where Pith is reading between the lines
- The same sandbox-loop structure could be applied to other parametric modeling environments that expose an execution API.
- Adding sensor feedback from physical prototypes into the reward signal would turn the loop into a full digital-twin controller.
- Extending the synthesis pipeline to assemblies of multiple parts would test whether the agent can maintain consistency across linked components.
Load-bearing premise
The data synthesis pipeline produces drawings, editing tasks, and interaction trajectories that match the distribution and difficulty of real industrial CAD work.
What would settle it
A head-to-head test on a held-out collection of actual engineer CAD sessions in which IterCAD requires the same or more refinement turns than one-shot baselines to reach an executable design of target geometry.
Figures
read the original abstract
Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents IterCAD, a multimodal agent for closed-loop CAD generation and editing formulated as multi-turn interaction with an executable sandbox across Drawing-to-Code, Text-to-Code, and Interactive Editing tasks. It introduces a data synthesis pipeline using industrial manufacturing features to create standard-compliant multi-view drawings, code-editing tasks, and interaction trajectories; optimizes the agent via progressive supervised fine-tuning followed by geometry-aware reinforcement learning with viable-prefix masking; proposes the IterCAD-Bench suite together with the Chamfer Distance Tolerance-Recall (CD-TR) curve and AUC-TR metric; and reports that the resulting system significantly outperforms prior methods on code executability and geometric precision while showing stronger closed-loop iterative refinement.
Significance. If the performance claims hold under rigorous validation, the work would advance automated CAD by shifting from open-loop one-shot generation to interactive, closed-loop refinement that better matches manufacturing practice. The CD-TR/AUC-TR metric is a constructive contribution that avoids survivor bias by jointly penalizing invalid code and geometric deviation. The combination of SFT and geometry-aware RL with masking is a reasonable technical approach for improving executability.
major comments (2)
- [Abstract, §3] Abstract and §3 (Data Synthesis Pipeline): the headline claims of outperformance on executability, geometric precision, and iterative refinement rest entirely on IterCAD-Bench, which is generated by the described pipeline. No quantitative validation (feature-distribution statistics, tolerance usage, editing-sequence length distributions, or comparison to any real industrial CAD corpus) is supplied to establish that the synthetic data are representative; without this, the reported gains on CD-TR/AUC-TR and closed-loop metrics could be artifacts of the benchmark construction rather than genuine capability.
- [§4, Tables 2-3] §4 (Experiments) and Table 2/3: the paper states that IterCAD “significantly outperforming existing approaches,” yet the abstract and available description provide no ablation isolating the contribution of viable-prefix masking versus standard RL, nor any statistical significance tests or variance estimates across the multiple benchmarks; these omissions make it impossible to determine whether the claimed superiority is robust or sensitive to post-hoc choices in the synthetic data.
minor comments (2)
- [§4.2] Notation for the CD-TR curve and AUC-TR should be defined with an explicit equation (e.g., recall at tolerance τ) rather than left to prose, to allow direct reproduction.
- [Figure 4] Figure captions for the interaction trajectories should include the exact number of turns and the success criterion used in the closed-loop evaluation.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below and indicate planned revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract, §3] Abstract and §3 (Data Synthesis Pipeline): the headline claims of outperformance on executability, geometric precision, and iterative refinement rest entirely on IterCAD-Bench, which is generated by the described pipeline. No quantitative validation (feature-distribution statistics, tolerance usage, editing-sequence length distributions, or comparison to any real industrial CAD corpus) is supplied to establish that the synthetic data are representative; without this, the reported gains on CD-TR/AUC-TR and closed-loop metrics could be artifacts of the benchmark construction rather than genuine capability.
Authors: We agree that explicit quantitative validation of the synthetic data's alignment with real industrial distributions would strengthen the claims. The pipeline is designed around standard-compliant industrial manufacturing features (e.g., GD&T tolerances, multi-view projections, and feature-based modeling), but the current manuscript does not include feature-distribution histograms, tolerance-usage statistics, or sequence-length comparisons. In the revision we will add these analyses on the generated corpus and, where feasible, contrast them against publicly available CAD datasets to reduce the risk that performance gains are benchmark-specific artifacts. revision: yes
-
Referee: [§4, Tables 2-3] §4 (Experiments) and Table 2/3: the paper states that IterCAD “significantly outperforming existing approaches,” yet the abstract and available description provide no ablation isolating the contribution of viable-prefix masking versus standard RL, nor any statistical significance tests or variance estimates across the multiple benchmarks; these omissions make it impossible to determine whether the claimed superiority is robust or sensitive to post-hoc choices in the synthetic data.
Authors: We acknowledge the absence of these controls. The manuscript reports overall gains but does not isolate viable-prefix masking from standard RL nor supply run-to-run variance or significance tests. In the revised version we will add an ablation table comparing the full geometry-aware RL with viable-prefix masking against a standard RL baseline, together with standard deviations across multiple random seeds and paired statistical significance tests (e.g., Wilcoxon or t-tests) on the CD-TR/AUC-TR metrics. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper is an empirical system description with no equations or derivations. It reports performance on multiple benchmarks (including self-introduced IterCAD-Bench generated via the described pipeline) after SFT and RL training. No load-bearing step reduces claimed results to fitted parameters, self-citations, or inputs by construction. The work is self-contained against external benchmarks and does not match any enumerated circularity pattern.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Comparing 3d cad models: uses, methods, tools and perspectives.Computer-Aided Design and Applications, 9(6):771–794, 2012
Antoine Brière-Côté, Louis Rivest, and Roland Maranzana. Comparing 3d cad models: uses, methods, tools and perspectives.Computer-Aided Design and Applications, 9(6):771–794, 2012
2012
-
[2]
AU, Jeremy Wright, thebluedirt, Marcus Boyd, Lorenz, Innovations Technology Solutions, Hasan Yavuz ÖZDERYA, Bruno Agostini, Jojain, Michael Greminger, Seth Fischer, Justin Buchanan, cactrot, huskier, Ruben, iulianOnofrei (U-lee aan), Miguel Sánchez de León Peque, Martin Budden, Hecatron, Peter Boin, Wink Saville, Pavel M. Penev, Bryan Weissinger, M. Greys...
-
[3]
Haoyang Xie and Feng Ju. Text-to-cadquery: A new paradigm for cad generation with scalable large model capabilities.arXiv preprint arXiv:2505.06507, 2025
arXiv 2025
-
[4]
Cad translator: An effective drive for text to 3d parametric computer-aided design generative modeling
Xueyang Li, Yu Song, Yunzhong Lou, and Xiangdong Zhou. Cad translator: An effective drive for text to 3d parametric computer-aided design generative modeling. InProceedings of the 32nd ACM International Conference on Multimedia, pages 8461–8470, 2024
2024
-
[5]
Bo Yuan, Zelin Zhao, Petr Molodyk, Bin Hu, and Yongxin Chen. Clarify before you draw: Proactive agents for robust text-to-cad generation.arXiv preprint arXiv:2602.03045, 2026
Pith/arXiv arXiv 2026
-
[6]
Skexgen: Autoregressive generation of cad construction sequences with disentangled codebooks
Xiang Xu, Karl DD Willis, Joseph G Lambourne, Chin-Yi Cheng, Pradeep Kumar Jayaraman, and Yasutaka Furukawa. Skexgen: Autoregressive generation of cad construction sequences with disentangled codebooks. arXiv preprint arXiv:2207.04632, 2022
arXiv 2022
-
[7]
Jesse Barkley, Rumi Loghmani, and Amir Barati Farimani. Cadsmith: Multi-agent cad generation with programmatic geometric validation.arXiv preprint arXiv:2603.26512, 2026
arXiv 2026
-
[8]
Ke Niu, Haiyang Yu, Zhuofan Chen, Zhengtao Yao, Weitao Jia, Xiaodong Ge, Jingqun Tang, Benlei Cui, Bin Li, and Xiangyang Xue. Cme-cad: Heterogeneous collaborative multi-expert reinforcement learning for cad code generation.arXiv preprint arXiv:2512.23333, 2025
arXiv 2025
-
[9]
Cad-coder: Text-to-cad generation with chain-of-thought and geometric reward.Advances in Neural Information Processing Systems, 38:59765–59789, 2026
Yandong Guan, Xilin Wang, Ximing Xing, Jing Zhang, Dong Xu, and Qian Yu. Cad-coder: Text-to-cad generation with chain-of-thought and geometric reward.Advances in Neural Information Processing Systems, 38:59765–59789, 2026
2026
-
[10]
Maksim Kolodiazhnyi, Denis Tarasov, Dmitrii Zhemchuzhnikov, Alexander Nikulin, Ilya Zisman, Anna V orontsova, Anton Konushin, Vladislav Kurenkov, and Danila Rukhovich. cadrille: Multi-modal cad recon- struction with online reinforcement learning.arXiv preprint arXiv:2505.22914, 2025
arXiv 2025
-
[11]
Soslan Kabisov, Vsevolod Kirichuk, Andrey V olkov, Gennadii Savrasov, Marina Barannikov, Anton Konushin, Andrey Kuznetsov, and Dmitrii Zhemchuzhnikov. Cadreasoner: Iterative program editing for cad reverse engineering.arXiv preprint arXiv:2603.29847, 2026
arXiv 2026
-
[12]
Cad-judge: Toward efficient morphological grading and verification for text-to-cad generation
Zheyuan Zhou, Jiayi Han, Liang Du, Naiyu Fang, Lemiao Qiu, and Shuyou Zhang. Cad-judge: Toward efficient morphological grading and verification for text-to-cad generation. InICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1021–1025. IEEE, 2026
2026
-
[13]
Giorgio Giannone, Anna Clare Doris, Amin Heyrani Nobari, Kai Xu, Akash Srivastava, and Faez Ahmed. Gift: Bootstrapping image-to-cad program synthesis via geometric feedback.arXiv preprint arXiv:2603.27448, 2026
arXiv 2026
-
[14]
Text2cad: Generating sequential cad designs from beginner-to-expert level text prompts.Advances in Neural Information Processing Systems, 37:7552–7579, 2024
Mohammad S Khan, Sankalp Sinha, Talha U Sheikh, Didier Stricker, Sk A Ali, and Muhammad Z Afzal. Text2cad: Generating sequential cad designs from beginner-to-expert level text prompts.Advances in Neural Information Processing Systems, 37:7552–7579, 2024
2024
-
[15]
Deepcad: A deep generative network for computer-aided design models
Rundi Wu, Chang Xiao, and Changxi Zheng. Deepcad: A deep generative network for computer-aided design models. InProceedings of the IEEE/CVF international conference on computer vision, pages 6772–6782, 2021
2021
-
[16]
Secad-net: Self-supervised cad reconstruction by learning sketch-extrude operations
Pu Li, Jianwei Guo, Xiaopeng Zhang, and Dong-Ming Yan. Secad-net: Self-supervised cad reconstruction by learning sketch-extrude operations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16826, 2023. 10
2023
-
[17]
Inversecsg: Automatic conversion of 3d models to csg trees.ACM Transac- tions on Graphics (TOG), 37(6):1–16, 2018
Tao Du, Jeevana Priya Inala, Yewen Pu, Andrew Spielberg, Adriana Schulz, Daniela Rus, Armando Solar- Lezama, and Wojciech Matusik. Inversecsg: Automatic conversion of 3d models to csg trees.ACM Transac- tions on Graphics (TOG), 37(6):1–16, 2018
2018
-
[18]
Capri-net: Learning compact cad shapes with adaptive primitive assembly
Fenggen Yu, Zhiqin Chen, Manyi Li, Aditya Sanghi, Hooman Shayani, Ali Mahdavi-Amiri, and Hao Zhang. Capri-net: Learning compact cad shapes with adaptive primitive assembly. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11768–11778, 2022
2022
-
[19]
Solidgen: An autoregressive model for direct b-rep synthesis.arXiv preprint arXiv:2203.13944, 2022
Pradeep Kumar Jayaraman, Joseph G Lambourne, Nishkrit Desai, Karl DD Willis, Aditya Sanghi, and Nigel JW Morris. Solidgen: An autoregressive model for direct b-rep synthesis.arXiv preprint arXiv:2203.13944, 2022
arXiv 2022
-
[20]
Brepgen: A b-rep generative diffusion model with structured latent geometry.ACM Transactions on Graphics (TOG), 43(4):1–14, 2024
Xiang Xu, Joseph Lambourne, Pradeep Jayaraman, Zhengqing Wang, Karl Willis, and Yasutaka Furukawa. Brepgen: A b-rep generative diffusion model with structured latent geometry.ACM Transactions on Graphics (TOG), 43(4):1–14, 2024
2024
-
[21]
Cad-llama: leveraging large language models for computer-aided design parametric 3d model generation
Jiahao Li, Weijian Ma, Xueyang Li, Yunzhong Lou, Guichun Zhou, and Xiangdong Zhou. Cad-llama: leveraging large language models for computer-aided design parametric 3d model generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18563–18573, 2025
2025
-
[22]
Xueyang Li, Jiahao Li, Yu Song, Yunzhong Lou, and Xiangdong Zhou. Seek-cad: A self-refined generative modeling for 3d parametric cad using local inference via deepseek.arXiv preprint arXiv:2505.17702, 2025
arXiv 2025
-
[23]
From intent to execution: Multimodal chain-of-thought reinforcement learning for precise cad code generation
Ke Niu, Haiyang Yu, Zhuofan Chen, Mengyang Zhao, Teng Fu, Bin Li, and Xiangyang Xue. From intent to execution: Multimodal chain-of-thought reinforcement learning for precise cad code generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 8160–8167, 2026
2026
-
[24]
Fusion 360 gallery: A dataset and environment for programmatic cad construction from human design sequences.ACM Transactions on Graphics (TOG), 40(4):1–24, 2021
Karl DD Willis, Yewen Pu, Jieliang Luo, Hang Chu, Tao Du, Joseph G Lambourne, Armando Solar-Lezama, and Wojciech Matusik. Fusion 360 gallery: A dataset and environment for programmatic cad construction from human design sequences.ACM Transactions on Graphics (TOG), 40(4):1–24, 2021
2021
-
[25]
Yu Yuan, Shizhao Sun, Qi Liu, and Jiang Bian. Cad-editor: A locate-then-infill framework with automated training data synthesis for text-based cad editing.arXiv preprint arXiv:2502.03997, 2025
arXiv 2025
-
[26]
Jiyuan An, Jiachen Zhao, Fan Chen, Liner Yang, Zhenghao Liu, Hongyan Wang, Weihua An, Meishan Zhang, and Erhong Yang. Pr-cad: Progressive refinement for unified controllable and faithful text-to-cad generation with large language models.arXiv preprint arXiv:2604.19773, 2026
Pith/arXiv arXiv 2026
-
[27]
Fengxiao Fan, Jingzhe Ni, Xiaolong Yin, Sirui Wang, Xingyu Lu, Qiang Zou, Ruofeng Tong, Min Tang, and Peng Du. Caddesigner: Conceptual design of cad models based on general-purpose agent.arXiv preprint arXiv:2508.01031, 2025
Pith/arXiv arXiv 2025
-
[28]
Yifei Gong, Xing Wu, Wenda Liu, and Kang Tu. Toolcad: Exploring tool-using large language models in text-to-cad generation with reinforcement learning.arXiv preprint arXiv:2604.07960, 2026
Pith/arXiv arXiv 2026
-
[29]
Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Pith/arXiv arXiv 2025
-
[30]
Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025
Pith/arXiv arXiv 2025
-
[31]
Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv preprint arXiv:2512.24330, 2025
arXiv 2025
-
[32]
Swift: a scalable lightweight infrastructure for fine-tuning
Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025
2025
-
[33]
Qwen3.5: Accelerating productivity with native multimodal agents, February 2026
Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
2026
-
[34]
Generating cad code with vision-language models for 3d designs
Kamel Alrashedy, Pradyumna Tambwekar, Zulfiqar Haider Zaidi, Megan Langwasser, Wei Xu, and Matthew Gombolay. Generating cad code with vision-language models for 3d designs. InInternational Conference on Learning Representations, volume 2025, pages 52236–52262, 2025
2025
-
[35]
OpenAI. Gpt-5.https://openai.com/gpt-5, 2025. 11 Appendix Contents A Related Work 12 A.1 CAD Representations and Generation . . . . . . . . . . . . . . . . . . . . . . . . 12 A.2 Multi-Turn CAD Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 B CAD Pairs Construction 13 B.1 Drawing-Code Pairs. . . . . . . . . . . . . . . . . . . . ....
arXiv 2025
-
[36]
2.Text: modeling instructions, dimensional constraints, or edit requests
Technical Drawing Image: orthographic projections such as Front, Top, Side, and ISO views with dimensions. 2.Text: modeling instructions, dimensional constraints, or edit requests. 3.Existing Code: a CadQuery script that should be preserved or modified when possible. Objective.Create or edit a 3D model that satisfies the user request. Output Format.Always...
-
[37]
– Plan: define the origin, workplanes, sketch sequence, Boolean operations, and key dimen- sions
Structure the<thinking>process strictly based on the provided inputs: •If no feedback is provided, such as initial generation or completely new instructions: – Requirement Analysis: break down visual or textual inputs into CadQuery features. – Plan: define the origin, workplanes, sketch sequence, Boolean operations, and key dimen- sions. •If feedback or e...
-
[38]
Code Implementation Rules
If feedback explicitly confirms that the 3D model is correct with no remaining issues, briefly state the assessment in<thinking></thinking>, then output<DONE>. Code Implementation Rules. • Use Python as the programming language. • Use CadQuery withimport cadquery as cq. • Assign the final result to variabler. • If scaling is needed, define scale_factor an...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.