pith. machine review for the scientific record.

arxiv: 2604.16170 · v1 · submitted 2026-04-17 · 💻 cs.CV · cs.CE

Recognition: unknown

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

Matthew Bouchard, Toby Perrett, William McCarthy

Pith reviewed 2026-05-10 09:07 UTC · model grok-4.3

classification 💻 cs.CV cs.CE
keywords 3D CAD editing · multimodal instructions · benchmark dataset · foundation models · CAD workflows · human evaluation · model comparison

The pith

A benchmark built from real designer sessions shows foundation models trail experts by a wide margin on 3D CAD edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates neuralCAD-Edit, the first benchmark for 3D CAD model editing drawn directly from professional engineers. It records videos of ten designers working in CAD software, capturing their spoken instructions, pointing, and drawings as they specify changes to models. Leading foundation models are then compared to human experts performing the same edits, measured by both automatic scores and human acceptance rates. The results show a substantial gap: the strongest model's acceptance rate is 53 percentage points (absolute) below the experts'. The benchmark is positioned as a reference point for improving AI systems that handle authentic multimodal CAD editing tasks.

Core claim

The authors establish that realistic editing instructions, collected from expert CAD designers through direct video capture of their software interactions, yield a benchmark on which current foundation models achieve markedly lower human acceptance rates than the designers themselves, with the best model trailing by 53 percentage points in absolute terms.

What carries the argument

The central mechanism is the capture of multimodal editing requests by recording professional designers interacting with live CAD models, including speech, pointing, and drawing to convey precise changes.
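
The figure notes indicate that the spoken part of each request is transcribed automatically (Whisper), with oracle corrections by the original requestor. A minimal sketch of that transcription step, assuming the open-source openai-whisper package; the model size and the file name session_03.mp4 are hypothetical, not the authors' configuration.

```python
# Hedged sketch: transcribe the spoken portion of a captured CAD editing session.
# Assumes `pip install openai-whisper` and ffmpeg; file name and model size are placeholders.
import whisper

model = whisper.load_model("base")            # small general-purpose checkpoint
result = model.transcribe("session_03.mp4")   # Whisper extracts the audio track via ffmpeg

print(result["text"])                         # full transcript of the edit request
for seg in result["segments"]:                # timestamped snippets, as shown in the figures
    print(f'{seg["start"]:6.1f}s - {seg["end"]:6.1f}s  {seg["text"].strip()}')
```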

If this is right

  • AI approaches for 3D CAD editing can now be tested against authentic expert instructions rather than synthetic text prompts.
  • Progress in multimodal models can be tracked using both geometric metrics and direct human judgments of edit usefulness (one such geometric metric is sketched after this list).
  • The benchmark supplies a concrete target for closing the observed performance difference between models and experts.
  • Development of specialized CAD tools can focus on handling combined verbal, visual, and spatial cues from real workflows.
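
The paper's automatic metrics include geometric comparisons such as Chamfer distance, IoU, and DINOv2 similarity (Figures 8-9). A minimal sketch of a symmetric Chamfer distance between point clouds sampled from an AI-edited model and the expert's edit, assuming NumPy and SciPy; the point counts and random stand-in clouds are illustrative, not the paper's exact protocol.

```python
# Hedged sketch: symmetric Chamfer distance between two point clouds sampled
# from the AI-edited model and the expert's ground-truth edit.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pts_a: np.ndarray, pts_b: np.ndarray) -> float:
    """Mean nearest-neighbour distance in both directions (lower is better)."""
    d_ab, _ = cKDTree(pts_b).query(pts_a)   # each point in A to its nearest point in B
    d_ba, _ = cKDTree(pts_a).query(pts_b)   # each point in B to its nearest point in A
    return float(d_ab.mean() + d_ba.mean())

# Illustrative stand-ins for points sampled from the two output meshes.
rng = np.random.default_rng(0)
edited_pts    = rng.uniform(size=(2048, 3))
reference_pts = rng.uniform(size=(2048, 3))
print(f"Chamfer distance: {chamfer_distance(edited_pts, reference_pts):.4f}")
```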

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models may need explicit training on spatial reasoning and manufacturing constraints that designers implicitly apply during edits.
  • The gap could narrow if future systems incorporate interactive feedback loops similar to how designers refine their own changes.
  • Expanding the dataset across more CAD software platforms would test whether the current performance difference generalizes.

Load-bearing premise

That the editing requests gathered from ten professional designers are representative of typical expert CAD workflows, and that human acceptance trials reliably indicate practical edit quality.

What would settle it

A follow-up experiment in which models trained or evaluated on the benchmark data reach acceptance rates close to those of human experts when tested on new editing requests from additional designers.
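
One way to read "close to" quantitatively is to compare per-request acceptance rates between a model and the human baseline on the new requests. A minimal sketch using a two-proportion z-test, assuming SciPy; the counts are invented placeholders, not results from the paper.

```python
# Hedged sketch: test whether a model's acceptance rate on held-out requests
# is statistically distinguishable from the human experts' rate.
# Counts below are placeholders, not numbers from the paper.
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(acc_a, n_a, acc_b, n_b):
    p_a, p_b = acc_a / n_a, acc_b / n_b
    p_pool = (acc_a + acc_b) / (n_a + n_b)                  # pooled acceptance rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error under H0
    z = (p_a - p_b) / se
    return p_a - p_b, z, 2 * norm.sf(abs(z))                # gap, z-score, two-sided p

gap, z, p = two_proportion_ztest(acc_a=70, n_a=100,   # hypothetical model acceptances
                                 acc_b=90, n_b=100)   # hypothetical expert acceptances
print(f"absolute gap {gap:+.2f}, z = {z:.2f}, p = {p:.3g}")
```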

Figures

Figures reproduced from arXiv: 2604.16170 by Matthew Bouchard, Toby Perrett, William McCarthy.

Figure 1: A human CAD expert is given an input 3D CAD model, and gives a multimodal edit Request whilst interacting with the model. Edits are then carried out by the requestor, and an additional CAD expert, to produce Output CAD models. This example highlights that even precise requests can contain some ambiguity. neuralCAD-Edit captures four modality combinations for requests, and benchmarks the ability of AI approaches…
Figure 2: For a given CAD model, experts requested edits that they thought would take professional designers 2 (easy), 5 (medium), and 10 (hard) minutes to complete. Video frames with transcribed speech snippets from the request are shown. Both simple (top) and complex (bottom) models can be the source of simple and complex edits. All requests and edits are captured whilst participants are using Autodesk Fusion.
Figure 3: Example requests from each modality and selected viewport renders from the associated human edits. The text request shows the typed text and the input model. The three video modalities show snippets of transcribed speech and one video frame.
Figure 4: Request (a)-(b) and edit (c)-(d) statistics, grouped by modality (x-axis groups) and request difficulty: easy, medium, and hard.
Figure 5: Qualitative examples of edits performed by humans and AIs. Models particularly struggled with compositional visual reasoning, resulting in mistakes like adding spokes that don't connect to the main assembly, or adding propellers at the wrong set of corners. GPT 5.2 shows the most promise, successfully performing the edits in the bottom four rows. Video examples are provided on the project webpage.
Figure 6: Acceptance scores for all models, broken down by modality and difficulty.
Figure 7: Efficiency of humans, Claude, Gemini and GPT. Averages are shown as crosses. Humans complete their edits faster, but cost more. Illustrative pairs of requests (left images) and edits (right images) are highlighted to show the range of edit qualities. Text is a high-level summary of the request.
Figure 8: Time taken to perform easy, medium and hard edits. Humans clearly spent longer editing the harder the request, whereas this effect is much less pronounced for the best AI (GPT), and not observed for Gemini and Claude.
Figure 9: Spearman rank correlations between human ratings, VLM ratings (GPT 5.2), and automatic metrics. The qualitative panel shows an edit ("Remove this pipe") with high DINOv2 similarity but a low human rating.
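
Figure 9 reports Spearman rank correlations between human ratings, VLM ratings, and automatic metrics, and notes that appearance-level similarity can disagree with human judgments. A minimal sketch of that alignment check, assuming SciPy and hypothetical per-edit score arrays; the numbers are placeholders, not data from the paper.

```python
# Hedged sketch: rank-correlate an automatic metric with human ratings per edit,
# the alignment check summarised in Figure 9. Scores below are placeholders.
import numpy as np
from scipy.stats import spearmanr

human_ratings   = np.array([6, 2, 5, 7, 3, 4, 1, 6])   # 1-7 rubric scores (hypothetical)
dino_similarity = np.array([0.91, 0.88, 0.70, 0.95, 0.60, 0.74, 0.85, 0.80])

rho, pval = spearmanr(human_ratings, dino_similarity)
print(f"Spearman rho = {rho:.2f} (p = {pval:.2f})")
# A low rho would echo the "Remove this pipe" example: high DINOv2 similarity,
# low human rating.
```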
original abstract

We introduce neuralCAD-Edit, the first benchmark for editing 3D CAD models collected from expert CAD engineers. Instead of text conditioning as in prior works, we collect realistic CAD editing requests by capturing videos of professional designers, interacting directly with CAD models in CAD software, while talking, pointing and drawing. We recruited ten consenting designers to contribute to this contained study. We benchmark leading foundation models against human CAD experts carrying out edits, and find a large performance gap in both automatic metrics and human evaluations. Even the best foundation model (GPT 5.2) scores 53% lower (absolute) than CAD experts in human acceptance trials, demonstrating the challenge of neuralCAD-Edit. We hope neuralCAD-Edit will provide a solid foundation against which 3D CAD editing approaches and foundation models can be developed. Code/data: https://autodeskailab.github.io/neuralCAD-Edit

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces neuralCAD-Edit, the first benchmark for multimodal-instructed 3D CAD model editing. Editing requests are collected from ten professional designers via videos of direct interactions in CAD software (including speech, pointing, and drawing). Leading foundation models are benchmarked against human CAD experts on the same tasks, with the key result that the best model (GPT 5.2) scores 53% lower (absolute) than experts in human acceptance trials. The work positions the benchmark and released data/code as a foundation for future 3D CAD editing research.

Significance. If the collected requests and evaluation protocols hold, the benchmark would be a useful contribution to multimodal 3D modeling and computer vision by moving beyond synthetic or text-only instructions to real expert workflows. The direct head-to-head comparison with human experts provides a concrete performance target. Releasing the data and code is a positive step for reproducibility in this domain.

major comments (2)
  1. [§3 (Data Collection)] The central claim of a 53% performance gap and the benchmark's utility rest on the editing requests being realistic and representative of expert CAD workflows. The manuscript describes a 'contained study' with ten consenting designers but provides no details on designer selection criteria, years of experience, industry domains, CAD software diversity, model complexity distribution, or quantitative breakdown of edit types (e.g., parametric changes vs. topology edits). This absence directly affects whether the gap generalizes.
  2. [§5 (Evaluation)] The 53% absolute gap in human acceptance trials is the primary quantitative result. However, the manuscript lacks complete specification of the automatic metrics (definition, computation, and any validation against ground truth), the human trial protocol (number of evaluators per edit, blinding procedures, acceptance criteria, and inter-rater reliability), and how edits were validated as successful. These details are required to interpret and reproduce the reported gap.
minor comments (2)
  1. [Abstract] 'GPT 5.2' is referenced without clarification of whether it corresponds to a publicly available model version or an internal variant; this should be aligned with the models listed in the experimental section.
  2. [Figures and Tables] Ensure all automatic metric definitions and human evaluation scales are explicitly stated in captions or legends so readers can interpret results without cross-referencing the main text.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive comments on our manuscript introducing neuralCAD-Edit. We address each major comment point-by-point below, indicating planned revisions where applicable.

point-by-point responses
  1. Referee: [§3 (Data Collection)] The central claim of a 53% performance gap and the benchmark's utility rest on the editing requests being realistic and representative of expert CAD workflows. The manuscript describes a 'contained study' with ten consenting designers but provides no details on designer selection criteria, years of experience, industry domains, CAD software diversity, model complexity distribution, or quantitative breakdown of edit types (e.g., parametric changes vs. topology edits). This absence directly affects whether the gap generalizes.

    Authors: We agree that providing more context on the data collection process would help readers evaluate the representativeness of the benchmark. In the revised manuscript, we will expand §3 to include a quantitative breakdown of edit types (e.g., counts of parametric changes versus topology edits) and details on the CAD software used, as these were recorded. We will also add a limitations subsection discussing the contained nature of the study and its implications for generalizability. However, detailed information on designer selection criteria, years of experience, and industry domains was not collected to respect participant privacy and consent agreements, so we are unable to provide a full breakdown of these aspects. revision: partial

  2. Referee: [§5 (Evaluation)] The 53% absolute gap in human acceptance trials is the primary quantitative result. However, the manuscript lacks complete specification of the automatic metrics (definition, computation, and any validation against ground truth), the human trial protocol (number of evaluators per edit, blinding procedures, acceptance criteria, and inter-rater reliability), and how edits were validated as successful. These details are required to interpret and reproduce the reported gap.

    Authors: We acknowledge that the evaluation section requires more detailed specification to ensure reproducibility. In the revised manuscript, we will substantially expand §5 with complete definitions and computation procedures for the automatic metrics, including any ground-truth validations performed. We will also fully describe the human evaluation protocol, including the number of evaluators per edit, blinding procedures, specific acceptance criteria, inter-rater reliability statistics (such as Fleiss' kappa), and the validation process for successful edits. All of this information is available from our experimental records and will be documented clearly. revision: yes
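
The promised inter-rater reliability statistic (Fleiss' kappa) can be computed directly from per-edit rating counts. A minimal sketch, assuming the statsmodels implementation and a hypothetical matrix of accept/reject counts from three raters per edit; it is not the authors' evaluation code.

```python
# Hedged sketch: Fleiss' kappa over accept/reject decisions for each edit.
# Each row is one edit; columns count raters choosing [reject, accept].
# Counts are placeholders, not data from the paper.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

ratings = np.array([
    [0, 3],   # all three raters accepted this edit
    [3, 0],   # all three rejected
    [1, 2],
    [2, 1],
    [0, 3],
])
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.2f}")
```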

standing simulated objections not resolved
  • Detailed designer selection criteria, years of experience, industry domains, CAD software diversity, and model complexity distribution remain unreported, because these specifics were not systematically recorded during the contained study.

Circularity Check

0 steps flagged

No circularity: empirical benchmark via new data collection

full rationale

The paper introduces neuralCAD-Edit as a benchmark constructed from multimodal editing requests captured from ten professional CAD designers, then measures foundation model performance against human experts using automatic metrics and human acceptance trials. No derivations, equations, fitted parameters, or predictions appear in the provided text. Central claims rest on direct empirical comparison to external human baselines rather than any self-referential reduction, self-citation chain, or ansatz. The work is self-contained data collection and evaluation with no load-bearing steps that collapse to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the collected multimodal requests and the reliability of human evaluation rather than mathematical derivations or new postulated entities.

axioms (1)
  • domain assumption: Human evaluators can reliably judge the quality and acceptability of CAD model edits.
    Invoked to support the human acceptance trials that establish the performance gap.

pith-pipeline@v0.9.0 · 5460 in / 1166 out tokens · 46634 ms · 2026-05-10T09:07:28.219051+00:00 · methodology

