Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis
Pith reviewed 2026-05-10 06:11 UTC · model grok-4.3
The pith
Shape is a self-supervised model that produces dense 3D embeddings from CAD meshes using masked token reconstruction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Shape establishes that self-supervised pretraining, combining masked-token reconstruction of normalized geometry statistics with multi-resolution contrastive consistency, allows a transformer-based model to learn embeddings from CAD meshes that generalize well, as evidenced by strong reconstruction and retrieval performance on held-out data with little overfitting.
What carries the argument
Masked token reconstruction of normalized geometry statistics combined with multi-resolution contrastive consistency on a transformer backbone that includes a multi-scale geometry-aware tokenizer.
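The two objectives named above can be sketched concretely. This is an illustrative NumPy sketch, not the released implementation: the masking ratio, the Huber/smooth-L1 form, the InfoNCE temperature, and the equal loss weighting are all assumptions, and the identity "model" in the usage example is a stand-in for the transformer backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(tokens, predict, mask_ratio=0.4):
    """Mask a random subset of tokens and score the model's
    reconstruction of the masked targets with a smooth-L1 penalty."""
    n, d = tokens.shape
    mask = rng.random(n) < mask_ratio
    pred = predict(np.where(mask[:, None], 0.0, tokens))  # masked input
    diff = pred[mask] - tokens[mask]
    # smooth-L1 (Huber, beta=1): quadratic near zero, linear in the tails
    loss = np.where(np.abs(diff) < 1.0, 0.5 * diff**2, np.abs(diff) - 0.5)
    return loss.mean()

def contrastive_consistency_loss(z_a, z_b, tau=0.07):
    """InfoNCE between embeddings of two resolutions of the same mesh:
    row i of z_a should match row i of z_b against all other rows."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# toy usage: an identity "model" and two noisy views of one embedding batch
tokens = rng.normal(size=(16, 8))
rec = masked_reconstruction_loss(tokens, predict=lambda x: x)
z = rng.normal(size=(4, 8))
con = contrastive_consistency_loss(z, z + 0.01 * rng.normal(size=(4, 8)))
total = rec + con  # equal weighting is an assumption
```

Both terms act on the same token embeddings: the reconstruction term forces local geometric fidelity, while the contrastive term ties together coarse and fine views of the same shape.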
Load-bearing premise
That the combination of masked reconstruction and contrastive consistency on the chosen CAD datasets produces embeddings generalizing robustly to unseen industrial CAD analysis tasks.
What would settle it
Evaluating the model on a new collection of industrial CAD meshes from a different source and observing significantly lower reconstruction accuracy or retrieval performance than reported would falsify the generalization claim.
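The falsification test reduces to recomputing two numbers on a new mesh collection. Below is a minimal sketch of both metrics, R² over reconstructed targets and top-1 retrieval by cosine similarity in the style of the Wang-Isola protocol; the function names are illustrative and not taken from the released code.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination over all target entries."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def top1_retrieval(queries, gallery):
    """Fraction of queries whose nearest gallery embedding (cosine
    similarity) is the paired embedding at the same index."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    nearest = np.argmax(q @ g.T, axis=1)
    return np.mean(nearest == np.arange(len(queries)))

# sanity checks on synthetic data
rng = np.random.default_rng(1)
y = rng.normal(size=(100, 3))
print(r2_score(y, y + 0.1 * rng.normal(size=y.shape)))  # close to 1
z = rng.normal(size=(50, 16))
print(top1_retrieval(z, z))  # 1.0: each query retrieves itself
```

On a genuinely out-of-distribution mesh collection, a sharp drop in either number relative to the reported R² = 0.729 and 98.1% top-1 would undercut the generalization claim.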
Original abstract
Industrial CAD workflows require robust, generalizable 3D geometric representations supporting accuracy and explainability. We introduce Shape, a self-supervised foundation model converting surface meshes into dense per-token embeddings. Shape combines a structured 3D latent grid, a multi-scale geometry-aware tokenizer (MAGNO) with cross-attention, and a transformer processor using grouped-query attention and RMSNorm. A learned reconstruction prior enables per-region attribution for explainable predictions. Pretraining uses masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency. The 10.9M-parameter backbone is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360. On a held-out split of 2,983 meshes, Shape achieves reconstruction R2 = 0.729 and 98.1% top-1 retrieval under the Wang-Isola protocol, with near-zero reconstruction train/val gap (contrastive scores use a larger evaluation pool). A 2x2 ablation on loss type and target-space normalization shows per-dimension normalization is critical: without it, performance collapses (R2 < 0.14, top-1 < 88%); with it, both losses succeed (R2 > 0.70, top-1 > 96%). Smooth-L1 offers secondary stability. Code, embeddings, and an interactive demo are released at https://github.com/simd-ai/shape.
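The ablation's headline finding, that per-dimension normalization of the target space is critical, comes down to putting heterogeneous geometry statistics on comparable scales before the loss is applied. A minimal sketch, assuming a simple per-dimension z-score fit on the training set; the paper's actual normalizer may differ, and the example statistics (face areas, dihedral angles) are hypothetical stand-ins.

```python
import numpy as np

def fit_per_dim_normalizer(targets, eps=1e-8):
    """Compute per-dimension mean and std over the training targets."""
    mu = targets.mean(axis=0)
    sigma = targets.std(axis=0) + eps
    return mu, sigma

def normalize(targets, mu, sigma):
    return (targets - mu) / sigma

# toy geometry statistics with wildly different natural scales:
# column 0 ~ face areas (order 1e-4), column 1 ~ dihedral angles (radians)
rng = np.random.default_rng(2)
stats = np.stack([1e-4 * rng.random(1000), np.pi * rng.random(1000)], axis=1)
mu, sigma = fit_per_dim_normalizer(stats)
z = normalize(stats, mu, sigma)
# without normalization, an L2 or smooth-L1 loss is dominated by the
# large-scale dimension; after z-scoring, each dimension has unit std
# and contributes comparably to the gradient
```

This scale argument is consistent with the reported collapse (R2 < 0.14) when normalization is removed: the loss effectively ignores the small-scale target dimensions.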
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Shape, a self-supervised 3D geometry foundation model that converts surface meshes into dense per-token embeddings using a structured 3D latent grid, the MAGNO tokenizer with cross-attention, and a transformer processor. It is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360 via masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency. On a held-out split of 2,983 meshes, it achieves reconstruction R² = 0.729 and 98.1% top-1 retrieval, with an ablation showing per-dimension normalization is critical for performance.
Significance. If the embeddings generalize beyond the proxy tasks, Shape could serve as a useful foundation model for industrial CAD analysis, supported by the released code, embeddings, and the 2x2 ablation isolating the role of normalization. The near-zero train/val gap and concrete metrics strengthen the internal validity of the pretraining approach.
major comments (1)
- [Abstract] The positioning of Shape as supporting 'industrial CAD workflows' and 'industrial CAD analysis' relies on the assumption that the learned embeddings will transfer to real downstream tasks. However, all reported results are limited to reconstruction R² and retrieval accuracy on held-out data from the pretraining datasets, with no evaluations on actual industrial tasks such as manufacturing feature classification, part segmentation, or assembly reasoning. This is load-bearing for the central claim of the manuscript.
minor comments (1)
- [Abstract] The reconstruction metric is written as 'R2'; it should be formatted as R² for mathematical clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive review, the recognition of the model's internal validity, the near-zero train/val gap, the ablation isolating normalization, and the value of the released code and embeddings. We address the single major comment below.
Point-by-point responses
- Referee: [Abstract] The positioning of Shape as supporting 'industrial CAD workflows' and 'industrial CAD analysis' relies on the assumption that the learned embeddings will transfer to real downstream tasks. However, all reported results are limited to reconstruction R² and retrieval accuracy on held-out data from the pretraining datasets, with no evaluations on actual industrial tasks such as manufacturing feature classification, part segmentation, or assembly reasoning. This is load-bearing for the central claim of the manuscript.
Authors: We agree that the abstract's framing assumes the embeddings will prove useful for downstream industrial tasks and that direct evidence on tasks such as manufacturing feature classification, part segmentation, or assembly reasoning would strengthen the central claim. The current results focus on self-supervised pretraining quality via masked normalized geometry reconstruction (R² = 0.729) and multi-resolution contrastive retrieval (98.1% top-1) on held-out meshes drawn from the same industrial CAD sources (Thingi10K, MFCAD, Fusion360). These proxies directly measure geometric fidelity and discriminative power, which are prerequisites for the cited downstream applications; the 2×2 ablation further isolates that per-dimension normalization is essential for both objectives. In the foundation-model literature, such proxy metrics on held-out data are standard for validating the backbone prior to transfer studies. The released embeddings and code are explicitly provided to support exactly those follow-on evaluations. To address the concern without overstating current evidence, we will revise the abstract to state that the model supplies dense per-token embeddings validated on reconstruction and retrieval proxies and is intended as a foundation for industrial CAD analysis, while adding a dedicated limitations paragraph that explicitly notes the absence of downstream task results and the reliance on proxy generalization. revision: partial
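The follow-on transfer evaluations the response points to would typically take the form of a linear probe on frozen embeddings. A hedged sketch with a least-squares one-vs-all probe and synthetic labels; nothing here comes from the released code, and the two "part families" are purely illustrative.

```python
import numpy as np

def linear_probe_accuracy(train_z, train_y, test_z, test_y, n_classes):
    """Fit a one-vs-all least-squares classifier on frozen embeddings
    and report test accuracy, a standard transfer-quality proxy."""
    Y = np.eye(n_classes)[train_y]  # one-hot targets
    # small ridge term keeps the normal-equation solve well conditioned
    A = train_z.T @ train_z + 1e-3 * np.eye(train_z.shape[1])
    W = np.linalg.solve(A, train_z.T @ Y)
    pred = np.argmax(test_z @ W, axis=1)
    return np.mean(pred == test_y)

# synthetic stand-in for embeddings of two separable part families
rng = np.random.default_rng(3)
z0 = rng.normal(loc=-2.0, size=(200, 8))
z1 = rng.normal(loc=+2.0, size=(200, 8))
z = np.vstack([z0, z1])
y = np.array([0] * 200 + [1] * 200)
acc = linear_probe_accuracy(z[::2], y[::2], z[1::2], y[1::2], n_classes=2)
```

Running such a probe against labels from, e.g., MFCAD machining-feature annotations would give exactly the downstream evidence the referee asks for, without retraining the backbone.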
Circularity Check
No significant circularity; empirical pretraining and held-out evaluation
Full rationale
The paper describes a self-supervised transformer model pretrained via masked reconstruction and contrastive losses on public CAD datasets, with performance reported as R² = 0.729 and 98.1% top-1 retrieval on a held-out split of 2,983 meshes, plus an empirical 2×2 ablation on normalization. No equations, derivations, or self-citations are presented that reduce any claimed result to fitted parameters or prior author work by construction. All metrics are computed on independent test data under standard protocols, satisfying the criteria for a self-contained empirical result with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- 10.9M backbone parameters
- Per-dimension normalization scales
axioms (2)
- domain assumption: Masked token reconstruction of normalized geometry statistics yields useful dense embeddings
- domain assumption: Multi-resolution contrastive consistency improves geometric representation quality
invented entities (1)
- MAGNO tokenizer (no independent evidence)
Reference graph
Works this paper leans on
- [1] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In CVPR, 2022.
- [2] Yatian Pang, Wenxiao Wang, Francis E. H. Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In ECCV, 2022.
- [3] Shizheng Wen, Arsh Kumbhat, Levi Lingsch, Sepehr Mousavi, Yizhou Zhao, Praveen Chandrashekar, and Siddhartha Mishra. Geometry aware operator transformer as an efficient and accurate neural surrogate for PDEs on arbitrary domains. In NeurIPS, 2025. https://arxiv.org/abs/2505.18781
- [4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
- [5] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In EMNLP, 2023.
- [6] Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019.
- [7] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864, 2021.
- [8] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.
- [9] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7), 2015.
- [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- [11] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.
- [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021.
- [13] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
- [14] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
- [15] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
- [16] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv:2003.04297, 2020.
- [17] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
- [18] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
- [19] Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
- [20] Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-M2AE: Multi-scale masked autoencoders for hierarchical point cloud pre-training. In NeurIPS, 2022.
- [21] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv:2003.03485, 2020.
- [22] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. ABC: A big CAD model dataset for geometric deep learning. In CVPR, 2019.
- [23] Qingnan Zhou and Alec Jacobson. Thingi10K: A dataset of 10,000 3D-printing models. arXiv:1605.04797, 2016.
- [24] Karl D.D. Willis, Yewen Pu, Jieliang Luo, Hang Chu, Tao Du, Joseph G. Lambourne, Armando Solar-Lezama, and Wojciech Matusik. Fusion 360 gallery: A dataset and environment for programmatic CAD construction from human design sequences. ACM Transactions on Graphics (SIGGRAPH), 2021.
- [25] Andrew Colligan, Trevor Robinson, Declan Nolan, Yang Hua, and Wanbin Cao. Hierarchical CADnet: Learning from B-reps for machining feature recognition. Computer-Aided Design, 2022.
- [26] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In CVPR, 2023.
- [27] Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In CVPR, 2019.
- [28] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
- [29] Ross Girshick. Fast R-CNN. In ICCV, 2015.
- [30] Reduan Achtibat, Maximilian Dreyer, Ilona Eisenbraun, Sebastian Bosse, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. From attribution maps to human-understandable explanations through concept relevance propagation. Nature Machine Intelligence, 2023.
- [31] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
- [32] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.