Pith · machine review for the scientific record

arxiv: 2604.22899 · v1 · submitted 2026-04-24 · 💻 cs.CV

Recognition: unknown

Text-Guided Multimodal Unified Industrial Anomaly Detection

Linlin Shen, Shuo Ye, Weicheng Xie, Zewen Li, Zitong Yu

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords industrial anomaly detection · multimodal · text-guided · unified model · RGB-3D · cross-modal alignment · unsupervised · defect localization

The pith

Text semantic guidance enables one model to detect anomalies across multiple industrial classes from RGB and 3D scans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes using text descriptions as semantic guidance to align RGB images with 3D point clouds for spotting defects in factory products. Current unsupervised methods often fail because they lack clear high-level cues for matching the two data types and because they model shapes poorly when converting between modalities. The new approach adds two modules: one maps features across modalities while keeping geometry intact, and one adapts text features to condition the detection process on the specific object. This yields a single model that works for many different classes instead of requiring a separately trained model for each one. If successful, factories could inspect varied items with less customization and higher accuracy in both identifying and locating anomalies.
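As a concrete picture of what such a pipeline could look like, here is a minimal PyTorch sketch that wires a cross-modal feature mapper and a text conditioner into a patch-level anomaly scorer. All module names, dimensions, and the cosine-distance scoring rule are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical sketch of the text-guided multimodal pipeline described above.
# Names, dimensions, and the scoring rule are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMapper(nn.Module):
    """Maps RGB patch features into the 3D feature space (stand-in for GACM)."""
    def __init__(self, rgb_dim=768, pc_dim=384):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(rgb_dim, pc_dim), nn.GELU(),
                                  nn.Linear(pc_dim, pc_dim))

    def forward(self, rgb_feats):            # (N, rgb_dim) -> (N, pc_dim)
        return self.proj(rgb_feats)

class TextConditioner(nn.Module):
    """Modulates features with a class-level text embedding
    (FiLM-style stand-in for the object-conditioned adaptor)."""
    def __init__(self, text_dim=512, feat_dim=384):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, feat_dim)
        self.to_shift = nn.Linear(text_dim, feat_dim)

    def forward(self, feats, text_emb):      # feats (N, D), text_emb (text_dim,)
        return feats * self.to_scale(text_emb) + self.to_shift(text_emb)

def anomaly_scores(rgb_feats, pc_feats, text_emb, mapper, conditioner):
    """Score each patch by disagreement between mapped-RGB and 3D features
    after text conditioning; higher score = more anomalous."""
    mapped = conditioner(mapper(rgb_feats), text_emb)
    target = conditioner(pc_feats, text_emb)
    return 1.0 - F.cosine_similarity(mapped, target, dim=-1)

# Toy usage with random tensors standing in for backbone outputs.
rgb = torch.randn(196, 768)   # e.g. ViT patch features
pc = torch.randn(196, 384)    # e.g. point-cloud features aligned to patches
txt = torch.randn(512)        # e.g. a CLIP-style text embedding of the prompt
print(anomaly_scores(rgb, pc, txt, CrossModalMapper(), TextConditioner()).shape)
```

Note that in this sketch the only class-specific input is the text embedding; the same weights serve every class, which is what a one-model-many-classes claim requires of the architecture.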

Core claim

We propose a unified multimodal industrial anomaly detection framework guided by text semantics. The framework consists of two core modules: a Geometry-Aware Cross-Modal Mapper to preserve geometric structure during modality conversion, and an Object-Conditioned Textual Feature Adaptor to align multimodal features with semantic priors. Furthermore, we establish a unified learning paradigm for multimodal industrial anomaly detection, which breaks the one-model-one-class constraint and enables accurate anomaly detection across diverse classes using a single model. Extensive experiments on the MVTec 3D-AD and Eyecandies datasets demonstrate that our method achieves state-of-the-art performance.

What carries the argument

A text-guided framework whose Geometry-Aware Cross-Modal Mapper and Object-Conditioned Textual Feature Adaptor together resolve cross-modal alignment using semantic priors.

If this is right

  • A single trained model can perform accurate anomaly classification and localization across many different industrial object classes.
  • Text semantic priors improve alignment between RGB and 3D features, reducing the impact of modality-specific ambiguities.
  • Geometric structure is maintained during feature mapping from RGB to 3D, supporting better localization of defects.
  • The unsupervised setting becomes viable for diverse products without per-class retraining or labeled anomaly examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Factories could adapt the system to new product lines simply by providing text descriptions rather than collecting new training data for each class.
  • The same text-conditioning principle might extend to other sensor combinations such as RGB plus thermal or X-ray data.
  • Deployment costs could drop because one model replaces multiple class-specific detectors while maintaining or improving accuracy.

Load-bearing premise

High-level text semantic guidance can resolve ambiguous cross-modal alignment, and the two proposed modules can preserve geometry and align features effectively without introducing errors or requiring class-specific tuning.

What would settle it

Removing the text-guided adaptor module and measuring whether classification and localization performance on the MVTec 3D-AD dataset falls below the reported state-of-the-art levels.
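As one way to run that test, under the assumption that each trained variant exposes a scalar anomaly score per image, the harness below compares image-level AUROC for the full framework against the ablated one. The scorers, data, and metric choice here are placeholders, not the paper's evaluation protocol.

```python
# Hypothetical ablation harness: full model vs. model without the text-guided
# adaptor, compared on image-level AUROC. All names and data are placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(score_fn, samples, labels):
    """score_fn maps one sample to a scalar anomaly score; labels are 0/1."""
    scores = np.array([score_fn(s) for s in samples])
    return roc_auc_score(labels, scores)

rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 16))          # stand-in feature vectors
labels = rng.integers(0, 2, size=200)         # stand-in good/defect labels

full_model = lambda s: float(s.sum())         # placeholder: full framework
no_adaptor = lambda s: float(s[:8].sum())     # placeholder: ablated variant

auc_full = evaluate(full_model, samples, labels)
auc_ablated = evaluate(no_adaptor, samples, labels)
print(f"full: {auc_full:.3f}  w/o adaptor: {auc_ablated:.3f}  "
      f"delta: {auc_full - auc_ablated:+.3f}")
```

A positive delta on MVTec 3D-AD that accounts for the reported SOTA margin would support the adaptor's claimed contribution; no drop after removal would suggest the gains come from elsewhere.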

Figures

Figures reproduced from arXiv: 2604.22899 by Linlin Shen, Shuo Ye, Weicheng Xie, Zewen Li, Zitong Yu.

Figure 1: Comparison of different anomaly detection settings.
Figure 2: Overview of our method. The framework consists of three main stages. (1) Multi-modal Feature Extraction: RGB images, depth maps, and text are … (caption truncated at source)
Figure 3: Schematic of the Object-Conditioned Textual Prior Generation.
Figure 5: The impact of β on the MVTec 3D-AD dataset, as summarized in Table V. Starting from the baseline, the introduction of either GACM or OCTA in conjunction with Layers Pruning leads to a marked improvement in image-level detection. While the baseline exhibits competitive performance in pixel-level metrics, the synergistic integration of all three components yields the most robust results across the board. (caption truncated at source)
original abstract

Industrial anomaly detection based on RGB-3D multimodal data has emerged as a mainstream paradigm for intelligent quality inspection. However, existing unsupervised methods suffer from two critical limitations: ambiguous cross-modal alignment caused by the lack of high-level semantic guidance and insufficient geometric modeling for RGB-to-3D feature mapping. To address these issues, we propose a unified multimodal industrial anomaly detection framework guided by text semantics. The framework consists of two core modules: a Geometry-Aware Cross-Modal Mapper to preserve geometric structure during modality conversion, and an Object-Conditioned Textual Feature Adaptor to align multimodal features with semantic priors. Furthermore, we establish a unified learning paradigm for multimodal industrial anomaly detection, which breaks the one-model-one-class constraint and enables accurate anomaly detection across diverse classes using a single model. Extensive experiments on the MVTec 3D-AD and Eyecandies datasets demonstrate that our method achieves state-of-the-art performance in classification and localization under unsupervised settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that existing unsupervised multimodal (RGB-3D) industrial anomaly detection methods suffer from ambiguous cross-modal alignment due to missing high-level semantics and insufficient geometric modeling in RGB-to-3D feature mapping. It proposes a unified text-guided framework consisting of a Geometry-Aware Cross-Modal Mapper to preserve geometry during modality conversion and an Object-Conditioned Textual Feature Adaptor to align features with semantic priors. The framework introduces a unified learning paradigm that breaks the one-model-one-class constraint, allowing a single model to handle diverse classes. Extensive experiments on MVTec 3D-AD and Eyecandies are reported to achieve SOTA performance in unsupervised classification and localization.

Significance. If the modules demonstrably resolve cross-modal ambiguity and geometric preservation while enabling true single-model unification without per-class tuning or new artifacts, the work would advance scalable anomaly detection for industrial inspection by reducing the need for class-specific models and leveraging text priors for better generalization across object types.

major comments (2)
  1. [Abstract] The central claim that the Geometry-Aware Cross-Modal Mapper preserves geometric structure during RGB-to-3D conversion is unsupported by any mechanism, equation, or validation; without this, it is impossible to confirm the module addresses the stated geometric modeling limitation rather than introducing new mapping errors.
  2. [Abstract] The assertion that the Object-Conditioned Textual Feature Adaptor enables alignment with semantic priors and supports a unified paradigm across classes lacks any ablation, quantitative comparison, or implementation detail showing it avoids implicit class-specific dependencies; this directly undermines the load-bearing claim of breaking the one-model-one-class constraint.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, clarifying the support provided in the full paper while agreeing to strengthen the abstract for better readability.

point-by-point responses
  1. Referee: [Abstract] The central claim that the Geometry-Aware Cross-Modal Mapper preserves geometric structure during RGB-to-3D conversion is unsupported by any mechanism, equation, or validation; without this, it is impossible to confirm the module addresses the stated geometric modeling limitation rather than introducing new mapping errors.

    Authors: We appreciate the referee highlighting the need for clarity in the abstract. The Geometry-Aware Cross-Modal Mapper is fully specified in Section 3.2, including the geometry-preserving mechanism (geometry-aware projection with explicit point-cloud alignment constraints), the associated equations for feature mapping, and validation via ablation studies (Section 4.3) plus qualitative results (Figure 5) showing preserved structure without introduced artifacts. The abstract's brevity omitted these references; we will revise it to concisely note the geometric constraint approach and point to the supporting sections. (A generic sketch of such a geometry-preserving constraint appears after this exchange.) revision: yes

  2. Referee: [Abstract] The assertion that the Object-Conditioned Textual Feature Adaptor enables alignment with semantic priors and supports a unified paradigm across classes lacks any ablation, quantitative comparison, or implementation detail showing it avoids implicit class-specific dependencies; this directly undermines the load-bearing claim of breaking the one-model-one-class constraint.

    Authors: We thank the referee for this observation. Section 3.3 details the Object-Conditioned Textual Feature Adaptor, including its conditioning on object semantics to align features with priors while avoiding per-class parameters. Supporting evidence appears in ablation studies (Section 4.4), multi-class quantitative results (Table 2), and the unified training protocol that trains one model across all classes without per-class fine-tuning. We agree the abstract would benefit from a brief mention of the conditioning mechanism and will update it accordingly, with explicit cross-references to the experiments. revision: yes
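To make the rebuttal's "explicit point-cloud alignment constraints" tangible, the sketch below shows one generic, classical way to penalize geometric distortion during feature mapping: matching the normalized pairwise-distance structure of the mapped RGB features to that of the 3D features. This is an editorial illustration under assumed tensor shapes, not the GACM loss from the paper's Section 3.2.

```python
# Hypothetical geometry-preserving penalty for an RGB-to-3D feature mapper.
# This is an isometry-style illustration, not the paper's GACM formulation.
import torch
import torch.nn.functional as F

def geometry_preservation_loss(mapped_rgb, pc_feats, eps=1e-6):
    """Penalize distortion of pairwise-distance structure: the mapped RGB
    features should relate to each other the way the 3D features do."""
    d_map = torch.cdist(mapped_rgb, mapped_rgb, p=2)   # (N, N)
    d_pc = torch.cdist(pc_feats, pc_feats, p=2)        # (N, N)
    # Normalize by mean distance so the penalty is scale-invariant.
    return F.mse_loss(d_map / (d_map.mean() + eps), d_pc / (d_pc.mean() + eps))

# Toy usage: the loss is differentiable w.r.t. the mapped features.
mapped = torch.randn(64, 384, requires_grad=True)
pc = torch.randn(64, 384)
loss = geometry_preservation_loss(mapped, pc)
loss.backward()
print(float(loss))
```

In a training loop such a term would be added to the detection objective and weighted against the alignment loss; how the actual paper balances these is not recoverable from the abstract.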

Circularity Check

0 steps flagged

No significant circularity; claims rest on proposed modules and external benchmark results

full rationale

The paper introduces a text-guided multimodal framework consisting of a Geometry-Aware Cross-Modal Mapper and an Object-Conditioned Textual Feature Adaptor to address cross-modal alignment and geometric modeling issues in industrial anomaly detection. It further claims a unified learning paradigm that breaks the one-model-one-class constraint. These are presented as novel contributions, with performance validated through experiments on the independent public datasets MVTec 3D-AD and Eyecandies, achieving reported SOTA results in unsupervised classification and localization. None of the usual circularity red flags (equations that reduce to their inputs, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, ansatzes smuggled in via prior work) are evident in the provided text. The derivation chain is self-contained and relies on external empirical evaluation rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on the effectiveness of two newly introduced modules and the assumption that text semantics provide sufficient high-level guidance for cross-modal alignment in unsupervised settings.

axioms (2)
  • domain assumption Text descriptions of objects supply reliable semantic priors that can align RGB and 3D features without labeled anomaly examples
    Invoked to justify the Object-Conditioned Textual Feature Adaptor and overall framework.
  • domain assumption Unsupervised anomaly detection on public benchmarks like MVTec 3D-AD is a valid proxy for real industrial performance
    Standard assumption in the field used to support the SOTA claim.
invented entities (2)
  • Geometry-Aware Cross-Modal Mapper · no independent evidence
    purpose: Preserve geometric structure during RGB-to-3D feature mapping
    New module introduced to address insufficient geometric modeling.
  • Object-Conditioned Textual Feature Adaptor · no independent evidence
    purpose: Align multimodal features with semantic priors from text
    New module introduced to address ambiguous cross-modal alignment.

pith-pipeline@v0.9.0 · 5472 in / 1537 out tokens · 60953 ms · 2026-05-08T12:38:25.298903+00:00 · methodology

discussion (0)

