pith · machine review for the scientific record

arxiv: 2604.19609 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D semantic segmentation · Volume Transformer · Transformer for 3D · volumetric tokens · 3D scene understanding · multi-dataset training · rotary embeddings · instance segmentation

The pith

A minimally adapted vanilla Transformer processes 3D scenes as volumetric tokens and surpasses specialized backbones when trained at scale across datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers dominate most other domains, but 3D scene understanding still relies on custom backbones with built-in priors. This paper adapts the vanilla Transformer encoder to 3D by splitting scenes into volumetric patch tokens, applying global self-attention, and injecting position through 3D rotary embeddings. Naive training fails through shortcut learning on small benchmarks, so the authors add strong augmentations, regularization, and distillation from a CNN teacher. Scaling to joint training on multiple datasets then lets Volt improve more than domain-specific architectures, reaching top results in semantic segmentation and as a backbone for instance segmentation.

Core claim

The central claim is that the Volume Transformer (Volt), using volumetric patch tokens and full self-attention with 3D rotary positional embeddings, can serve as a general-purpose backbone for 3D scenes. Once shortcut learning is addressed through a data-efficient training strategy of augmentations, regularization, and teacher distillation, and supervision is further scaled by joint training on multiple indoor and outdoor datasets, Volt achieves state-of-the-art performance across benchmarks and in instance segmentation pipelines.

What carries the argument

Volumetric patch tokens processed with global self-attention and extended rotary positional embeddings, supported by a training pipeline of 3D augmentations, regularization, and convolutional distillation.
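As a concrete reading of that machinery, here is a minimal sketch of the forward pass described in the paper's Figure 2 caption: non-overlapping volumetric patches, a linear tokenizer, a vanilla encoder with full global attention, a single transposed convolution back to voxel resolution, and a linear classifier. All names and sizes are hypothetical, the input is assumed to be a dense voxel grid (the paper likely handles sparse scene voxels), and the 3D rotary embeddings, which belong inside the attention layers, are omitted here.

```python
import torch
import torch.nn as nn

class VoltSketch(nn.Module):
    """Minimal sketch of a Volt-style forward pass (hypothetical names and sizes)."""

    def __init__(self, in_channels=3, patch=8, dim=384, depth=12, heads=6, num_classes=20):
        super().__init__()
        # Linear tokenizer: one token per non-overlapping volumetric patch.
        self.tokenizer = nn.Conv3d(in_channels, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True, norm_first=True)
        # Vanilla encoder with full global self-attention over all patch tokens.
        # 3D rotary positional embeddings would be applied inside each attention
        # layer; nn.TransformerEncoder offers no hook for that, so position is
        # omitted in this sketch.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Single transposed convolution back to voxel resolution, then a linear classifier.
        self.upsample = nn.ConvTranspose3d(dim, dim, kernel_size=patch, stride=patch)
        self.classifier = nn.Conv3d(dim, num_classes, kernel_size=1)

    def forward(self, voxels):                          # voxels: (B, C, D, H, W)
        tokens = self.tokenizer(voxels)                 # (B, dim, D/p, H/p, W/p)
        b, d, *grid = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)         # (B, N, dim)
        seq = self.encoder(seq)                         # global attention over all tokens
        tokens = seq.transpose(1, 2).reshape(b, d, *grid)
        return self.classifier(self.upsample(tokens))   # per-voxel class logits
```

For a 64x64x64 input at patch size 8 this yields 512 tokens per scene, a sequence length that full global attention handles without any windowing or hierarchy.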

Load-bearing premise

The combination of strong 3D augmentations, regularization, and distillation from a convolutional teacher fully prevents shortcut learning on existing 3D datasets without requiring per-dataset tuning.
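The page does not spell out the distillation objective itself. A minimal sketch of one common form, combining per-voxel cross-entropy with a temperature-scaled KL term toward the convolutional teacher's logits, is given below; the function name, weighting, and temperature are assumptions, not the paper's stated recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, tau=2.0, alpha=0.5):
    """Hypothetical per-voxel distillation objective for training Volt.

    student_logits, teacher_logits: (N_voxels, num_classes); labels: (N_voxels,).
    """
    teacher_logits = teacher_logits.detach()  # the convolutional teacher is frozen
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * tau * tau  # standard temperature scaling of the distillation term
    return (1 - alpha) * ce + alpha * kd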

What would settle it

A controlled experiment showing that Volt trained only on a single large dataset without the proposed recipe performs worse than current specialized backbones, or that multi-dataset scaling does not yield superior gains compared to domain-specific models.

Figures

Figures reproduced from arXiv: 2604.19609 by Adrian Kruse, Bastian Leibe, Daan de Geus, Kadir Yilmaz, Tristan Höfer.

Figure 2
Figure 2: Volt architecture. The input 3D scene is partitioned into non-overlapping volumetric patches, and each patch is embedded into a token with a linear tokenizer. The resulting token sequence is processed by a Transformer encoder with global attention. The latent tokens are then upsampled back to the voxel resolution with a single transposed convolution and mapped to semantic predictions by a linear classifica… view at source ↗
Figure 4
Figure 4: Scaling behaviour comparison of 3D backbones. Semantic segmentation mIoU on ScanNet200 when training with 50% and 100% of ScanNet200, and under multi-dataset joint training, also using ARKitScenes [71] and ScanNet++ [42]. As supervision increases, Volt scales more favorably than prior domain-specific backbones. view at source ↗
Figure 5
Figure 5: Qualitative semantic segmentation results on ScanNet. view at source ↗
Figure 6
Figure 6: Qualitative semantic segmentation results on SemanticKITTI. view at source ↗
read the original abstract

Transformers have become a common foundation across deep learning, yet 3D scene understanding still relies on specialized backbones with strong domain priors. This keeps the field isolated from the broader Transformer ecosystem, limiting the transfer of new advances as well as the benefits of increasingly optimized software and hardware stacks. To bridge this gap, we adapt the vanilla Transformer encoder to 3D scenes with minimal modifications. Given an input 3D scene, we partition it into volumetric patch tokens, process them with full global self-attention, and inject positional information via a 3D extension of rotary positional embeddings. We call the resulting model the Volume Transformer (Volt) and apply it to 3D semantic segmentation. Naively training Volt on standard 3D benchmarks leads to shortcut learning, highlighting the limited scale of current 3D supervision. To overcome this, we introduce a data-efficient training recipe based on strong 3D augmentations, regularization, and distillation from a convolutional teacher, making Volt competitive with state-of-the-art methods. We then scale supervision through joint training on multiple datasets and show that Volt benefits more from increased scale than domain-specific 3D backbones, achieving state-of-the-art results across indoor and outdoor datasets. Finally, when used as a drop-in backbone in a standard 3D instance segmentation pipeline, Volt again sets a new state of the art, highlighting its potential as a simple, scalable, general-purpose backbone for 3D scene understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Volume Transformer (Volt), a minimally modified vanilla Transformer encoder for 3D scene understanding. Input scenes are partitioned into volumetric patch tokens processed with full global self-attention and 3D rotary positional embeddings. Naive training on standard benchmarks leads to shortcut learning due to limited 3D scale; this is addressed via a data-efficient recipe of strong 3D augmentations, regularization, and distillation from a convolutional teacher. Joint training across multiple datasets then demonstrates that Volt benefits more from scale than domain-specific 3D backbones, yielding SOTA semantic segmentation on indoor/outdoor datasets. Volt is also shown to set a new SOTA when used as a drop-in backbone in a standard 3D instance segmentation pipeline.

Significance. If the empirical claims hold under fair comparisons, the work is significant as it provides evidence that a simple, general-purpose vanilla Transformer can match or exceed specialized 3D architectures when properly scaled and trained. This could help integrate 3D scene understanding into the broader Transformer ecosystem, facilitating transfer of scaling techniques, software optimizations, and hardware support. The scaling results, if reproducible and controlled, would be a useful data point on architecture-agnostic benefits of increased supervision in 3D.

major comments (2)
  1. [§4 (scaling experiments) and associated tables] The central claim that 'Volt benefits more from increased scale than domain-specific 3D backbones' is load-bearing. The manuscript must explicitly document that the comparison baselines (e.g., MinkowskiNet, Point Transformer variants) received identical joint multi-dataset training, the same strong 3D augmentations, regularization, and distillation from the convolutional teacher. If the recipe was applied only to Volt, the differential scaling advantage cannot be attributed to the vanilla architecture.
  2. [§3.2 (training recipe) and ablation tables] The recipe is presented as overcoming shortcut learning, yet the paper provides limited quantitative ablations isolating the contribution of each component (augmentations vs. regularization vs. distillation). This weakens the ability to attribute success to the transformer backbone rather than the training procedure.
minor comments (2)
  1. [Abstract] Claims of SOTA performance are made without any numerical results, dataset names, or baseline deltas; adding 1-2 key metrics would improve readability and impact.
  2. [§2.2] Notation: The 3D extension of rotary embeddings is introduced but the exact formulation (e.g., how angles are computed per axis) should be stated in a single equation for clarity.
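For reference, one common way to write such a 3D extension splits the head dimension d into three equal groups, one per spatial axis, and applies a standard rotary rotation within each group. This is a plausible reading of the referee's request, not necessarily the paper's exact formulation.

```latex
% Hedged sketch of one common 3D extension of rotary embeddings (not
% necessarily the paper's exact choice). The head dimension d is split into
% three groups of size d/3; p_a is the patch coordinate along axis a.
\[
  \theta_{a,i} = \frac{p_a}{10000^{\,6i/d}}, \qquad
  a \in \{x, y, z\}, \quad i = 0, \dots, \tfrac{d}{6} - 1,
\]
\[
  \begin{pmatrix} q'_{a,2i} \\ q'_{a,2i+1} \end{pmatrix}
  =
  \begin{pmatrix} \cos\theta_{a,i} & -\sin\theta_{a,i} \\
                  \sin\theta_{a,i} & \phantom{-}\cos\theta_{a,i} \end{pmatrix}
  \begin{pmatrix} q_{a,2i} \\ q_{a,2i+1} \end{pmatrix},
\]
% with the same rotation applied to the keys, where q_{a,.} denotes the
% query dimensions assigned to axis a.
```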

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the scaling experiments and training recipe. We address each major comment below and will revise the manuscript to improve clarity and strengthen the empirical claims.

read point-by-point responses
  1. Referee: [§4 (scaling experiments) and associated tables] The central claim that 'Volt benefits more from increased scale than domain-specific 3D backbones' is load-bearing. The manuscript must explicitly document that the comparison baselines (e.g., MinkowskiNet, Point Transformer variants) received identical joint multi-dataset training, the same strong 3D augmentations, regularization, and distillation from the convolutional teacher. If the recipe was applied only to Volt, the differential scaling advantage cannot be attributed to the vanilla architecture.

    Authors: We agree that the differential scaling claim requires controlled, identical training conditions to be fully attributable to the architecture. The current manuscript compares Volt (trained with the full recipe) against baselines using their originally published numbers, which did not employ our joint multi-dataset training, augmentations, regularization, or distillation. To resolve this, we will add new controlled experiments in the revision: we will retrain MinkowskiNet and Point Transformer variants using exactly the same joint training protocol, augmentations, regularization, and teacher distillation as Volt. Updated tables and text in §4 will report these results, allowing a direct apples-to-apples comparison of scaling behavior. revision: yes

  2. Referee: [§3.2 (training recipe) and ablation tables] The recipe is presented as overcoming shortcut learning, yet the paper provides limited quantitative ablations isolating the contribution of each component (augmentations vs. regularization vs. distillation). This weakens the ability to attribute success to the transformer backbone rather than the training procedure.

    Authors: We acknowledge that the existing ablations in §3.2 demonstrate the overall effectiveness of the recipe but do not fully isolate the marginal contribution of each element (strong 3D augmentations, regularization, and distillation). While some component-wise results are present (particularly for distillation), we agree that more granular quantitative breakdowns would strengthen the analysis. In the revision we will expand the ablation tables and accompanying text to include separate runs isolating each component and their combinations, clarifying how much of the performance gain is due to the training procedure versus the vanilla Transformer backbone itself. revision: yes
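A minimal sketch of the component-isolation grid this response promises: enumerate all on/off combinations of the three recipe elements and launch one run per configuration. The switch names and the idea of handing each configuration to a training entry point are hypothetical, not taken from the paper.

```python
from itertools import product

# Hypothetical recipe switches; the actual ablation axes belong to the revised §3.2.
COMPONENTS = ["strong_aug", "regularization", "distillation"]

def ablation_configs():
    """Yield all 2^3 on/off combinations of the recipe components."""
    for flags in product([False, True], repeat=len(COMPONENTS)):
        yield dict(zip(COMPONENTS, flags))

if __name__ == "__main__":
    for cfg in ablation_configs():
        name = "+".join(k for k, v in cfg.items() if v) or "baseline"
        print(f"run {name}: {cfg}")  # e.g. hand off to a (hypothetical) train_volt(cfg)
```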

Circularity Check

0 steps flagged

No significant circularity: purely empirical adaptation and scaling results

full rationale

The paper describes an empirical adaptation of the vanilla Transformer encoder to 3D volumetric patches using rotary embeddings, followed by a training recipe of augmentations, regularization, and teacher distillation to address shortcut learning on limited 3D data. It then reports results from joint multi-dataset training and drop-in use in instance segmentation. No derivation chain, equations, or fitted parameters are presented as predictions; all claims rest on benchmark comparisons. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear. The work is self-contained against external public benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the approach relies on standard Transformer components and common empirical training practices.

pith-pipeline@v0.9.0 · 5579 in / 1173 out tokens · 60283 ms · 2026-05-10T03:30:10.937007+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

89 extracted references · 12 canonical work pages · 9 internal anchors

  1. [1]

    Attention is All You Need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All You Need,” inNeurIPS, 2017

  2. [2]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in ICLR, 2021

  3. [3]

    Language Models are Few-Shot Learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language Models are Few-Shot Learners,” in NeurIPS, 2020

  4. [4]

    Robust Speech Recognition via Large-Scale Weak Supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recognition via Large-Scale Weak Supervision,” inICML, 2023

  5. [5]

    Learning Transferable Visual Models From Natural Language Supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning Transferable Visual Models From Natural Language Supervision,” in ICML, 2021

  6. [6]

    RoFormer: Enhanced Transformer with Rotary Position Embedding,

    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced Transformer with Rotary Position Embedding,” Neurocomputing, 2024

  7. [7]

    Masked Autoencoders Are Scalable Vision Learners,

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked Autoencoders Are Scalable Vision Learners,” in CVPR, 2022

  8. [8]

    Query-Key Normalization for Transformers,

    A. Henry, P. R. Dachapally, S. S. Pawar, and Y . Chen, “Query-Key Normalization for Transformers,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4246–4253

  9. [9]

    NVIDIA Hopper Architecture In-Depth,

    M. Andersch, G. Palmer, R. Krashinsky, N. Stam, V . Mehta, G. Brito, and S. Ramaswamy, “NVIDIA Hopper Architecture In-Depth,” https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/, 2022, NVIDIA Technical Blog. Accessed 2026-02-23

  10. [10]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,

    T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” in NeurIPS, 2022

  11. [11]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,

    T. Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” inICLR, 2024

  12. [12]

    FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision,

    J. Shah, G. Bikshandi, Y . Zhang, V . Thakkar, P. Ramani, and T. Dao, “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision,” inNeurIPS, 2024

  13. [13]

    4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks,

    C. Choy, J. Gwak, and S. Savarese, “4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks,” inCVPR, 2019

  14. [14]

    Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation,

    X. Zhu, H. Zhou, T. Wang, F. Hong, Y . Ma, W. Li, H. Li, and D. Lin, “Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation,” inCVPR, 2021

  15. [15]

    Mask3D: Mask Transformer for 3D Semantic Instance Segmentation,

    J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe, “Mask3D: Mask Transformer for 3D Semantic Instance Segmentation,” inICRA, 2023

  16. [16]

    OneFormer3D: One Transformer for Unified Point Cloud Segmentation,

    M. Kolodiazhnyi, A. Vorontsova, A. Konushin, and D. Rukhovich, “OneFormer3D: One Transformer for Unified Point Cloud Segmentation,” in CVPR, 2024

  17. [17]

    Mask4Former: Mask Transformer for 4D Panoptic Segmentation,

    K. Yilmaz, J. Schult, A. Nekrasov, and B. Leibe, “Mask4Former: Mask Transformer for 4D Panoptic Segmentation,” in ICRA, 2024

  18. [18]

    Point Transformer,

    H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V . Koltun, “Point Transformer,” inICCV, 2021

  19. [19]

    Point Transformer V2: Grouped Vector Attention and Partition-based Pooling,

    X. Wu, Y . Lao, L. Jiang, X. Liu, and H. Zhao, “Point Transformer V2: Grouped Vector Attention and Partition-based Pooling,” inNeurIPS, 2022

  20. [20]

    Point Transformer V3: Simpler, Faster, Stronger,

    X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y . Qiao, W. Ouyang, T. He, and H. Zhao, “Point Transformer V3: Simpler, Faster, Stronger,” inCVPR, 2024

  21. [21]

    Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding,

    Y.-Q. Yang, Y.-X. Guo, J.-Y. Xiong, Y. Liu, H. Pan, P.-S. Wang, X. Tong, and B. Guo, “Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding,” arXiv preprint arXiv:2304.06906, 2023

  22. [22]

    Stratified Transformer for 3D Point Cloud Segmentation,

    X. Lai, J. Liu, L. Jiang, L. Wang, H. Zhao, S. Liu, X. Qi, and J. Jia, “Stratified Transformer for 3D Point Cloud Segmentation,” inCVPR, 2022

  23. [23]

    Spherical Transformer for LiDAR-Based 3D Recognition,

    X. Lai, Y . Chen, F. Lu, J. Liu, and J. Jia, “Spherical Transformer for LiDAR-Based 3D Recognition,” in CVPR, 2023

  24. [24]

    LitePT: Lighter Yet Stronger Point Transformer,

    Y. Yue, D. Robert, J. Wang, S. Hong, J. D. Wegner, C. Rupprecht, and K. Schindler, “LitePT: Lighter Yet Stronger Point Transformer,” arXiv preprint arXiv:2512.13689, 2025

  25. [25]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The Llama 3 Herd of Models,” arXiv preprint arXiv:2407.21783, 2024

  26. [26]

    Gemma 3 Technical Report

    G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière et al., “Gemma 3 Technical Report,” arXiv preprint arXiv:2503.19786, 2025

  27. [27]

    DINOv3

    O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski, “DINOv3,” arXiv preprint arXiv:2508.10104, 2025

  28. [28]

    SAM 3: Segment Anything with Concepts

    N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang et al., “SAM 3: Segment Anything with Concepts,” arXiv preprint arXiv:2511.16719, 2025

  29. [29]

    ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes,

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes,” inCVPR, 2017

  30. [30]

    nuScenes: A Multimodal Dataset for Autonomous Driving,

    H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A Multimodal Dataset for Autonomous Driving,” in CVPR, 2020

  31. [31]

    On the Relationship between Self-Attention and Convolutional Layers,

    J.-B. Cordonnier, A. Loukas, and M. Jaggi, “On the Relationship between Self-Attention and Convolutional Layers,” in ICLR, 2020

  32. [32]

    Do Vision Transformers See Like Convolutional Neural Networks?

    M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy, “Do Vision Transformers See Like Convolutional Neural Networks?” inNeurIPS, 2021

  33. [33]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling Laws for Neural Language Models,”arXiv preprint arXiv:2001.08361, 2020

  34. [34]

    ImageNet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” inCVPR, 2009

  35. [35]

    Revisiting Unreasonable Effectiveness of Data in Deep Learning Era,

    C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era,” inICCV, 2017

  36. [36]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki, “LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs,”arXiv preprint arXiv:2111.02114, 2021

  37. [37]

    How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers,

    A. P. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer, “How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers,”TMLR, 2022

  38. [38]

    Training Data-Efficient Image Transformers & Distillation Through Attention,

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training Data-Efficient Image Transformers & Distillation Through Attention,” in ICML, 2021

  39. [39]

    Mix3D: Out-of-Context Data Augmentation for 3D Scenes,

    A. Nekrasov, J. Schult, O. Litany, B. Leibe, and F. Engelmann, “Mix3D: Out-of-Context Data Augmentation for 3D Scenes,” in 3DV, 2021

  40. [40]

    Rethinking the Inception Architecture for Computer Vision,

    C. Szegedy, V . Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” inCVPR, 2016

  41. [41]

    Deep Networks with Stochastic Depth,

    G. Huang, Y . Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep Networks with Stochastic Depth,” in ECCV, 2016

  42. [42]

    ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes,

    C. Yeshwanth, Y .-C. Liu, M. Nießner, and A. Dai, “ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes,” inICCV, 2023

  43. [43]

    ARKitScenes - a diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data,

    G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y. Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, and E. Shulman, “ARKitScenes - a diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data,” in NeurIPS, 2021

  44. [44]

    PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation,

    C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation,” inCVPR, 2017

  45. [45]

    PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space,

    C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space,” inNeurIPS, 2017

  46. [46]

    Pointconv: Deep Convolutional Networks on 3D Point Clouds,

    W. Wu, Z. Qi, and L. Fuxin, “Pointconv: Deep Convolutional Networks on 3D Point Clouds,” inCVPR, 2019

  47. [47]

    KPConv: Flexible and Deformable Convolution for Point Clouds,

    H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “KPConv: Flexible and Deformable Convolution for Point Clouds,” inCVPR, 2019

  48. [48]

    KPConvX: Modernizing Kernel Point Convolution with Kernel Attention,

    H. Thomas, Y.-H. H. Tsai, T. D. Barfoot, and J. Zhang, “KPConvX: Modernizing Kernel Point Convolution with Kernel Attention,” in CVPR, 2024

  49. [49]

    Deep Learning for 3D Point Clouds: A Survey,

    Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, “Deep Learning for 3D Point Clouds: A Survey,” IEEE TPAMI, 2020

  50. [50]

    3D Semantic Segmentation with Submanifold Sparse Convolutional Networks,

    B. Graham, M. Engelcke, and L. Van Der Maaten, “3D Semantic Segmentation with Submanifold Sparse Convolutional Networks,” inCVPR, 2018

  51. [52]

    Spconv: Spatially Sparse Convolution Library,

    Spconv Contributors, “Spconv: Spatially Sparse Convolution Library,” https://github.com/traveller59/spconv, 2022

  52. [53]

    U-Net: Convolutional Networks for Biomedical Image Segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in MICCAI, 2015

  53. [54]

    OctFormer: Octree-based Transformers for 3D Point Clouds,

    P.-S. Wang, “OctFormer: Octree-based Transformers for 3D Point Clouds,” ACM Transactions on Graphics (SIGGRAPH), vol. 42, no. 4, 2023

  54. [55]

    PCT: Point Cloud Transformer,

    M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, “PCT: Point Cloud Transformer,” Computational Visual Media, 2021

  55. [56]

    Masked Autoencoders for Point Cloud Self-supervised Learning,

    Y . Pang, W. Wang, F. E. Tay, W. Liu, Y . Tian, and L. Yuan, “Masked Autoencoders for Point Cloud Self-supervised Learning,” inECCV, 2022

  56. [57]

    Point-BERT: Pre-Training 3D Point Cloud Transformers With Masked Point Modeling,

    X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-BERT: Pre-Training 3D Point Cloud Transformers With Masked Point Modeling,” in CVPR, 2022

  57. [58]

    Point2Vec for Self-Supervised Representation Learning on Point Clouds,

    K. Knaebel, J. Schult, A. Hermans, and B. Leibe, “Point2Vec for Self-Supervised Representation Learning on Point Clouds,” in GCPR, 2023

  58. [59]

    3D ShapeNets: A Deep Representation for Volumetric Shapes,

    Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A Deep Representation for Volumetric Shapes,” in CVPR, 2015

  59. [60]

    ShapeNet: An Information-Rich 3D Model Repository

    A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “ShapeNet: An Information-Rich 3D Model Repository,” arXiv preprint arXiv:1512.03012, 2015

  60. [61]

    PyTorch Image Models,

    R. Wightman, “PyTorch Image Models,” https://github.com/rwightman/pytorch-image-models, 2019

  61. [62]

    Layer Normalization

    J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” arXiv preprint arXiv:1607.06450, 2016

  62. [63]

    EVA-02: A Visual Representation for Neon Genesis,

    Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, and Y. Cao, “EVA-02: A Visual Representation for Neon Genesis,” Image and Vision Computing, vol. 149, p. 105171, 2024

  63. [64]

    Rotary Position Embedding for Vision Transformer,

    B. Heo, S. Park, D. Han, and S. Yun, “Rotary Position Embedding for Vision Transformer,” inECCV, 2024

  64. [65]

    Superpoint Transformer for 3D Scene Instance Segmentation,

    J. Sun, C. Qing, J. Tan, and X. Xu, “Superpoint Transformer for 3D Scene Instance Segmentation,” in AAAI, 2023

  65. [66]

    DINOv2: Learning Robust Visual Features without Supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov et al., “DINOv2: Learning Robust Visual Features without Supervision,” TMLR, 2024

  66. [67]

    Transferring Inductive Biases through Knowledge Distillation,

    S. Abnar, M. Dehghani, and W. Zuidema, “Transferring Inductive Biases through Knowledge Distillation,” arXiv preprint arXiv:2006.00555, 2020

  67. [68]

    DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers,

    X. Chen, Q. Cao, Y . Zhong, J. Zhang, S. Gao, and D. Tao, “DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers,” inCVPR, 2022

  68. [69]

    Cumulative Spatial Knowledge Distillation for Vision Transformers,

    B. Zhao, R. Song, and J. Liang, “Cumulative Spatial Knowledge Distillation for Vision Transformers,” in ICCV, 2023

  69. [70]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv preprint arXiv:1503.02531, 2015

  70. [71]

    ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding,

    G. Ji, S. Weder, F. Engelmann, M. Pollefeys, and H. Blum, “ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding,” in CVPR, 2025

  71. [72]

    Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training,

    X. Wu, Z. Tian, X. Wen, B. Peng, X. Liu, K. Yu, and H. Zhao, “Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training,” inCVPR, 2024

  72. [73]

    DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation,

    K. Knaebel, K. Yilmaz, D. de Geus, A. Hermans, D. Adrian, T. Linder, and B. Leibe, “DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation,” in3DV, 2026

  73. [74]

    PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies,

    G. Qian, Y . Li, H. Peng, J. Mai, H. Hammoud, M. Elhoseiny, and B. Ghanem, “PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies,” inNeurIPS, 2022

  74. [75]

    OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation,

    B. Peng, X. Wu, L. Jiang, Y . Chen, H. Zhao, Z. Tian, and J. Jia, “OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation,” inCVPR, 2024

  75. [76]

    The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,

    M. Berman, A. R. Triki, and M. B. Blaschko, “The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in CVPR, 2018

  76. [77]

    Decoupled Weight Decay Regularization,

    I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” inICLR, 2019

  77. [78]

    Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates,

    L. N. Smith and N. Topin, “Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates,” in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, 2019

  78. [79]

    Language-Grounded Indoor 3D Semantic Segmentation in the Wild,

    D. Rozenberszki, O. Litany, and A. Dai, “Language-Grounded Indoor 3D Semantic Segmentation in the Wild,” inECCV, 2022

  79. [80]

    Scalability in Perception for Autonomous Driving: Waymo Open Dataset,

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov, “Scalability in Perception for Autonomous Driving: Waymo Open Dataset,” in CVPR, 2020

  80. [81]

    SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences,

    J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences,” inICCV, 2019

Showing first 80 references.