pith. machine review for the scientific record.

arxiv: 2604.20395 · v1 · submitted 2026-04-22 · 💻 cs.CV · cs.RO

Recognition: unknown

SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:03 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords open-vocabulary 3D segmentation · proposal-free instance segmentation · space-curve transformer · Morton curve serialization · multi-view captioning · zero-shot 3D segmentation · point cloud processing · real-time 3D perception

The pith

A space-curve transformer performs open-vocabulary 3D instance segmentation directly from point clouds without external proposals or multi-stage pipelines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a fast method for open-vocabulary 3D instance segmentation that avoids both the slow multi-stage pipelines and the external region proposals common in prior work. It presents SpaCeFormer, a transformer that serializes 3D points along Morton curves and applies spatial window attention to maintain coherent features across the scene; a RoPE-enhanced decoder then predicts instance masks straight from learned queries. To train this model, the authors build SpaCeFormer-3M, a dataset of 3 million multi-view-consistent captions over 604K instances, whose construction pipeline delivers 21 times higher mask recall than single-view methods. The resulting system runs at 0.14 seconds per scene, surpasses prior methods in zero-shot mAP on ScanNet++ and Replica, and reports the best proposal-free zero-shot result on ScanNet200.

Core claim

SpaCeFormer is a proposal-free space-curve transformer that combines spatial window attention with Morton-curve serialization to extract spatially coherent features from point clouds and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries. When trained on the SpaCeFormer-3M dataset of 3.0M multi-view-consistent captions obtained through mask clustering and VLM captioning, the model reaches 11.1 zero-shot mAP on ScanNet200 (2.8 times the prior best proposal-free result), 22.9 mAP on ScanNet++, and 24.1 mAP on Replica, all while processing each scene in 0.14 seconds.
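To make the serialization step concrete, here is a minimal sketch of how Morton-curve ordering and fixed-size attention windows could be built from raw point coordinates: quantize to a voxel grid, interleave the coordinate bits into a Morton code, sort, and chunk. The function names, bit depth, voxel size, and window size are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def morton_code(ix, iy, iz, bits=10):
    """Interleave bits of integer voxel coords (x, y, z) into a single Morton code."""
    code = np.zeros_like(ix)
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def serialize_points(xyz, voxel_size=0.02, window_size=1024):
    """Order points along a Morton curve and chop the sequence into attention windows."""
    grid = np.floor((xyz - xyz.min(axis=0)) / voxel_size).astype(np.int64)
    order = np.argsort(morton_code(grid[:, 0], grid[:, 1], grid[:, 2]))
    # Consecutive points along the curve are (mostly) spatial neighbours, so
    # fixed-size chunks of the sorted sequence serve as local attention windows.
    return [order[i:i + window_size] for i in range(0, len(order), window_size)]

# Example: 50k synthetic points in a unit cube -> roughly 49 windows of up to 1024 points.
points = np.random.rand(50_000, 3).astype(np.float32)
windows = serialize_points(points)
```

Because nearby points tend to share Morton-code prefixes, each chunk is roughly a local neighborhood, which is what lets window attention stay spatially coherent without quadratic all-pairs cost.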

What carries the argument

The space-curve transformer, which serializes 3D points along Morton curves to enable efficient spatial window attention and employs a rotary position embedding decoder to output instance masks from learned queries without any external proposals.
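One plausible reading of the RoPE-enhanced decoder, sketched below in PyTorch, is that rotary embeddings are computed per spatial axis from absolute 3D coordinates and applied to queries and keys before the query-to-scene cross-attention; instance masks then fall out as query-point similarities. The axis-wise channel split, the frequency base, and the single-head attention are our simplifying assumptions, not the authors' architecture.

```python
import torch

def rope_3d(feats, coords, base=100.0):
    """Rotary position embedding applied per spatial axis.

    feats:  (N, D) features, D divisible by 6 (3 axes x rotation pairs).
    coords: (N, 3) absolute xyz positions used as rotation "positions".
    """
    N, D = feats.shape
    d_axis = D // 3
    half = d_axis // 2
    freqs = base ** (-torch.arange(half, dtype=feats.dtype) / half)   # (half,)
    out = []
    for a in range(3):
        chunk = feats[:, a * d_axis:(a + 1) * d_axis]
        angles = coords[:, a:a + 1] * freqs                            # (N, half)
        cos, sin = torch.cos(angles), torch.sin(angles)
        x1, x2 = chunk[:, :half], chunk[:, half:]
        out.append(torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1))
    return torch.cat(out, dim=-1)

def decode_masks(queries, q_xyz, scene_feats, scene_xyz):
    """Single round of RoPE-rotated cross-attention, then masks from query-point similarity."""
    q = rope_3d(queries, q_xyz)           # (Q, D)
    k = rope_3d(scene_feats, scene_xyz)   # (N, D)
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)        # (Q, N)
    refined = attn @ scene_feats           # refined query features
    masks = torch.sigmoid(refined @ scene_feats.T)                    # (Q, N) per-point mask scores
    return refined, masks
```

In the paper this refinement is described as iterative, alternating cross-attention with scene features and self-attention among queries (Figure 8); the sketch keeps only the single cross-attention step needed to show where the 3D rotary embedding enters.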

If this is right

  • Real-time deployment becomes feasible for robotics and AR/VR applications that require open-vocabulary 3D instance segmentation.
  • The approach eliminates dependence on slow aggregation of 2D foundation model outputs or post-processing of fragmented masks.
  • Direct mask prediction from queries removes the need for external region proposals or additional clustering stages.
  • The multi-view captioning pipeline produces training data with substantially higher mask recall than single-view methods (the recall metric is sketched just after this list).
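The mask-recall figure behind that last point (54.3% vs. 2.5% at IoU > 0.5, per the abstract) is, on a standard reading, the fraction of ground-truth instances covered by at least one pseudo-label mask above the IoU threshold. A generic sketch of that metric, not the authors' evaluation code:

```python
import numpy as np

def mask_recall(gt_masks, pred_masks, iou_thresh=0.5):
    """Fraction of ground-truth instance masks matched by some pseudo-label mask.

    gt_masks, pred_masks: boolean arrays of shape (num_instances, num_points),
    each row marking which scene points belong to one instance.
    """
    matched = 0
    for gt in gt_masks:
        inter = (gt & pred_masks).sum(axis=1)
        union = (gt | pred_masks).sum(axis=1)
        ious = inter / np.maximum(union, 1)
        if len(ious) and ious.max() >= iou_thresh:
            matched += 1
    return matched / max(len(gt_masks), 1)
```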

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Curve-based serialization of 3D points may offer a general way to make transformers efficient on unordered spatial data without quadratic attention costs.
  • Strong results from VLM-generated pseudo-labels suggest the method could scale to much larger collections of unlabeled 3D scans.
  • The reported inference speed opens the possibility of integrating the model into online 3D mapping pipelines that update instance labels frame by frame.

Load-bearing premise

The pseudo-labels created by clustering masks across multiple views and captioning them with a vision-language model must be sufficiently accurate and free of systematic biases for the model to learn useful patterns rather than artifacts.
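A minimal sketch of how such multi-view aggregation is commonly done, and how we assume it works here: lift each 2D mask to the set of 3D points it covers, then merge masks across views whenever their point sets overlap strongly (union-find over pairwise IoU). The merge threshold and data layout are illustrative assumptions.

```python
def cluster_view_masks(view_masks, iou_merge=0.5):
    """Greedy union-find merge of per-view masks (as 3D point-index sets) into instances."""
    masks = [set(m) for m in view_masks]          # each mask: indices of covered 3D points
    parent = list(range(len(masks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            inter = len(masks[i] & masks[j])
            union = len(masks[i] | masks[j])
            if union and inter / union >= iou_merge:
                parent[find(j)] = find(i)         # same physical instance seen from two views

    instances = {}
    for i, m in enumerate(masks):
        instances.setdefault(find(i), set()).update(m)
    return list(instances.values())               # each element: one 3D pseudo-instance
```

Each merged point set is then captioned from several representative views (Figure 3), which is where the multi-view consistency of the labels comes from, and also where systematic VLM biases could enter.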

What would settle it

Retraining the identical model architecture on the same scenes but using single-view captions instead of the multi-view clustered captions, then measuring whether zero-shot mAP on ScanNet200 falls below the prior best proposal-free baseline, would directly test the necessity of the dataset construction step.
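Written as a protocol, that test holds the architecture and training scenes fixed and swaps only the caption source. The helper names and the baseline constant below are placeholders; only the comparison logic matters.

```python
# Hypothetical experiment grid for the decisive ablation; train() and eval_map()
# stand in for the (unreleased) training and ScanNet200 evaluation code.
PRIOR_BEST_PROPOSAL_FREE_MAP = 4.0   # placeholder value, not a number from the paper

def settle_it(train, eval_map, scenes):
    results = {}
    for caption_source in ("multi_view_clustered", "single_view"):
        model = train(architecture="SpaCeFormer", scenes=scenes, captions=caption_source)
        results[caption_source] = eval_map(model, benchmark="ScanNet200")
    # The dataset-construction step is necessary if the single-view variant drops
    # below the prior best proposal-free result while the multi-view one does not.
    return results, results["single_view"] < PRIOR_BEST_PROPOSAL_FREE_MAP
```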

Figures

Figures reproduced from arXiv: 2604.20395 by Chris Choy, Chunghyun Park, Jan Kautz, Junha Lee, Minsu Cho.

Figure 1
Figure 1: Accuracy vs. latency on Replica zero-shot open-vocabulary 3D instance segmentation. SpaCeFormer (red star) reaches 24.1 mAP at 0.14 s per scene, Pareto-optimal among methods that use no GT 3D supervision and 2–3 orders of magnitude faster than multi-stage 2D+3D pipelines, while using only 3D input (no 2D RGB-D streams, no external region proposals). SOLE is the only method above us on mAP (24.7), but requir… view at source ↗
Figure 2
Figure 2: Dataset Generation Pipeline. Our pipeline generates high-quality 3D mask-caption pairs by (1) aggregating 2D masks into 3D instances through training-free multi-view clustering and (2) generating diverse, intrinsic-focused captions via structured multi-view VLM prompting. The process leverages large-scale RGB-D and video datasets to produce a massive corpus for open-vocabulary 3D learning. view at source ↗
Figure 3
Figure 3: Multi-view Caption Prompt for LLM. To generate high-fidelity captions, we select representative views of a 3D instance and provide them to a VLM in two formats: (a, c) cropped with spatial context and (b, d) background-removed masked views. See Appendix A.7.2 for details. view at source ↗
Figure 4
Figure 4: Office/Lounge Scene Example. A red leather armchair from an office waiting area, demonstrating the dataset’s coverage of furniture in professional settings. Captions capture material properties (leather), spatial context (near table, yellow wall), and functional affordances (seating, relaxation). view at source ↗
Figure 5
Figure 5: Comparison of Attention Approaches. (Left) Morton serialized attention showing the serialization path and resulting clusters. (Right) Non-overlapping window attention showing improved preservation of local geometric relationships. Quantitatively, our window-based approach reduces average within-window pairwise distance by 28.6% compared to Morton attention (visual coherence; see Sec. 4.3). view at source ↗
Figure 6
Figure 6: Attention Pattern Comparison. (Left) Morton code attention mask showing uniform block diagonal structure. (Middle) Window attention mask showing variable-sized blocks based on spatial proximity. (Right) Sorted window attention mask where pixels are reordered by window index to recover a block diagonal structure while preserving spatial relationships. Red, green, and purple lines indicate Morton, spatial, … view at source ↗
Figure 7
Figure 7: Attention block diagrams. (Left to right) Attention blocks of ViT, CvT, PointTransformer, and our SpaCeFormer. Each shows vertical flow with residuals after MHSA and FeedForward. LayerNorm, DropPath, and projection shortcuts are omitted for clarity. view at source ↗
Figure 8
Figure 8: RoPE-Enhanced Instance Segmentation Decoder Architecture. The decoder iteratively refines query tokens through cross-attention with scene features and self-attention among queries. RoPE is applied separately to encode 3D spatial relationships in both attention mechanisms using absolute coordinates from the scene. view at source ↗
Figure 10
Figure 10: Novel Category Segmentation on Matterport3D. We visualize predictions on categories absent from the ScanNet200 taxonomy. Left: input point cloud. Right: predicted mask (red). The results demonstrate that our model generalizes to novel objects unseen during training, such as a toy and a Christmas tree. view at source ↗
Figure 9
Figure 9: Qualitative Results on ScanNet200. We visualize the 3D masks predicted by our model. The results demonstrate the ability to segment a wide range of objects, including furniture like sofa chairs, mini fridges, monitors, and office chairs. view at source ↗
Figure 11
Figure 11: Attention normalization variants. (Left to right) Default, PreNorm, and PostNorm attention block architectures. Default applies LayerNorm (LN) before branching into attention/MLP, with residuals from normalized values. PreNorm applies LN inline before operations, with residuals from original values. PostNorm applies LN after operations and residual additions. Residual connections are shown with skip paths… view at source ↗
Figure 12
Figure 12: SWIN Windowing on a 2D Grid. (Left) Non-overlapping windows aligned to the grid. (Right) Non-overlapping windows shifted by half the window size, illustrating SWIN’s shifted windowing. view at source ↗
Figure 13
Figure 13: Bedroom Scene Examples. Traditional bedroom furniture including a nightstand, pillow, ornate table, and desk chair. Captions highlight fine-grained details such as material texture (woven fabric, wood grain), decorative elements (brass hardware, carved details, checkered pattern), and spatial arrangements (near bed, adjacent to desk). view at source ↗
Figure 14
Figure 14: Retail/Bedroom Scene Examples. Mixed-use space with storage and display elements including a woven basket, wall-mounted shelf, curtains, and storage shelf. Captions capture material composition (wicker, composite, linen, wood), functional purposes (storage, display, privacy), and aesthetic qualities (rustic, minimalist, calming atmosphere). view at source ↗
Figure 15
Figure 15: Additional Qualitative Results (Scene 0462). We show 3D mask predictions for the Copy Room scene, demonstrating segmentation of office equipment and furniture. view at source ↗
Figure 16
Figure 16: Additional Qualitative Results (Scene 0019). We show 3D mask predictions for the Kitchenette scene, highlighting the ability to segment small objects like cups and bowls. view at source ↗
Figure 17
Figure 17: Additional Qualitative Results (Scene 0474). We show 3D mask predictions for the Office scene, including computer peripherals and furniture. view at source ↗
Figure 18
Figure 18: Additional Novel Category Segmentation on Matterport3D (1/2). We evaluate on categories absent from the ScanNet200 taxonomy. The left column shows the input point cloud, and the right column shows the predicted mask highlighted in red. view at source ↗
Figure 19
Figure 19: Additional Novel Category Segmentation on Matterport3D (2/2). We evaluate on categories absent from the ScanNet200 taxonomy. The left column shows the input point cloud, and the right column shows the predicted mask highlighted in red. view at source ↗
Figure 20
Figure 20: Missed Ground Truth Visualizations. We visualize examples where the network predictions failed to achieve an IoU > 0.5. These cases include challenging instances such as partially occluded desks or geometrically complex bookshelves. view at source ↗
Figure 21
Figure 21: Confusion matrix of the validation set predictions. The values are row-normalized (recall). The model achieves high accuracy on distinct large objects (e.g., bed, wall, floor) but shows some confusion between geometrically similar categories. view at source ↗
Figure 22
Figure 22: Top 100 most confused prediction pairs. The bar length represents the confusion probability, while the number inside each bar indicates the absolute count of failure cases. The y-axis labels show the Ground Truth → Prediction class assignments. view at source ↗
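Figures 5 and 12 invoke a spatial-coherence metric: the average pairwise distance between the 3D coordinates of voxels that land in the same attention window (lower is better; the window-based grouping is said to reduce it by 28.6% relative to Morton-chunk grouping). Assuming it is a plain mean over all within-window pairs, a direct sketch of that metric:

```python
import numpy as np

def mean_within_window_distance(xyz, windows):
    """Average pairwise distance between points assigned to the same attention window.

    xyz:     (N, 3) point/voxel coordinates.
    windows: list of index arrays, one per window.
    """
    dists = []
    for w in windows:
        p = xyz[w]
        if len(p) < 2:
            continue
        diff = p[:, None, :] - p[None, :, :]          # (k, k, 3)
        d = np.sqrt((diff ** 2).sum(-1))
        iu = np.triu_indices(len(p), k=1)             # unique pairs only
        dists.append(d[iu].mean())
    return float(np.mean(dists)) if dists else 0.0
```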
read the original abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SpaCeFormer, a proposal-free space-curve transformer for open-vocabulary 3D instance segmentation. It runs at 0.14 seconds per scene using spatial window attention, Morton-curve serialization, and a RoPE-enhanced decoder to predict instance masks directly from learned queries. The authors introduce the SpaCeFormer-3M dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) generated via multi-view mask clustering and VLM captioning, claiming 21x higher mask recall (54.3% vs. 2.5% at IoU > 0.5) than prior single-view pipelines. They report 11.1 zero-shot mAP on ScanNet200 (2.8x over prior best proposal-free method), 22.9 mAP on ScanNet++, and 24.1 mAP on Replica, surpassing prior methods including multi-view 2D approaches.

Significance. If the results hold after validation of the pseudo-label pipeline, the work would be significant for enabling real-time open-vocabulary 3D perception in robotics and AR/VR, offering 2-3 orders of magnitude faster inference than multi-stage 2D+3D pipelines while improving accuracy. The large-scale SpaCeFormer-3M dataset could become a valuable resource for training and benchmarking future open-vocabulary methods. The architectural choices (Morton-curve serialization combined with window attention) represent a practical adaptation of transformer techniques to 3D point clouds.

major comments (2)
  1. [Abstract / Dataset Construction] Abstract and dataset construction section: The headline zero-shot mAP gains (11.1 on ScanNet200, 22.9/24.1 on ScanNet++/Replica) and the 2.8x improvement claim rest on training with SpaCeFormer-3M pseudo-labels. While 54.3% recall at IoU > 0.5 is stated, no precision, caption fidelity (e.g., agreement with ground-truth labels), or cross-view semantic consistency metrics are reported. Without these, it remains possible that performance exploits systematic label noise or view-specific artifacts rather than demonstrating genuine open-vocabulary generalization; this is load-bearing for all empirical claims.
  2. [Experimental Evaluation] Experimental evaluation section: The abstract and results claim superiority over proposal-free and multi-view 2D baselines, but no ablations isolate the contribution of the space-curve transformer components (spatial window attention, Morton serialization, RoPE decoder) from the quality of the multi-view VLM pseudo-labels. This makes it difficult to attribute gains to the proposed architecture versus the new training data.
minor comments (1)
  1. [Abstract] The abstract refers to SpaCeFormer-3M as 'the largest' dataset but provides no direct size or diversity comparison table against prior open-vocabulary 3D datasets beyond the single recall number.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract / Dataset Construction] Abstract and dataset construction section: The headline zero-shot mAP gains (11.1 on ScanNet200, 22.9/24.1 on ScanNet++/Replica) and the 2.8x improvement claim rest on training with SpaCeFormer-3M pseudo-labels. While 54.3% recall at IoU > 0.5 is stated, no precision, caption fidelity (e.g., agreement with ground-truth labels), or cross-view semantic consistency metrics are reported. Without these, it remains possible that performance exploits systematic label noise or view-specific artifacts rather than demonstrating genuine open-vocabulary generalization; this is load-bearing for all empirical claims.

    Authors: We agree that additional quantitative validation of the pseudo-label pipeline would strengthen the claims. The original submission emphasized mask recall (54.3% vs. 2.5%) as the key indicator of multi-view improvement over single-view methods. To directly address concerns about label noise or view-specific artifacts, we will add precision metrics, caption fidelity scores (measured against ground-truth labels on overlapping scenes), and cross-view semantic consistency statistics to the revised dataset construction section. We note that the reported zero-shot results on ScanNet200 (and the other held-out benchmarks) were obtained on scenes and annotations not used during pseudo-label generation, which provides evidence that the gains reflect genuine open-vocabulary generalization rather than exploitation of training artifacts. revision: yes

  2. Referee: [Experimental Evaluation] Experimental evaluation section: The abstract and results claim superiority over proposal-free and multi-view 2D baselines, but no ablations isolate the contribution of the space-curve transformer components (spatial window attention, Morton serialization, RoPE decoder) from the quality of the multi-view VLM pseudo-labels. This makes it difficult to attribute gains to the proposed architecture versus the new training data.

    Authors: We acknowledge that explicit ablations separating architectural choices from data quality would improve attribution. The manuscript already compares SpaCeFormer against prior proposal-free methods trained on their respective (smaller) datasets, and the large margins (2.8x on ScanNet200) are consistent with the combined benefit of the space-curve design and the higher-quality multi-view labels. In the revision we will add targeted ablations, including (i) training the proposed architecture on prior pseudo-label sets and (ii) retraining a strong baseline architecture on SpaCeFormer-3M, to the extent the prior label sets are publicly available. These experiments will clarify the individual contributions while preserving the core efficiency claims. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on independent benchmarks

full rationale

The paper's core claims consist of an empirical architecture (space-curve transformer with Morton serialization and RoPE decoder) trained on a newly constructed pseudo-labeled dataset (SpaCeFormer-3M) and evaluated via standard zero-shot mAP metrics on held-out benchmarks (ScanNet200, ScanNet++, Replica). No equations, predictions, or uniqueness theorems are presented that reduce outputs to inputs by construction. Dataset construction via multi-view clustering and VLM captioning is an input-generation step whose quality is externally verifiable against ground truth, not a self-referential fit. No self-citations are invoked as load-bearing mathematical facts forbidding alternatives. The reported speed and accuracy numbers are direct measurements, not derived quantities forced by the training procedure itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on standard transformer assumptions and the quality of external VLM captions.

pith-pipeline@v0.9.0 · 5586 in / 1241 out tokens · 30546 ms · 2026-05-10T00:03:06.892691+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 10 canonical work pages · 4 internal anchors

  1. [1]

    Matterport3d: Learning from rgb-d data in indoor environments

    Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., and Zhang, Y. Matterport3d: Learning from rgb-d data in indoor environments. In 7th IEEE International Conference on 3D Vision, 3DV 2017, pp. 667–676. Institute of Electrical and Electronics Engineers Inc.,

  2. [2]

    Details Matter: Accurate 3D Open-Vocabulary Instance Segmentation

    Chen, Y., Guo, X., Chen, W., and Wang, Y. Details matter: Accurate 3d open-vocabulary instance segmentation. arXiv preprint arXiv:2507.23134,

  3. [3]

    OpenSplat3D: Open-Vocabulary 3D Instance Segmentation with Gaussian Splatting

    Koch, S., Navarro, M., Avetisyan, A., and Dai, A. Opensplat3d: Open-vocabulary 3d instance segmentation with gaussian splatting. arXiv preprint arXiv:2503.07768,

  4. [4]

    Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation

    Nguyen, K., de Silva Edirimuni, D., Hassan, G. M., and Mian, A. Retrieving objects from 3d scenes with box-guided open-vocabulary instance segmentation. arXiv preprint arXiv:2512.19088,

  5. [5]

    Representation Learning with Contrastive Predictive Coding

    Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748,

  6. [6]

    High quality entity segmentation

    Qi, L., Kuen, J., Shen, T., Gu, J., Li, W., Guo, W., Jia, J., Lin, Z., and Yang, M.-H. High quality entity segmentation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4024–4033. IEEE,

  7. [7]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159,

  8. [8]

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J. J., Mur-Artal, R., Ren, C., Verber, S., Clarkson, J., Yan, Q., Wang, S., Alcantarilla, P., Cabezas, I., Chapin, L., De Nardi, R., Frank, B., Golber, O., Goldman, D., Haenel, P., Kendall, A., Leon, S., Lovegrove, S., Lv, C., Mudrazija, N., Peris, R., Rennie, S., Restrepo, L., Rodr...

  9. [9]

    OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

    Zhou, Z., Wei, S., Wang, Z., Wang, C., Yan, X., and Liu, X. OpenTrack3D: Towards accurate and generalizable open-vocabulary 3d instance segmentation. arXiv preprint arXiv:2512.03532,

  10. [10]

    (Right) Non-overlapping windows shifted by half the window size, illustrating SWIN’s shifted windowing

    Figure 12. SWIN Windowing on a 2D Grid. (Left) Non-overlapping windows aligned to the grid. (Right) Non-overlapping windows shifted by half the window size, illustrating SWIN’s shifted windowing. where pi = (xi, yi, zi) and pj = (xj, yj, zj) are the 3D coordinates of voxels vi and vj in window W. Lower values indicate better spatial coherence. A.5.2. WIND...

  11. [11]

    views” of the same subject. Your job is to synthesize these views into one cohesive description that focuses on the subject’s intrinsic properties. Note that this “object

    to generate intrinsic and spatially consistent captions: 21 SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation Full System Prompt You are an expert image-captioning assistant. You will be given multiple “views” of the same subject. Your job is to synthesize these views into one cohesive description that focuses on the subject’s intri...

  12. [12]

    mAPtail SpaCeFormer (Default) 7.78 14.24 18.93 10.58 7.14 5.28 SpaCeFormer (PreNorm) 7.60 13.73 18.57 10.62 6.47 5.42 SpaCeFormer (PostNorm) 6.23 12.34 17.43 9.45 5.21 3.70 A.11.1

    Method mAP mAP 50 mAP25 mAPhead mAPcom. mAPtail SpaCeFormer (Default) 7.78 14.24 18.93 10.58 7.14 5.28 SpaCeFormer (PreNorm) 7.60 13.73 18.57 10.62 6.47 5.42 SpaCeFormer (PostNorm) 6.23 12.34 17.43 9.45 5.21 3.70 A.11.1. SAMPLINGSTRATEGY Table 11 analyzes different key sampling strategies for the decoder. We find thatRandomsampling outperforms Full sampli...

  13. [13]

    27 SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation Table 14.Validation mIoU (%) on ScanNet20, ScanNet++, ScanNet200, and Matterport3D

    The diagonal dominance indicates strong performance across most categories, though some confusion persists between semantically similar classes such as table/desk and cabinet/shelves. 27 SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation Table 14.Validation mIoU (%) on ScanNet20, ScanNet++, ScanNet200, and Matterport3D. We use the sa...