pith. sign in

arxiv: 2512.03532 · v3 · submitted 2025-12-03 · 💻 cs.CV

OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

Pith reviewed 2026-05-17 02:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary 3D instance segmentationvisual-spatial trackermesh-freemulti-modal large language modelcross-view consistencyDINO featuresRGB-D streamgeneralization
0
0 comments X

The pith

OpenTrack3D uses an online visual-spatial tracker on lifted 2D masks to generate consistent 3D instance proposals without meshes or dataset-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish a generalizable method for open-vocabulary 3D instance segmentation that works in mesh-free, unstructured environments by building object proposals online from an RGB-D stream. It starts with a 2D open-vocabulary segmenter to create masks, lifts them to 3D point clouds using depth information, extracts mask-guided instance features with DINO, and uses a novel tracker to fuse visual and spatial cues for maintaining consistency across views. Replacing CLIP with a multi-modal large language model improves handling of complex compositional queries. A reader would care because this enables accurate 3D scene understanding for applications like robotics and AR/VR where traditional mesh-based or trained proposal methods fail to generalize.

Core claim

OpenTrack3D constructs cross-view consistent object proposals online using a visual-spatial tracker that fuses visual and spatial cues from depth-lifted 2D open-vocabulary masks and DINO feature maps. The core pipeline operates entirely without meshes, though an optional superpoints refinement is available when meshes exist. Substituting a multi-modal large language model for CLIP enhances compositional reasoning on user queries, leading to state-of-the-art results on benchmarks including ScanNet200, Replica, ScanNet++, and SceneFun3D.

What carries the argument

The visual-spatial tracker, which maintains instance consistency by fusing visual features from DINO and spatial cues from the 3D point clouds lifted from 2D masks.

If this is right

  • The framework applies to diverse unstructured scenes without requiring scene meshes or retraining proposal networks.
  • It supports complex natural language queries through improved reasoning with the multi-modal LLM.
  • Performance reaches state-of-the-art levels while maintaining strong generalization to novel scenes.
  • Optional mesh-based refinement further boosts accuracy when a scene mesh is provided.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the tracker reliably handles occlusions and motion, it could extend to dynamic real-time 3D tracking in robotics.
  • Connecting the lifted proposals to other 3D tasks like navigation or manipulation becomes feasible without additional training.
  • Testing on outdoor or highly cluttered environments would reveal the limits of the current 2D segmenter and depth lifting assumptions.

Load-bearing premise

That lifting 2D masks with depth and fusing DINO visual features with spatial cues in the tracker will produce reliable instance consistency across multiple views in varied unstructured scenes.

What would settle it

A set of RGB-D sequences from a new unstructured scene where the tracker frequently merges or splits the same physical objects across frames, leading to incorrect 3D instances.

Figures

Figures reproduced from arXiv: 2512.03532 by Chunjie Wang, Siyuan Wei, Xiao Liu, Xiaosheng Yan, Zengran Wang, Zhishan Zhou.

Figure 1
Figure 1. Figure 1: Qualitative and quantitative results of our method. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our Proposal Generation and Proposal Refinement stages. Proposal Generation employs a 2D open-vocabulary segmenter to generate object masks, which are lifted to 3D point clouds using depth maps and subsequently denoised. Mask-guided instance features are extracted from DINO feature maps, and our tracker fuses both visual and spatial cues to maintain instance consistency across frames. Proposal … view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of CLIP and MLLM on fine-grained, com [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparative Visualization of Open-Vocabulary 3D Instance Segmentation on ScanNet200. Red masks denote query-correlated [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative Results on ScanNet++ (left) and SceneFun3D (right). We demonstrate instance segmentation results for various [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
read the original abstract

Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces OpenTrack3D for open-vocabulary 3D instance segmentation in mesh-free, unstructured environments. It generates 2D masks via an open-vocabulary segmenter, lifts them to 3D point clouds with depth, extracts mask-guided DINO features, and fuses visual and spatial cues in a novel visual-spatial tracker to produce cross-view consistent proposals online. An optional superpoint refinement module is included when meshes are available, and CLIP is replaced by an MLLM to improve handling of compositional queries. The work claims SOTA results and strong generalization on ScanNet200, Replica, ScanNet++, and SceneFun3D.

Significance. If the results hold, the mesh-free design and online tracker would meaningfully advance OV-3DIS for robotics and AR/VR by removing reliance on dataset-specific proposals or precomputed meshes. Replacing CLIP with an MLLM for better compositional reasoning is a clear strength, and the optional refinement path allows fair comparison with mesh-based methods.

major comments (2)
  1. [§3.2] §3.2 (Visual-Spatial Tracker): The fusion of DINO visual features with spatial cues is described at a high level but does not specify the weighting scheme, update rules, or handling of depth noise/occlusions; this mechanism is load-bearing for the central claim of reliable cross-view consistency without mesh or per-dataset training.
  2. [§4.3] §4.3 (Experiments on SceneFun3D): The generalization claims to unstructured scenes rest on quantitative tables, yet no ablation isolating the tracker's contribution (e.g., tracker vs. naive depth-lifted masks) or error bars across runs is reported; this weakens the evidence that the novel component drives the reported gains.
minor comments (2)
  1. [§2] §2 (Related Work): A brief discussion of recent MLLM-based 3D methods (post-2023) would better contextualize the CLIP replacement choice.
  2. [Figure 4] Figure 4: The qualitative visualizations would benefit from explicit callouts showing where the tracker resolves identity switches that naive lifting fails on.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We are pleased that the referee recognizes the potential impact of our mesh-free design and the use of MLLM for compositional reasoning. We address the major comments below and plan to revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Visual-Spatial Tracker): The fusion of DINO visual features with spatial cues is described at a high level but does not specify the weighting scheme, update rules, or handling of depth noise/occlusions; this mechanism is load-bearing for the central claim of reliable cross-view consistency without mesh or per-dataset training.

    Authors: We agree that the description in §3.2 is somewhat high-level and will provide more technical details in the revision. Specifically, we will elaborate on the weighting scheme used to fuse DINO visual features with spatial cues (e.g., a learned or fixed combination based on similarity scores), the update rules for the tracker (including how proposals are matched and updated across frames), and our strategies for handling depth noise and occlusions, such as using temporal consistency checks and outlier rejection. These additions will strengthen the explanation of how cross-view consistency is achieved in mesh-free settings. revision: yes

  2. Referee: [§4.3] §4.3 (Experiments on SceneFun3D): The generalization claims to unstructured scenes rest on quantitative tables, yet no ablation isolating the tracker's contribution (e.g., tracker vs. naive depth-lifted masks) or error bars across runs is reported; this weakens the evidence that the novel component drives the reported gains.

    Authors: We appreciate this suggestion for strengthening the experimental validation. In the revised manuscript, we will include an ablation study on SceneFun3D that isolates the contribution of the visual-spatial tracker by comparing it against a naive baseline that uses depth-lifted masks without tracking. Additionally, we will report error bars or standard deviations from multiple runs to provide a more robust assessment of the results and confirm that the gains are attributable to the proposed tracker. revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline assembles independent off-the-shelf components

full rationale

The paper presents OpenTrack3D as an assembly of existing modules (2D open-vocabulary segmenter, depth lifting, DINO features, MLLM replacement for CLIP, and a described visual-spatial tracker) without any equations, fitted parameters, or self-referential definitions that reduce the claimed performance or consistency to quantities defined by the authors' own choices. No load-bearing step is shown to collapse by construction to prior self-citations or ansatzes; the tracker is introduced as a new fusion mechanism rather than derived from the paper's own inputs. The derivation chain therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method relies on the generalization ability of an off-the-shelf 2D open-vocabulary segmenter and DINO features when lifted to 3D; no explicit free parameters, axioms, or new physical entities are named in the abstract.

pith-pipeline@v0.9.0 · 5592 in / 1314 out tokens · 47886 ms · 2026-05-17T02:56:06.598907+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    SpaCeFormer delivers 11.1 zero-shot mAP on ScanNet200 (2.8x prior proposal-free best) and runs 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines by using spatial window attention and Morton-curve seriali...

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Open-yolo 3d: Towards fast and accurate open-vocabulary 3d instance segmentation.arXiv preprint arXiv:2406.02548, 2024

    Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, and Fahad Shahbaz Khan. Open-yolo 3d: Towards fast and accurate open-vocabulary 3d instance segmentation.arXiv preprint arXiv:2406.02548, 2024. 2, 5

  2. [2]

    Clip2scene: Towards label-efficient 3d scene understanding by clip

    Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, and Wen- ping Wang. Clip2scene: Towards label-efficient 3d scene understanding by clip. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7020–7030, 2023. 1, 2

  3. [3]

    Yolo-world: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xing- gang Wang, and Ying Shan. Yolo-world: Real-time open-vocabulary object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024. 2, 6, 1

  4. [4]

    Gsnerf: Generalizable semantic neural radiance fields with enhanced 3d scene understanding

    Zi-Ting Chou, Sheng-Yu Huang, I Liu, Yu-Chiang Frank Wang, et al. Gsnerf: Generalizable semantic neural radiance fields with enhanced 3d scene understanding. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20806–20815, 2024. 1

  5. [5]

    Functionality understanding and segmentation in 3d scenes

    Jaime Corsetti, Francesco Giuliari, Alice Fasoli, Davide Boscaini, and Fabio Poiesi. Functionality understanding and segmentation in 3d scenes. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24550– 24559, 2025. 2, 7

  6. [6]

    Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017. 2, 6, 1

  7. [7]

    Scenefun3d: Fine-grained functionality and affordance un- derstanding in 3d scenes

    Alexandros Delitzas, Ayca Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, and Francis Engelmann. Scenefun3d: Fine-grained functionality and affordance un- derstanding in 3d scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14531–14542, 2024. 2, 6, 7, 1

  8. [8]

    Lowis3d: Language-driven open-world instance-level 3d scene understanding.IEEE Transactions on Pattern Analysis and Machine Intelligence,

    Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Lowis3d: Language-driven open-world instance-level 3d scene understanding.IEEE Transactions on Pattern Analysis and Machine Intelligence,

  9. [9]

    Pla: Language-driven open- vocabulary 3d scene understanding

    Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open- vocabulary 3d scene understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7010–7019, 2023. 1, 2

  10. [10]

    A density-based algorithm for discovering clusters in large spatial databases with noise

    Martin Ester, Hans-Peter Kriegel, J ¨org Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Inkdd, pages 226–231,

  11. [11]

    Efficient graph-based image segmentation.International journal of computer vision, 59:167–181, 2004

    Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation.International journal of computer vision, 59:167–181, 2004. 4

  12. [12]

    Sam-guided graph cut for 3d instance segmentation

    Haoyu Guo, He Zhu, Sida Peng, Yuang Wang, Yujun Shen, Ruizhen Hu, and Xiaowei Zhou. Sam-guided graph cut for 3d instance segmentation. InEuropean Conference on Com- puter Vision, pages 234–251. Springer, 2024. 1, 2, 5

  13. [13]

    Segment3d: Learning fine-grained class-agnostic 3d segmentation without manual labels

    Rui Huang, Songyou Peng, Ayca Takmaz, Federico Tombari, Marc Pollefeys, Shiji Song, Gao Huang, and Francis Engel- mann. Segment3d: Learning fine-grained class-agnostic 3d segmentation without manual labels. InEuropean Confer- ence on Computer Vision, pages 278–295. Springer, 2024. 1, 2, 7

  14. [14]

    Openins3d: Snap and lookup for 3d open-vocabulary instance segmentation

    Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, and Joan Lasenby. Openins3d: Snap and lookup for 3d open-vocabulary instance segmentation. InEuropean Conference on Computer Vision, pages 169–185. Springer,

  15. [15]

    Details matter for indoor open- vocabulary 3d instance segmentation

    Sanghun Jung, Jingjing Zheng, Ke Zhang, Nan Qiao, Al- bert YC Chen, Lu Xia, Chi Liu, Yuyin Sun, Xiao Zeng, Hsiang-Wei Huang, et al. Details matter for indoor open- vocabulary 3d instance segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9627–9637, 2025. 2, 5, 8, 1

  16. [16]

    Lerf: Language embedded radiance fields

    Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 19729–19739,

  17. [17]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 4015–4026, 2023. 1

  18. [18]

    Oneformer3d: One transformer for unified point cloud segmentation

    Maxim Kolodiazhnyi, Anna V orontsova, Anton Konushin, and Danila Rukhovich. Oneformer3d: One transformer for unified point cloud segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20943–20953, 2024. 1

  19. [19]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 1

  20. [20]

    Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection, 2024

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marry- ing dino with grounded pre-training for open-set object de- tection, 2024. 1, 2

  21. [21]

    Ovir-3d: Open-vocabulary 3d in- stance retrieval without training on 3d data

    Shiyang Lu, Haonan Chang, Eric Pu Jing, Abdeslam Boular- ias, and Kostas Bekris. Ovir-3d: Open-vocabulary 3d in- stance retrieval without training on 3d data. InConference on Robot Learning, pages 1610–1620. PMLR, 2023. 2, 5

  22. [22]

    Isbnet: a 3d point cloud instance segmentation network with instance- aware sampling and box-aware dynamic convolution

    Tuan Duc Ngo, Binh-Son Hua, and Khoi Nguyen. Isbnet: a 3d point cloud instance segmentation network with instance- aware sampling and box-aware dynamic convolution. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13550–13559, 2023. 1

  23. [23]

    Open3dis: Open-vocabulary 3d instance segmentation with 9 2d mask guidance

    Phuc Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, and Khoi Nguyen. Open3dis: Open-vocabulary 3d instance segmentation with 9 2d mask guidance. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 4018–4028, 2024. 2, 5, 7, 1

  24. [24]

    Any3dis: Class-agnostic 3d instance segmentation by 2d mask tracking

    Phuc Nguyen, Minh Luu, Anh Tran, Cuong Pham, and Khoi Nguyen. Any3dis: Class-agnostic 3d instance segmentation by 2d mask tracking. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 3636–3645,

  25. [25]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 4, 1

  26. [26]

    Oa-cnns: Omni- adaptive sparse cnns for 3d semantic segmentation

    Bohao Peng, Xiaoyang Wu, Li Jiang, Yukang Chen, Heng- shuang Zhao, Zhuotao Tian, and Jiaya Jia. Oa-cnns: Omni- adaptive sparse cnns for 3d semantic segmentation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21305–21315, 2024. 1

  27. [27]

    Openscene: 3d scene understanding with open vocabularies

    Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 815–824, 2023. 1, 2, 5

  28. [28]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 2

  29. [29]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 1, 2, 6

  30. [30]

    Language- grounded indoor 3d semantic segmentation in the wild

    David Rozenberszki, Or Litany, and Angela Dai. Language- grounded indoor 3d semantic segmentation in the wild. In European conference on computer vision, pages 125–141. Springer, 2022. 1

  31. [31]

    org/10.48550/arXiv.2210.03105

    Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe. Mask3d: Mask trans- former for 3d semantic instance segmentation.arXiv preprint arXiv:2210.03105, 2022. 1, 5

  32. [32]

    Spherical mask: Coarse-to- fine 3d point cloud instance segmentation with spherical rep- resentation

    Sangyun Shin, Kaichen Zhou, Madhu Vankadari, Andrew Markham, and Niki Trigoni. Spherical mask: Coarse-to- fine 3d point cloud instance segmentation with spherical rep- resentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4060– 4069, 2024. 1

  33. [33]

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. S...

  34. [34]

    Alpha- clip: A clip model focusing on wherever you want

    Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha- clip: A clip model focusing on wherever you want. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13019–13029, 2024. 2

  35. [35]

    Openmask3d: Open-vocabulary 3d instance segmenta- tion,

    Ayc ¸a Takmaz, Elisabetta Fedele, Robert W Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann. Open- mask3d: Open-vocabulary 3d instance segmentation.arXiv preprint arXiv:2306.13631, 2023. 2, 5, 7

  36. [36]

    Nesf: Neural semantic fields for generalizable semantic segmentation of 3d scenes

    Suhani V ora, Noha Radwan, Klaus Greff, Henning Meyer, Kyle Genova, Mehdi SM Sajjadi, Etienne Pot, Andrea Tagliasacchi, and Daniel Duckworth. Nesf: Neural semantic fields for generalizable semantic segmentation of 3d scenes. arXiv preprint arXiv:2111.13260, 2021. 1

  37. [37]

    Softgroup for 3d instance segmentation on point clouds

    Thang Vu, Kookhoi Kim, Tung M Luu, Thanh Nguyen, and Chang D Yoo. Softgroup for 3d instance segmentation on point clouds. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2708– 2717, 2022. 1

  38. [38]

    Maskclus- tering: View consensus based mask graph clustering for open-vocabulary 3d instance segmentation

    Mi Yan, Jiazhao Zhang, Yan Zhu, and He Wang. Maskclus- tering: View consensus based mask graph clustering for open-vocabulary 3d instance segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 28274–28284, 2024. 2, 5

  39. [39]

    Regionplc: Regional point-language contrastive learning for open-world 3d scene understanding

    Jihan Yang, Runyu Ding, Weipeng Deng, Zhe Wang, and Xi- aojuan Qi. Regionplc: Regional point-language contrastive learning for open-world 3d scene understanding. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19823–19832, 2024. 1, 2

  40. [40]

    Scannet++: A high-fidelity dataset of 3d indoor scenes, 2023

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes, 2023. 2, 6, 1

  41. [41]

    Sai3d: Segment any instance in 3d scenes

    Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jingwei Huang, and Baoquan Chen. Sai3d: Segment any instance in 3d scenes. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 3292–3302, 2024. 2, 5

  42. [42]

    Sam2object: Consolidating view consistency via sam2 for zero-shot 3d instance segmentation

    Jihuai Zhao, Junbao Zhuo, Jiansheng Chen, and Huimin Ma. Sam2object: Consolidating view consistency via sam2 for zero-shot 3d instance segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19325–19334, 2025. 2, 5

  43. [43]

    In-place scene labelling and understanding with implicit scene representation

    Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and An- drew J Davison. In-place scene labelling and understanding with implicit scene representation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15838–15847, 2021. 1 10 OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation Suppleme...

  44. [44]

    We deploy Qwen3-4B using vLLM [19] on a single A100 GPU and use the default gen- eration settings unless otherwise noted

    More Implementation Details In this section, we provide additional implementation de- tails of our MLLM components to improve understand- ing and reproducibility. We deploy Qwen3-4B using vLLM [19] on a single A100 GPU and use the default gen- eration settings unless otherwise noted. For the SceneFun3D [7] dataset, the user-provided task descriptions cann...

  45. [45]

    Sensitivity analysis of theτ match parameter on subsets of the ScanNet200 validation set

    Hyperparameter Sensitivity Analysis τmatch AP AP 50 AP25 0.6 26.4 38.3 46.4 0.5 28.2 39.6 47.9 0.4 28.6 40.5 48.8 0.3 27.3 38.2 45.9 Table 7. Sensitivity analysis of theτ match parameter on subsets of the ScanNet200 validation set. τexp AP τvis AP γAP τmerge AP 0.00 28.0 0.1 28.6 0.5 27.6 0.8 27.6 0.03 28.2 0.229.1 0.4 28.6 0.7 28.4 0.0528.6 0.3 27.2 0.32...

  46. [46]

    Most of our runtime is spent on model inference

    Runtime Analysis We analyze the runtime of our framework and compare it with the previous state-of-the-art, DetailMatters [15]. Most of our runtime is spent on model inference. The cost of the 2D open-vocabulary segmenter and DINO scales with the number of video frames, while the proposal-classification cost (via either MLLM or CLIP [28]) mainly depends o...