TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting

Ao Li; Hang Zhang; Renhe Zhang; Xin Tan; Yuyang Tan

arxiv: 2605.26576 · v1 · pith:PRAK6NIXnew · submitted 2026-05-26 · 💻 cs.CV · cs.LG


Pith reviewed 2026-06-29 18:37 UTC · model grok-4.3


keywords: 3D Gaussian Splatting · Referring Segmentation · Open-World · Multi-View Consistency · Track-then-Label · Semantic Consensus · Embodied AI · Natural Language Grounding


The pith

TrackRef3D achieves open-world referring segmentation in 3D Gaussian Splatting without manual annotation by decoupling object discovery from semantic grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TrackRef3D as a fully automatic pipeline for referring segmentation in 3D Gaussian Splatting scenes using natural language. It introduces a track-then-label paradigm that first discovers objects via multi-view tracking and then grounds them semantically, avoiding per-scene manual labels and multi-view inconsistencies. The Trajectory-Aware Semantic Consensus Module aggregates predictions through clustering and voting to create stable identities. A hybrid training strategy and visibility-aware descriptions handle broad and specific queries. If correct, this would allow language-based 3D object interaction without costly human annotation per environment.

Core claim

TrackRef3D is a fully automatic pipeline that achieves open-world referring segmentation in 3D Gaussian Splatting without manual annotation. It introduces a multi-view consistent track-then-label paradigm that decouples object discovery from semantic grounding. The Trajectory-Aware Semantic Consensus Module aggregates cross-view predictions via synonymous clustering and trajectory-aware voting, and a Hybrid Training Strategy jointly optimizes coarse category semantics and fine-grained referential cues.

What carries the argument

The track-then-label paradigm with the Trajectory-Aware Semantic Consensus Module (TSCM), which aggregates cross-view predictions via synonymous clustering and trajectory-aware voting to establish a canonical semantic identity.
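
The aggregation step can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the synonym table, the per-view weights (standing in for something like visible mask area), and the function names are hypothetical, and the paper's module presumably clusters labels with text embeddings rather than a lookup table.

```python
from collections import defaultdict

# Hypothetical synonym table (assumption); a real system would more likely
# cluster labels by text-embedding similarity.
SYNONYMS = {"couch": "sofa", "settee": "sofa", "mug": "cup"}


def canonical(label: str) -> str:
    """Map a raw per-view label onto its synonym-cluster representative."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)


def vote_identity(track):
    """Pick one label for a whole trajectory.

    `track` is a list of (label, weight) pairs, one per view. The weight is
    an assumed stand-in for how well the object is seen in that view.
    """
    scores = defaultdict(float)
    for label, weight in track:
        scores[canonical(label)] += weight
    # Iterate over sorted keys so ties break alphabetically.
    return max(sorted(scores), key=lambda k: scores[k])


track = [("couch", 0.9), ("sofa", 0.7), ("bed", 1.0), ("Settee", 0.4)]
print(vote_identity(track))
```

In this toy track the single most confident view says "bed", yet the three synonymous votes pool their weight and the trajectory is labelled "sofa". That pooling is the behaviour the consensus module is described as providing.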

Load-bearing premise

The Trajectory-Aware Semantic Consensus Module can reliably aggregate cross-view predictions via synonymous clustering and trajectory-aware voting to establish a canonical semantic identity that remains correct under varying query specificities and view inconsistencies.

What would settle it

A scene and set of queries where the module assigns inconsistent semantic identities to the same tracked object across rephrased natural language inputs or additional viewpoints.
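
A check of this kind could be scripted roughly as follows. This is a sketch under stated assumptions: `ground` is a hypothetical callable that maps a query string to a track id, and `toy_ground` is a keyword lookup standing in for a trained model, not anything from the paper.

```python
def inconsistent_cases(ground, paraphrase_sets):
    """Return the paraphrase groups that do not resolve to a single track id.

    `ground` is any callable mapping a query string to a track id
    (hypothetical interface). `paraphrase_sets` maps a group name to a list
    of rephrasings that should all refer to the same object.
    """
    failures = {}
    for name, queries in paraphrase_sets.items():
        ids = {q: ground(q) for q in queries}
        if len(set(ids.values())) > 1:
            failures[name] = ids
    return failures


def toy_ground(query):
    """Toy stand-in for a trained model: plain keyword lookup."""
    q = query.lower()
    if "mug" in q or "cup" in q:
        return 3
    if "sofa" in q:
        return 7
    return -1  # nothing matched


sets = {
    "cup": ["the red mug", "cup next to the laptop", "red coffee cup"],
    "sofa": ["the sofa", "the couch by the window"],
}
print(inconsistent_cases(toy_ground, sets))
```

The toy model fails the "sofa" group because it does not know "couch", which is the kind of counterexample that would count against the premise. The same loop could be run over added viewpoints instead of rephrasings.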

Figures

Figures reproduced from arXiv: 2605.26576 by Ao Li, Hang Zhang, Renhe Zhang, Xin Tan, Yuyang Tan.

**Figure 1.** Given only unannotated images and a reconstructed 3D Gaussian field for geometry, we perform the Trajectory-Aware Semantic Consensus Module (TSCM) to obtain canonical semantic identities and multi-granularity descriptions supervised by the Hybrid Training Strategy (HTS). The result is a language-grounded 3D Gaussian referring field that supports short and long queries. conditioned 3D referring segmentation…

**Figure 2.** (a) Previous methods rely on manually annotated category labels for per-view pseudo mask generation and referring descriptions, which are brittle to occlusions and domain-specific vocabulary, leading to multi-view inconsistent supervision. (b) TrackRef3D automatically generates cross-view consistent pseudo masks and multi-granularity descriptions via video tracking with clustering and voting, enabling con…

**Figure 3.** Overview of TrackRef3D. Starting from unannotated cross-view images, we first generate initial masks and labels. The Trajectory-Aware Semantic Consensus Module (TSCM) associates per-view masks into stable trajectories via video tracking and ensures multi-view consistency by establishing a canonical semantic identity through clustering and voting. It further generates descriptions based on the selected keyf…

**Figure 4.** Visualizations on the Ref-LERF dataset of the key components. The blue highlights spatial cues. strates that our automatic pipeline can provide more consistent training signals than per-view pseudo supervision derived from manual annotations. Traditional 2D method Grounded SAM (Ren et al., 2024b) struggles with 3D consistency, while the neural field approach SPIn-NeRF (Mirzaei et al., 2023) achieves onl…

**Figure 5.** Visualizations on the Ref-LERF dataset and our self-collected Laboratory scene. The blue highlights spatial cues.

Original abstract

Referring 3D Gaussian Splatting (R3DGS), which utilizes natural language for 3D object segmentation, has emerged as a crucial capability for embodied AI. However, existing methods typically rely on expensive per-scene manual annotation and per-view pseudo mask generation, which suffer from multi-view inconsistency and poor generalization to varying query specificities. To address this, we present TrackRef3D, a fully automatic pipeline that achieves open-world referring segmentation in 3D Gaussian Splatting (3DGS) without manual annotation by introducing a multi-view consistent track-then-label paradigm that fundamentally decouples object discovery from semantic grounding. Specifically, we propose a Trajectory-Aware Semantic Consensus Module (TSCM) which aggregates cross-view predictions via synonymous clustering and trajectory-aware voting to establish a canonical semantic identity, thereby ensuring multi-view consistency. Furthermore, we employ a visibility-aware description generation strategy to mitigate ambiguity and propose a Hybrid Training Strategy (HTS) that jointly optimizes coarse category semantics and fine-grained referential cues to ensure robustness under varying query specificities using a multi-positive contrastive objective. Extensive experiments on benchmarks demonstrate that TrackRef3D achieves state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TrackRef3D introduces a track-then-label pipeline with TSCM and HTS modules to automate referring segmentation in 3DGS without manual annotation, but the abstract supplies no results or implementation details to check whether the consistency claims hold.


The main point is a new automatic pipeline for open-world referring segmentation in 3D Gaussian Splatting. It decouples object discovery from language grounding by first tracking trajectories across views and then assigning consistent semantic labels, avoiding the per-scene manual masks that prior work needs.

The named pieces that appear new are the Trajectory-Aware Semantic Consensus Module, which clusters synonymous predictions and applies trajectory-aware voting to create a single canonical identity, the visibility-aware description generation step, and the Hybrid Training Strategy that combines coarse category and fine-grained referential signals under a multi-positive contrastive loss. These target the stated problems of multi-view flips and sensitivity to query specificity.
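
The abstract does not say how visibility enters the description step. One plausible reading, sketched here purely as an assumption, is that each tracked object is described from the view where it is least occluded. The field names `visible_px` and `full_px` and the 0.5 threshold are hypothetical.

```python
def pick_keyframe(views, min_visibility=0.5):
    """Choose the view from which to describe a tracked object.

    Each view is a dict with hypothetical fields:
      view_id    - identifier of the camera view
      visible_px - pixels of the object actually seen in that view
      full_px    - pixels the object would cover if nothing occluded it
    Views whose visible fraction is below `min_visibility` are skipped;
    among the rest, the one with the most visible pixels wins.
    Returns None when no view passes the threshold.
    """
    best_id, best_px = None, -1
    for v in views:
        if v["full_px"] == 0:
            continue  # object is entirely outside this view
        ratio = v["visible_px"] / v["full_px"]
        if ratio >= min_visibility and v["visible_px"] > best_px:
            best_id, best_px = v["view_id"], v["visible_px"]
    return best_id


views = [
    {"view_id": 0, "visible_px": 900, "full_px": 3000},  # large but mostly hidden
    {"view_id": 1, "visible_px": 800, "full_px": 1000},
    {"view_id": 2, "visible_px": 500, "full_px": 520},
]
print(pick_keyframe(views))
```

View 0 has the most visible pixels but shows only 30% of the object, so it is rejected and view 1 is selected. Filtering on the visible fraction before ranking by size is one simple way to avoid describing an object from a heavily occluded angle.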

The approach makes sense on paper for embodied AI settings where manual labeling is costly. Using trajectories to stabilize semantics across views is a direct response to the inconsistency issue.

The clear limitation is the complete absence of any experimental evidence. The abstract asserts state-of-the-art results and robustness but shows no datasets, no numbers, no ablations, and no error analysis. Without those, it is impossible to tell whether TSCM actually reduces inconsistency more than standard multi-view fusion or whether the contrastive objective handles query variation as claimed.

This work is aimed at the 3D vision and language grounding community. Readers working on practical 3DGS pipelines would find the high-level structure useful if the full paper contains reproducible experiments.

It deserves a serious referee because the problem is real and the proposed decoupling is structured, even though the current description is high-level. I would send it for review to see the methods and results sections.

Referee Report

0 major / 2 minor

Summary. The manuscript presents TrackRef3D, a fully automatic pipeline for open-world referring segmentation in 3D Gaussian Splatting. It introduces a multi-view consistent track-then-label paradigm that decouples object discovery from semantic grounding, along with the Trajectory-Aware Semantic Consensus Module (TSCM) that uses synonymous clustering and trajectory-aware voting, a visibility-aware description generation strategy, and a Hybrid Training Strategy (HTS) that jointly optimizes coarse category semantics and fine-grained referential cues via a multi-positive contrastive objective. The work claims state-of-the-art performance on benchmarks without requiring manual per-scene annotation.

Significance. If the central claims hold under detailed scrutiny of the methods and results, the work would represent a meaningful contribution to referring 3DGS and embodied AI by removing reliance on expensive manual annotations and addressing multi-view inconsistency and query-specificity generalization. The explicit decoupling of discovery from grounding via tracking is a conceptually clean framing that could influence subsequent pipelines.

minor comments (2)

[Abstract] The claim of 'state-of-the-art performance' and 'extensive experiments on benchmarks' is stated without naming the datasets, metrics, baselines, or quantitative margins, which prevents verification of the performance assertions.
[Abstract] The description of the Hybrid Training Strategy mentions a 'multi-positive contrastive objective' but provides no equation, loss formulation, or positive/negative sampling details, leaving the training procedure opaque.
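
For orientation, a common form that a multi-positive contrastive objective takes is InfoNCE averaged over several positives per anchor. The sketch below shows that generic form only. It is not the manuscript's loss, and the function name, the temperature value, and the pairing of one coarse category phrase with one fine-grained description as the two positives are all assumptions.

```python
import math


def multi_positive_nce(sims, positives, temperature=0.1):
    """Multi-positive InfoNCE for a single anchor (generic form, assumed).

    sims      - similarity between the anchor feature and each candidate text
    positives - indices of candidates that describe the anchor, for example
                a short category word and a long referring description
    Returns the mean over positives of -log softmax(sims / T)[p].
    """
    if not positives:
        raise ValueError("need at least one positive")
    logits = [s / temperature for s in sims]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return sum(log_z - logits[p] for p in positives) / len(positives)


# Candidates 0 and 1 are the positives; 2 and 3 are negatives.
good = multi_positive_nce([0.9, 0.8, 0.1, 0.0], positives=[0, 1])
bad = multi_positive_nce([0.1, 0.0, 0.9, 0.8], positives=[0, 1])
print(good < bad)
```

With uniform similarities the loss equals the log of the number of candidates, and it falls as both positives score above the negatives. A reported formulation would let readers check whether the manuscript's objective behaves the same way.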

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of TrackRef3D and for recognizing the potential contribution of the track-then-label paradigm, TSCM, visibility-aware descriptions, and HTS in enabling fully automatic open-world referring segmentation in 3DGS. We are prepared to address any specific concerns that would resolve the current uncertainty in the recommendation.

Circularity Check

0 steps flagged

No significant circularity detected


The abstract presents a methodological pipeline (TrackRef3D with TSCM for cross-view aggregation via clustering/voting and HTS for joint optimization) as a contribution without any equations, fitted parameters, self-citations, or derivation steps that reduce outputs to inputs by construction. No load-bearing claims are shown to be equivalent to their own definitions or prior self-references. The decoupling of discovery from grounding is described at the level of paradigm and modules, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5752 in / 1104 out tokens · 38390 ms · 2026-06-29T18:37:18.157733+00:00 · methodology


Reference graph

Works this paper leans on

10 extracted references · 7 canonical work pages · 2 internal anchors

[1] Chiou, Y.-J., Cheng, W.-T., and Yang, Y.-F. Profuse: Efficient cross-view context fusion for open-vocabulary 3d gaussian splatting. arXiv preprint arXiv:2601.04754.

[2] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186, 2019.

[3] Hong, W., Wang, W., Ding, M., Yu, W., Lv, Q., Wang, Y., Cheng, Y., Huang, S., Ji, J., Xue, Z., et al. Cogvlm2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500.

[4] Liang, S., Wang, S., Li, K., Niemeyer, M., Gasperini, S., Lensch, H., Navab, N., and Tombari, F. Supergseg: Open-vocabulary 3d segmentation with structured super-gaussians. arXiv preprint arXiv:2412.10231.

[5] Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu, C.-Y., Girshick, R., Dollár, P., and Feichtenhofer, C. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714.

[6] Reimers, N. and Gurevych, I. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020.

[7] Ren, T., Jiang, Q., Liu, S., Zeng, Z., Liu, W., Gao, H., Huang, H., Ma, Z., Jiang, X., Chen, Y., et al. Grounding dino 1.5: Advance the "edge" of open-set object detection. arXiv preprint arXiv:2405.10300, 2024a. Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al. Grounded sam: Assembling open-world model...

[8] Wang, S., Li, K., Liang, S., Alegret, E., Ma, J., Navab, N., and Gasperini, S. Visibility-aware language aggregation for open-vocabulary segmentation in 3d gaussian splatting. arXiv preprint arXiv:2509.05515.

[9] Werby, A., Huang, C., Büchner, M., Valada, A., and Burgard, W. Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.

[10] Wu, C., Ji, J., Wang, H., Ma, Y., Huang, Y., Luo, G., Fei, H., Sun, X., Ji, R., et al. Rg-san: Rule-guided spatial awareness network for end-to-end 3d referring expression segmentation. Advances in Neural Information Processing Systems, 37:110972–110999, 2024a. Wu, C., Liu, Y., Ji, J., Ma, Y., Wang, H., Luo, G., Ding, H., Sun, X., and Ji, R. 3d-gres:...
