CLIP-GS: CLIP-Informed Gaussian Splatting for View-Consistent 3D Indoor Semantic Understanding

Guibiao Liao; Jiankun Li; Kanglin Liu; Qing Li; Xiaoqing Ye; Zhenyu Bao

arxiv: 2404.14249 · v2 · pith:7NEGTMP7new · submitted 2024-04-22 · 💻 cs.CV

Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Qing Li, Kanglin Liu

classification 💻 cs.CV

keywords semantic, gaussian, clip, coherent, indoor, understanding, achieving, approach

Exploiting 3D Gaussian Splatting (3DGS) with Contrastive Language-Image Pre-Training (CLIP) models for open-vocabulary 3D semantic understanding of indoor scenes has emerged as an attractive research focus. Existing methods typically attach high-dimensional CLIP semantic embeddings to 3D Gaussians and leverage view-inconsistent 2D CLIP semantics as Gaussian supervision, resulting in efficiency bottlenecks and deficient 3D semantic consistency. To address these challenges, we present CLIP-GS, which efficiently achieves a coherent semantic understanding of 3D indoor scenes via the proposed Semantic Attribute Compactness (SAC) and 3D Coherent Regularization (3DCR). SAC exploits the naturally unified semantics within objects to learn compact yet effective semantic Gaussian representations, enabling highly efficient rendering (>100 FPS). 3DCR enforces semantic consistency in both the 2D and 3D domains: in 2D, 3DCR utilizes refined view-consistent semantic outcomes derived from 3DGS to establish cross-view coherence constraints; in 3D, 3DCR encourages similar features among 3D Gaussian primitives associated with the same object, leading to more precise and coherent segmentation results. Extensive experimental results demonstrate that our method substantially surpasses existing state-of-the-art approaches, achieving mIoU improvements of 21.20% and 13.05% on the ScanNet and Replica datasets, respectively, while maintaining real-time rendering speed. Furthermore, our approach exhibits superior performance even with sparse input data, substantiating its robustness.
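
The headline results in the abstract are reported as mIoU (mean intersection-over-union). As a reading aid, the following is a minimal sketch of how mIoU is commonly computed from per-pixel class labels. It is not taken from the paper: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are all assumptions made for illustration, and the authors' evaluation protocol may differ.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over classes present in pred or target.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Intersection: pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Union: pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, so their mean is 0.5.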

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Segment 3D Gaussians via Launching Virtual Drones
cs.CV 2026-07 unverdicted novelty 7.0

SAGO achieves setup-free interactive 3D Gaussian segmentation by modeling it as an online NBV planning task in a Markov process, delivering sub-second latency and over 50x speedup over prior setup-free methods.

SAD-GS: Learning Reliable 3D Semantic Gaussian Fields via Dynamic Geo-Semantic Anchoring
cs.CV 2026-06 unverdicted novelty 5.0

SAD-GS proposes dynamic geo-semantic anchoring via SAD and GSFL to learn reliable 3D semantic Gaussian fields, reporting best performance on LERF-OVS, 3D-OVS, and Mip-NeRF360 for open-vocabulary localization and segmentation.

LIVE-GS: LLM Powers Interactive VR Experience with Physics-Aware Gaussian Splatting
cs.HC 2024-12 unverdicted novelty 5.0

LIVE-GS uses an LLM to predict physical parameters from static Gaussian assets in 10 seconds for physics-aware VR interactions, validated by interviews, baseline comparisons, and user studies.

A Survey on 3D Gaussian Splatting
cs.CV 2024-01 unverdicted novelty 2.0

A survey compiling principles, applications, benchmarks, and challenges of 3D Gaussian Splatting for explicit 3D scene representation.