pith. machine review for the scientific record.

arxiv: 2604.09045 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: unknown

Scene-Agnostic Object-Centric Representation Learning for 3D Gaussian Splatting

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D Gaussian Splatting · object-centric learning · scene-agnostic codebook · unsupervised object masks · 3D scene understanding · representation learning · GOCL

The pith

A pre-trained GOCL module's masks coupled with a scene-agnostic codebook enable direct object supervision in 3D Gaussian Splatting across scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that object representations in 3D Gaussian Splatting can be made scene-agnostic by learning a dataset-level codebook on top of a pre-trained Global Object Centric Learning module. This would matter because current methods inherit scene-dependent identities from foundation-model masks, which require extra processing to resolve conflicts across views. The approach instead supervises Gaussian identity features directly with the codebook and the module's masks, yielding better generalization without per-scene adjustments. Sympathetic readers would see value in structured 3D representations for tasks requiring object-level understanding.

Core claim

We propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting. Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module's unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining.
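The abstract states the coupling but not the objective. Below is a minimal sketch of what codebook-coupled identity supervision could look like; the tensor layout, the cosine-similarity logits, and the cross-entropy choice are all assumptions for illustration, not the paper's stated formulation.

```python
import torch
import torch.nn.functional as F

def identity_supervision_loss(rendered_feats, mask_indices, codebook):
    """Hypothetical sketch of codebook-coupled identity supervision.

    rendered_feats: (H, W, D) per-pixel identity features splatted from 3D Gaussians
    mask_indices:   (H, W) long tensor of per-pixel codebook indices derived
                    from the GOCL module's unsupervised object masks
    codebook:       (K, D) scene-agnostic object codebook
    """
    H, W, D = rendered_feats.shape
    feats = F.normalize(rendered_feats.reshape(-1, D), dim=-1)   # (H*W, D)
    entries = F.normalize(codebook, dim=-1)                      # (K, D)
    logits = feats @ entries.t()                                 # (H*W, K) cosine scores
    # Pull each pixel's rendered feature toward its mask-assigned codebook
    # entry; because the codebook is shared across the dataset, the same
    # object can receive the same identity in every view and scene.
    return F.cross_entropy(logits, mask_indices.reshape(-1))

# Example with random tensors (H=W=64, D=16, K=32):
loss = identity_supervision_loss(torch.randn(64, 64, 16),
                                 torch.randint(0, 32, (64, 64)),
                                 torch.randn(32, 16))
```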

What carries the argument

The scene-agnostic object codebook learned from the pre-trained GOCL module and coupled with its unsupervised masks to supervise 3D Gaussian identities.

If this is right

  • Enables direct supervision of identity features of 3D Gaussians without additional mask pre-/post-processing or multi-view alignment.
  • Provides consistent object representations across views and scenes.
  • Allows object supervision and identification without per-scene fine-tuning or retraining.
  • Yields more structured representations and better generalization for downstream tasks like robotic interaction and scene understanding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could enable object-level manipulations in reconstructed 3D scenes with minimal additional computation.
  • The codebook might support transferring object knowledge between entirely different environments.
  • Testing on datasets with more varied object categories could reveal scalability limits not addressed in the paper.

Load-bearing premise

The pre-trained GOCL module produces unsupervised object masks reliable enough to be directly coupled with the codebook for supervising 3D Gaussians across scenes without identity conflicts or further processing.
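One hypothetical way to stress this premise (not an experiment the paper reports): treat the GOCL masks for two nearby views as one-hot slot maps and measure per-slot overlap. Unstable overlap for slots that should track the same object is exactly the failure the premise rules out. A fair version would warp the masks into a common frame before comparing; the raw overlap shown here is only a coarse proxy for nearby views.

```python
import torch

def mask_stability(masks_view_a, masks_view_b):
    """Hypothetical reliability probe (not from the paper): given GOCL object
    masks for two nearby views as (K, H, W) one-hot float maps over K slots,
    return per-slot IoU. Persistently low IoU for slots that should track the
    same object would signal the identity conflicts the premise assumes away.
    """
    inter = (masks_view_a * masks_view_b).sum(dim=(1, 2))               # (K,)
    union = ((masks_view_a + masks_view_b) > 0).float().sum(dim=(1, 2)) # (K,)
    return inter / union.clamp(min=1.0)                                  # (K,) per-slot IoU
```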

What would settle it

Demonstrating that object identities assigned by the codebook become inconsistent when the same object appears in a different scene or under different viewing conditions would falsify the central claim.
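Concretely, such a test could look like the sketch below: assign each object's rendered identity features to the nearest codebook entry in two different scenes and check that the indices agree. The mean-pooling and nearest-entry rule are assumptions, not the paper's protocol.

```python
import torch
import torch.nn.functional as F

def assigned_identity(object_feats, codebook):
    """Nearest-codebook-entry assignment for one object's identity features.
    object_feats: (M, D) features of the Gaussians belonging to one object;
    codebook: (K, D)."""
    mean_feat = F.normalize(object_feats.mean(dim=0, keepdim=True), dim=-1)  # (1, D)
    sims = mean_feat @ F.normalize(codebook, dim=-1).t()                     # (1, K)
    return int(sims.argmax(dim=-1))

def same_identity_across_scenes(feats_scene_a, feats_scene_b, codebook):
    """True if the same physical object maps to the same codebook entry in
    two different scenes; systematic mismatches over many objects would
    falsify the central claim."""
    return assigned_identity(feats_scene_a, codebook) == \
           assigned_identity(feats_scene_b, codebook)
```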

Figures

Figures reproduced from arXiv:2604.09045 by Guiyu Liu, Janne Heikkilä, Juho Kannala, Tsuheng Hsu.

Figure 1: Overview of our proposed pipeline. (b) GOLD learns a global, scene-agnostic codebook storing a total of … [figures/full_fig_p002_1.png]
Figure 2: Qualitative comparison of rendered Gaussian feature masks on the OCTA dataset (left) and GSO dataset (right), where each object … [figures/full_fig_p005_2.png]
Figure 3: Visualization of cross-scene identity on the OCTA dataset (a) and GSO dataset (b). Our method achieves cross-scene identification … [figures/full_fig_p006_3.png]
Figure 4: Qualitative comparison of rendered Gaussian feature … [figures/full_fig_p007_4.png]
Figure 6: Object discovery capability of our method (red boxes). [figures/full_fig_p008_6.png]
Original abstract

Recent works on 3D scene understanding leverage 2D masks from visual foundation models (VFMs) to supervise radiance fields, enabling instance-level 3D segmentation. However, the supervision signals from foundation models are not fundamentally object-centric and often require additional mask pre/post-processing or specialized training and loss design to resolve mask identity conflicts across views. The learned identity of the 3D scene is scene-dependent, limiting generalizability across scenes. Therefore, we propose a dataset-level, object-centric supervision scheme to learn object representations in 3D Gaussian Splatting (3DGS). Building on a pre-trained slot attention-based Global Object Centric Learning (GOCL) module, we learn a scene-agnostic object codebook that provides consistent, identity-anchored representations across views and scenes. By coupling the codebook with the module's unsupervised object masks, we can directly supervise the identity features of 3D Gaussians without additional mask pre-/post-processing or explicit multi-view alignment. The learned scene-agnostic codebook enables object supervision and identification without per-scene fine-tuning or retraining. Our method thus introduces unsupervised object-centric learning (OCL) into 3DGS, yielding more structured representations and better generalization for downstream tasks such as robotic interaction, scene understanding, and cross-scene generalization.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a dataset-level, object-centric supervision scheme for 3D Gaussian Splatting (3DGS) that learns a scene-agnostic object codebook by building on a pre-trained slot-attention-based Global Object Centric Learning (GOCL) module. By coupling this codebook with the module's unsupervised object masks, the method directly supervises identity features of 3D Gaussians without mask pre-/post-processing, multi-view alignment, or per-scene fine-tuning, thereby introducing unsupervised object-centric learning into 3DGS for improved structure and cross-scene generalization in downstream tasks.

Significance. If the central claims hold, the work would provide a practical route to consistent, identity-anchored object supervision in 3DGS that generalizes across scenes without retraining, addressing a clear limitation of current VFM-based mask supervision approaches. The absence of any equations, loss formulations, training details, or quantitative results in the supplied material, however, prevents assessment of whether the method actually delivers the claimed benefits or avoids the identity-conflict issues it seeks to solve.

major comments (2)
  1. Abstract: the description supplies no equations, loss functions, training procedure, or experimental results, so it is impossible to verify whether the data or derivations support the stated claims of direct supervision, elimination of mask processing, and scene-agnostic generalization.
  2. Abstract: the central claim that coupling the codebook with GOCL masks enables reliable cross-scene supervision without identity conflicts rests on the unexamined assumption that the pre-trained GOCL module produces sufficiently consistent and accurate unsupervised masks; no justification or failure-mode analysis is provided.
minor comments (1)
  1. Abstract: the terms 'scene-agnostic object codebook' and 'identity-anchored representations' are introduced without a concrete definition of how the codebook is constructed or how identity consistency is enforced across scenes.
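For a concrete picture of what such a construction might involve, here is one purely illustrative possibility: vector quantization of slot embeddings against a dataset-level codebook with a straight-through assignment. Nothing in the abstract confirms this is the authors' mechanism; every name below is hypothetical.

```python
import torch

def quantize_slots(slots, codebook):
    """Illustrative only: dataset-level vector quantization of GOCL slot
    embeddings. Each slot snaps to its nearest codebook entry, which is one
    way identity could be 'anchored' across scenes; the paper may do
    something entirely different.

    slots:    (N, D) slot embeddings from the pre-trained GOCL module
    codebook: (K, D) learnable dataset-level codebook
    """
    dists = torch.cdist(slots, codebook)          # (N, K) pairwise distances
    indices = dists.argmin(dim=-1)                # (N,) identity assignments
    quantized = codebook[indices]                 # (N, D) anchored representations
    # Straight-through estimator: gradients pass to the slots, while the
    # codebook itself would be trained with a commitment-style loss (omitted).
    quantized = slots + (quantized - slots).detach()
    return quantized, indices
```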

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments. We address each major comment below with reference to the full manuscript, which contains the technical details absent from the abstract.

read point-by-point responses
  1. Referee: Abstract: the description supplies no equations, loss functions, training procedure, or experimental results, so it is impossible to verify whether the data or derivations support the stated claims of direct supervision, elimination of mask processing, and scene-agnostic generalization.

    Authors: The abstract is a concise summary of the core idea. The full manuscript provides the equations for the scene-agnostic object codebook, the loss formulation that directly supervises 3D Gaussian identity features by coupling codebook embeddings with GOCL masks, the dataset-level training procedure without per-scene fine-tuning or mask pre/post-processing, and quantitative results on downstream tasks that demonstrate the claimed benefits in structure and cross-scene generalization. revision: no

  2. Referee: Abstract: the central claim that coupling the codebook with GOCL masks enables reliable cross-scene supervision without identity conflicts rests on the unexamined assumption that the pre-trained GOCL module produces sufficiently consistent and accurate unsupervised masks; no justification or failure-mode analysis is provided.

    Authors: The manuscript justifies reliance on the pre-trained GOCL masks through qualitative results showing mask consistency across views and scenes, as well as quantitative comparisons against VFM-based supervision that highlight reduced identity conflicts. Failure-mode analysis, including cases of inaccurate GOCL masks and their impact on supervision, is included in the experiments and discussion sections. revision: no

Circularity Check

0 steps flagged

No significant circularity detected in the provided derivation chain

full rationale

The paper's central approach relies on coupling a pre-trained external GOCL module's unsupervised masks with a learned scene-agnostic codebook to supervise 3D Gaussian identity features. No equations, training objectives, or derivation steps are exposed in the abstract or summary that reduce any prediction or result to a fitted input, self-definition, or self-citation chain by construction. The method explicitly positions the GOCL module as an independent pre-trained component and claims generalization benefits without per-scene retraining, keeping the architecture self-contained against external benchmarks rather than internally tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level proposal of the codebook and reliance on the pre-trained module.

axioms (1)
  • domain assumption: The pre-trained GOCL module yields unsupervised object masks that can be directly used to supervise 3D Gaussian identity features without conflicts or extra processing.
    This premise is invoked when the paper states that the codebook can be coupled with the masks to enable direct supervision.
invented entities (1)
  • scene-agnostic object codebook (no independent evidence)
    purpose: To supply consistent, identity-anchored representations across views and scenes for supervising 3D Gaussians.
    The codebook is the central new construct introduced to achieve scene-agnostic supervision.

pith-pipeline@v0.9.0 · 5548 in / 1300 out tokens · 80848 ms · 2026-05-10T17:25:20.252828+00:00 · methodology

discussion (0)

