pith. sign in

arxiv: 2607.01633 · v1 · pith:7IM432IXnew · submitted 2026-07-02 · 💻 cs.CV

Bridging 3D Gaussians and Semantic Occupancy for Comprehensive Open-Vocabulary Scene Understanding from Unposed Images

Pith reviewed 2026-07-03 16:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene understandingGaussian splattingsemantic occupancyopen-vocabulary segmentationpose-free reconstructionvolumetric liftingneural rendering
0
0 comments X

The pith

COVScene lifts semantic Gaussians into a dense occupancy field inside the training graph so volumetric regularization supplies gradients to opacity, geometry, and semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents COVScene as a pose-free framework that couples 3D Gaussian primitives with a semantic occupancy field through differentiable volumetric lifting performed during training rather than only at test time. This design lets occupancy-based regularization send gradients back to the Gaussians, tightening constraints in regions that lack direct image observations. The method integrates a semantic-aware Geometry Transformer, multi-task decoding, foundation model distillation, and entropy regularization to support novel view synthesis, open-vocabulary querying, and occupancy prediction from the same representation. On ScanNet and ScanNet++ it preserves rendering quality while raising open-vocabulary segmentation accuracy and producing stronger semantic occupancy maps than a self-supervised baseline that receives no voxel-level labels.

Core claim

By lifting the predicted semantic Gaussians into an occupancy field inside the training computation graph, COVScene allows volumetric regularization to directly optimize Gaussian opacity, geometry, and semantic features, yielding improved open-vocabulary segmentation and semantic occupancy prediction from sparse unposed images without any direct voxel supervision.

What carries the argument

Differentiable volumetric lifting that converts semantic Gaussians into a dense occupancy field during training and back-propagates occupancy regularization gradients to the Gaussians.

If this is right

  • Novel view synthesis quality remains competitive while semantic tasks are added.
  • Open-vocabulary segmentation accuracy rises relative to image-space-only Gaussian baselines.
  • Semantic occupancy prediction improves over self-supervised methods that lack voxel supervision.
  • A single set of Gaussians supports rendering, semantic queries, and occupancy output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lifting idea could be applied to other explicit 3D representations to regularize unobserved space.
  • Removing the need for external calibration and voxel labels may simplify deployment on consumer cameras.
  • The occupancy field produced at inference time could be tested directly for downstream tasks such as path planning.

Load-bearing premise

That differentiable lifting of Gaussians into an occupancy field will generate useful gradients for unobserved regions and keep occupancy predictions accurate without any direct voxel-level supervision.

What would settle it

If ablating the volumetric lifting step during training produces equal or better semantic occupancy accuracy on held-out ScanNet scenes, the claimed benefit of the coupling collapses.

Figures

Figures reproduced from arXiv: 2607.01633 by Bohan Li, Chang Wen Chen, Hu Zhu, Wenjun Zeng, Xianda Guo, Xin Jin, Yanlun Peng, Zheng Zhu.

Figure 1
Figure 1. Figure 1: Versatile capabilities of COVScene. Given unposed multi-view images, a single forward pass directly constructs Comprehensive 3D Gaussians. This unified representation tightly couples appearance, geometry, and features, simultaneously enabling novel view synthesis, open-vocabulary semantic segmentation, depth estimation, and 3D semantic occupancy prediction. SAM mask features in Gaussian Splatting for effic… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of COVScene. Given multi-view unposed images, a semantic-aware Geometry Transformer estimates camera geometry, depth, and semantic [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effect of Occupancy Entropy Regularization (Locc reg). Left: Without Locc reg, unposed 3DGS produces floating density artifacts in empty space. The proposed regularization suppresses these artifacts. Right: The entropy loss penalizes ambiguous occupancy probabilities (O(x) ≈ 0.5) and encourages a bimodal physical state, where each region is either free or occupied. We implement this prior by minimizing the… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of open-vocabulary segmentation. We evaluate segmentation on common ScanNet categories. COVScene produces more consistent 3D-aware semantic maps than feed-forward 3D baselines while remaining competitive with the 2D open-vocabulary model. direct voxel supervision, which reflects the difficulty of semantic occupancy prediction under weak supervision. Within the no-3D-label setting, CO… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results of zero-shot occupancy prediction on in-the-wild [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

Comprehensive 3D scene understanding from sparse, unposed images requires a model to recover renderable geometry, open-vocabulary semantics, and free/occupied 3D space without relying on external camera calibration. Recent feed-forward Gaussian methods improve pose-free reconstruction and semantic rendering, but their Gaussian primitives are mainly optimized through image-space objectives and remain weakly constrained in unobserved regions. We propose \textit{COVScene}, a pose-free semantic Gaussian framework that couples renderable Gaussian primitives with a dense semantic occupancy field through differentiable volumetric lifting. Instead of converting Gaussians to voxels only at evaluation time, COVScene lifts the predicted semantic Gaussians inside the training computation graph, so volumetric regularization provides gradients to Gaussian opacity, geometry, and semantic features. The framework combines a semantic-aware Geometry Transformer, multi-task Gaussian decoding, geometric foundation distillation, and occupancy entropy regularization to support novel view synthesis, open-vocabulary semantic querying, and semantic occupancy prediction within a single representation. Experiments on ScanNet and ScanNet++ show that COVScene maintains competitive rendering quality, improves open-vocabulary segmentation, and achieves stronger semantic occupancy prediction than the self-supervised baseline without direct voxel-level supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces COVScene, a pose-free framework that couples semantic 3D Gaussian primitives with a dense semantic occupancy field via differentiable volumetric lifting inside the training graph. Volumetric regularization (occupancy entropy) is claimed to back-propagate gradients to Gaussian opacity, geometry, and features, including in unobserved regions. The method integrates a semantic-aware Geometry Transformer, multi-task Gaussian decoding, geometric foundation distillation, and entropy regularization to jointly support novel-view synthesis, open-vocabulary semantic querying, and semantic occupancy prediction. Experiments on ScanNet and ScanNet++ report competitive rendering quality alongside gains in open-vocabulary segmentation and occupancy prediction relative to a self-supervised baseline that lacks direct voxel supervision.

Significance. If the differentiable lifting successfully supplies non-vanishing gradients to unobserved Gaussians, the work would meaningfully advance unified Gaussian-volumetric representations for self-supervised 3D scene understanding. The single-representation multi-task design and avoidance of external calibration or voxel labels are notable strengths; the use of geometric foundation distillation is also a constructive element.

major comments (2)
  1. [Abstract / §3] Abstract and method description: the central claim that occupancy entropy regularization supplies informative gradients to Gaussian parameters for regions outside all training camera frustums is load-bearing yet unsupported by any derivation or sensitivity analysis of the lifting operator (splatting/alpha-compositing integral). Without this, it remains possible that image-space losses dominate and unobserved geometry/semantics stay under-constrained, undermining the reported occupancy gains without voxel supervision.
  2. [Experiments] Experiments section: quantitative improvements on ScanNet are stated without error bars, standard deviations across runs, or ablations isolating the contribution of volumetric lifting versus the Geometry Transformer or foundation distillation; this prevents verification that the occupancy regularization is the operative factor.
minor comments (3)
  1. [Abstract] The abstract does not specify the exact form of the lifting operator or how occupancy is queried from Gaussians inside the computation graph, hindering immediate reproducibility.
  2. [Method] Notation for the semantic occupancy field and its relation to Gaussian opacity is introduced without an explicit equation linking the two representations.
  3. [Figures / Tables] Figure captions and table headers should clarify whether reported metrics are computed only on observed or also on extrapolated voxels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify areas where additional technical detail and experimental rigor would strengthen the manuscript. We address each point below and will incorporate the suggested revisions.

read point-by-point responses
  1. Referee: [Abstract / §3] Abstract and method description: the central claim that occupancy entropy regularization supplies informative gradients to Gaussian parameters for regions outside all training camera frustums is load-bearing yet unsupported by any derivation or sensitivity analysis of the lifting operator (splatting/alpha-compositing integral). Without this, it remains possible that image-space losses dominate and unobserved geometry/semantics stay under-constrained, undermining the reported occupancy gains without voxel supervision.

    Authors: We agree that an explicit derivation of gradient flow through the differentiable volumetric lifting operator is needed to substantiate the claim. In the revised manuscript we will add a dedicated subsection deriving the back-propagation of the occupancy entropy loss through the alpha-compositing integral, showing that the integral admits non-zero gradients for Gaussians whose projected support lies outside any training frustum. We will also include a sensitivity analysis (gradient magnitude plots and ablation on frustum masking) confirming that the entropy term supplies informative gradients to unobserved regions rather than being dominated by image-space losses. revision: yes

  2. Referee: [Experiments] Experiments section: quantitative improvements on ScanNet are stated without error bars, standard deviations across runs, or ablations isolating the contribution of volumetric lifting versus the Geometry Transformer or foundation distillation; this prevents verification that the occupancy regularization is the operative factor.

    Authors: We acknowledge that the current experimental section lacks statistical reporting and targeted ablations. In the revision we will (i) report mean and standard deviation of all metrics across five independent runs with different random seeds, and (ii) add a new ablation table that systematically disables volumetric lifting while keeping the Geometry Transformer and foundation distillation fixed (and vice versa). These additions will isolate the contribution of occupancy entropy regularization and allow readers to verify its role in the reported gains. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external distillation and independent regularization

full rationale

The abstract and description present COVScene as a new framework that introduces differentiable volumetric lifting of Gaussians into an occupancy field, combined with geometric foundation distillation and entropy regularization. No equations, fitted parameters, or self-citations are shown that reduce any prediction to its own inputs by construction. The lifting step is presented as an architectural choice providing gradients, not as a self-referential fit. The method is benchmarked against external baselines on ScanNet/ScanNet++ without evidence of load-bearing self-citation chains or ansatz smuggling. This is the common case of a self-contained proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5757 in / 1087 out tokens · 21130 ms · 2026-07-03T16:54:12.873841+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

88 extracted references · 12 canonical work pages · 5 internal anchors

  1. [1]

    Pi-net: Point-to-image knowledge distillation for camera-based 3d semantic scene completion,

    Y . Xue, H. Pi, Z. Tang, K. Li, and R. Li, “Pi-net: Point-to-image knowledge distillation for camera-based 3d semantic scene completion,” IEEE Transactions on Multimedia, 2026. 9

  2. [2]

    From front to rear: 3d semantic scene completion through planar convolution and attention- based network,

    J. Li, Q. Song, X. Yan, Y . Chen, and R. Huang, “From front to rear: 3d semantic scene completion through planar convolution and attention- based network,”IEEE Transactions on Multimedia, 2023

  3. [3]

    3dgeodet: General-purpose geometry-aware image-based 3d object detection,

    Y . Zhang, Y . Wang, Y . Cui, and L.-P. Chau, “3dgeodet: General-purpose geometry-aware image-based 3d object detection,”IEEE Transactions on Multimedia, 2025

  4. [4]

    Convolutional neural network-based occupancy map accuracy improvement for video-based point cloud compression,

    W. Jia, L. Li, A. Akhtar, Z. Li, and S. Liu, “Convolutional neural network-based occupancy map accuracy improvement for video-based point cloud compression,”IEEE Transactions on Multimedia, 2022

  5. [5]

    3ur-llm: An end- to-end multimodal large language model for 3d scene understanding,

    H. Xiong, Y . Zhuge, J. Zhu, L. Zhang, and H. Lu, “3ur-llm: An end- to-end multimodal large language model for 3d scene understanding,” IEEE Transactions on Multimedia, 2025

  6. [6]

    Adaptive in adapter: Boosting open-vocabulary semantic segmentation with adaptive dropout adapter,

    C. Wang, W. Xu, R. Xu, Z. Zhang, S. Xu, J. Zhang, X. Teng, W. Meng, and X. Zhang, “Adaptive in adapter: Boosting open-vocabulary semantic segmentation with adaptive dropout adapter,”IEEE Transactions on Multimedia, 2026

  7. [7]

    Det-agent: Open- vocabulary object localization and detection with reinforcement learning agent,

    R. Zhang, X.-J. Wu, C. Wang, and C.-L. Liu, “Det-agent: Open- vocabulary object localization and detection with reinforcement learning agent,”IEEE Transactions on Multimedia, 2026

  8. [8]

    Unleash the power of vision-language models by visual attention prompt and multimodal interaction,

    W. Zhang, L. Wu, Z. Zhang, T. Yu, C. Ma, X. Jin, X. Yang, and W. Zeng, “Unleash the power of vision-language models by visual attention prompt and multimodal interaction,”IEEE Transactions on Multimedia, 2025

  9. [9]

    Cas-ovd: Cascaded open-vocabulary detection of small objects using multi-refined region proposal network in autonomous driving,

    Z. Fang, Y . Wu, J. Ren, J. Zheng, Y . Yan, and L. Zhang, “Cas-ovd: Cascaded open-vocabulary detection of small objects using multi-refined region proposal network in autonomous driving,”IEEE Transactions on Multimedia, 2026

  10. [10]

    Nerf: Representing scenes as neural radiance fields for view synthesis,

    B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” inECCV, 2020

  11. [11]

    3d gaussian splatting for real-time radiance field rendering,

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,”ACM Transactions on Graphics, 2023

  12. [12]

    Structure-from-motion revisited,

    J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016

  13. [13]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d re- construction,

    D. Charatan, S. L. Li, A. Tagliasacchi, and V . Sitzmann, “pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d re- construction,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

  14. [14]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images,

    Y . Chen, H. Xu, C. Zheng, B. Zhuang, M. Pollefeys, A. Geiger, T.-J. Cham, and J. Cai, “Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images,” inEuropean Conference on Computer Vision, 2024

  15. [15]

    VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

    W. Wang, Y . Chen, Z. Zhang, H. Liu, H. Wang, Z. Feng, W. Qin, Z. Zhu, D. Y . Chen, and B. Zhuang, “V olsplat: Rethinking feed-forward 3d gaussian splatting with voxel-aligned prediction,”arXiv preprint arXiv:2509.19297, 2025

  16. [16]

    No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images,

    B. Ye, S. Liu, H. Xu, X. Li, M. Pollefeys, M.-H. Yang, and S. Peng, “No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images,” inInternational Conference on Learning Representa- tions (ICLR), 2025

  17. [17]

    Splatter image: Ultra- fast single-view 3d reconstruction,

    S. Szymanowicz, C. Rupprecht, and A. Vedaldi, “Splatter image: Ultra- fast single-view 3d reconstruction,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

  18. [18]

    Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

    B. Smart, C. Zheng, I. Laina, and V . A. Prisacariu, “Splatt3r: Zero- shot gaussian splatting from uncalibrated image pairs,”arXiv preprint arXiv:2408.13912, 2024

  19. [19]

    Dust3r: Geometric 3d vision made easy,

    S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  20. [20]

    Vggt: Visual geometry grounded transformer,

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  21. [21]

    Mapanything: Universal feed-forward metric 3d re- construction,

    N. Keetha, N. M ¨uller, J. Sch ¨onberger, L. Porzi, Y . Zhang, T. Fischer, A. Knapitsch, D. Zauss, E. Weber, N. Antunes, J. Luiten, M. Lopez- Antequera, S. R. Bul `o, C. Richardt, D. Ramanan, S. Scherer, and P. Kontschieder, “Mapanything: Universal feed-forward metric 3d re- construction,” inInternational Conference on 3D Vision (3DV), 2026

  22. [22]

    Large spatial model: End-to- end unposed images to semantic 3d,

    Z. Fan, J. Zhang, W. Cong, P. Wang, R. Li, K. Wen, S. Zhou, A. Kadambi, Z. Wang, D. Xuet al., “Large spatial model: End-to- end unposed images to semantic 3d,”Advances in neural information processing systems, 2024

  23. [23]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views,

    L. Jiang, Y . Mao, L. Xu, T. Lu, K. Ren, Y . Jin, X. Xu, M. Yu, J. Pang, F. Zhaoet al., “Anysplat: Feed-forward 3d gaussian splatting from unconstrained views,”ACM Transactions on Graphics (TOG), 2025

  24. [24]

    Uniforward: Unified 3d scene and semantic field reconstruction via feed-forward gaussian splat- ting from only sparse-view images,

    Q. Tian, X. Tan, J. Gong, Y . Xie, and L. Ma, “Uniforward: Unified 3d scene and semantic field reconstruction via feed-forward gaussian splat- ting from only sparse-view images,”arXiv preprint arXiv:2506.09378, 2025

  25. [25]

    Uni3r: Unified 3d reconstruction and semantic understanding via generalizable gaussian splatting from unposed multi-view images,

    X. Sun, H. Jiang, L. Liu, S. Nam, G. Kang, X. Wang, W. Sui, Z. Su, W. Liu, X. Wang, and E. Park, “Uni3r: Unified 3d reconstruction and semantic understanding via generalizable gaussian splatting from unposed multi-view images,”arXiv preprint arXiv:2508.03643, 2025

  26. [26]

    Language-driven semantic segmentation,

    B. Li, K. Q. Weinberger, S. J. Belongie, V . Koltun, and R. Ranftl, “Language-driven semantic segmentation,” inInternational Conference on Learning Representations, 2022

  27. [27]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning, 2021

  28. [28]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  29. [29]

    Semantic gaussians: Open-vocabulary scene understanding with 3d gaussian splatting,

    J. Guo, X. Ma, Y . Fan, H. Liu, and Q. Li, “Semantic gaussians: Open-vocabulary scene understanding with 3d gaussian splatting,”arXiv preprint arXiv:2403.15624, 2024

  30. [30]

    Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields,

    S. Zhou, H. Chang, S. Jiang, Z. Fan, Z. Zhu, D. Xu, P. Chari, S. You, Z. Wang, and A. Kadambi, “Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  31. [31]

    Language embedded 3d gaussians for open-vocabulary scene understanding,

    J.-C. Shi, M. Wang, H.-B. Duan, and S.-H. Guan, “Language embedded 3d gaussians for open-vocabulary scene understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2024

  32. [32]

    Language embedded 3d gaussians for open-vocabulary scene querying,

    M. Wang, J.-C. Shi, S.-H. Guan, and H.-B. Duan, “Language embedded 3d gaussians for open-vocabulary scene querying,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  33. [33]

    Spatialsplat: Efficient semantic 3d from sparse unposed images,

    Y . Sheng, J. Deng, X. Zhang, Y . Zhang, B. Hua, Y . Zhang, and J. Ji, “Spatialsplat: Efficient semantic 3d from sparse unposed images,”arXiv preprint arXiv:2505.23044, 2025

  34. [34]

    Freesplat: Generalizable 3d gaussian splatting towards free view synthesis of indoor scenes,

    Y . Wang, T. Huang, H. Chen, and G. H. Lee, “Freesplat: Generalizable 3d gaussian splatting towards free view synthesis of indoor scenes,” Advances in Neural Information Processing Systems, 2024

  35. [35]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes,

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017

  36. [36]

    Scannet++: A high- fidelity dataset of 3d indoor scenes,

    C. Yeshwanth, Y .-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high- fidelity dataset of 3d indoor scenes,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  37. [37]

    pixelNeRF: Neural radiance fields from one or few images,

    A. Yu, V . Ye, M. Tancik, and A. Kanazawa, “pixelNeRF: Neural radiance fields from one or few images,” inCVPR, 2021

  38. [38]

    Fusing panoptic seg- mentation and geometry information for robust visual slam in dynamic environments,

    H. Zhu, C. Yao, Z. Zhu, Z. Liu, and Z. Jia, “Fusing panoptic seg- mentation and geometry information for robust visual slam in dynamic environments,” inCASE, 2022

  39. [39]

    Instant neural graphics primitives with a multiresolution hash encoding,

    T. M ¨uller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,”ACM Trans. Graph., 2022

  40. [40]

    K-planes: Explicit radiance fields in space, time, and appearance,

    S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa, “K-planes: Explicit radiance fields in space, time, and appearance,” in CVPR, 2023

  41. [41]

    Epipolar transformers,

    Y . He, R. Yan, K. Fragkiadaki, and S.-I. Yu, “Epipolar transformers,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

  42. [42]

    Mvsnet: Depth inference for unstructured multi-view stereo,

    Y . Yao, Z. Luo, S. Li, T. Fang, and L. Quan, “Mvsnet: Depth inference for unstructured multi-view stereo,”European Conference on Computer Vision (ECCV), 2018

  43. [43]

    Mvsplat360: Feed-forward 360 scene synthesis from sparse views,

    Y . Chen, C. Zheng, H. Xu, B. Zhuang, A. Vedaldi, T.-J. Cham, and J. Cai, “Mvsplat360: Feed-forward 360 scene synthesis from sparse views,”Advances in Neural Information Processing Systems (NeurIPS), 2024

  44. [44]

    Drivingforward: Feed-forward 3d gaussian splatting for driving scene reconstruction from flexible surround-view input,

    Q. Tian, X. Tan, Y . Xie, and L. Ma, “Drivingforward: Feed-forward 3d gaussian splatting for driving scene reconstruction from flexible surround-view input,” inProceedings of the AAAI Conference on Ar- tificial Intelligence, 2025

  45. [45]

    Yonosplat: You only need one model for feedforward 3d gaussian splatting,

    B. Ye, B. Chen, H. Xu, D. Barath, and M. Pollefeys, “Yonosplat: You only need one model for feedforward 3d gaussian splatting,” in International Conference on Learning Representations (ICLR), 2026

  46. [46]

    Mv-dust3r+: Single-stage scene reconstruction from sparse 10 views in 2 seconds,

    Z. Tang, Y . Fan, D. Wang, H. Xu, R. Ranjan, A. Schwing, and Z. Yan, “Mv-dust3r+: Single-stage scene reconstruction from sparse 10 views in 2 seconds,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025

  47. [47]

    Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views,

    S. Zhang, J. Wang, Y . Xu, N. Xue, C. Rupprecht, X. Zhou, Y . Shen, and G. Wetzstein, “Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025

  48. [48]

    Pf3plat: Pose-free feed-forward 3d gaussian splatting for novel view synthesis,

    S. Hong, J. Jung, H. Shin, J. Han, J. Yang, C. Luo, and S. Kim, “Pf3plat: Pose-free feed-forward 3d gaussian splatting for novel view synthesis,” inForty-second International Conference on Machine Learning, 2025

  49. [49]

    Lerf: Language embedded radiance fields,

    J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, “Lerf: Language embedded radiance fields,” inInternational Conference on Computer Vision (ICCV), 2023

  50. [50]

    One at a time: Progressive multi-step volumetric probability learning for reliable 3d scene perception,

    B. Li, Y . Sun, J. Dong, Z. Zhu, J. Liu, X. Jin, and W. Zeng, “One at a time: Progressive multi-step volumetric probability learning for reliable 3d scene perception,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 4, 2024, pp. 3028–3036

  51. [51]

    Garfield: Group anything with radiance fields,

    C. M. Kim, M. Wu, J. Kerr, M. Tancik, K. Goldberg, and A. Kanazawa, “Garfield: Group anything with radiance fields,” inConference on Computer Vision and Pattern Recognition (CVPR), 2024

  52. [52]

    Bridging stereo geometry and bev representation with reliable mutual interaction for semantic scene completion,

    B. Li, Y . Sun, Z. Liang, D. Du, Z. Zhang, X. Wang, Y . Wang, X. Jin, and W. Zeng, “Bridging stereo geometry and bev representation with reliable mutual interaction for semantic scene completion,”arXiv preprint arXiv:2303.13959, 2023

  53. [53]

    Langsplat: 3d language gaussian splatting,

    M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “Langsplat: 3d language gaussian splatting,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  54. [54]

    OmniNWM: Omniscient Driving Navigation World Models

    B. Li, Z. Ma, D. Du, B. Peng, Z. Liang, Z. Liu, C. Ma, Y . Jin, H. Zhao, W. Zenget al., “Omninwm: Omniscient driving navigation world models,”arXiv preprint arXiv:2510.18313, 2025

  55. [55]

    Occscene: Semantic occupancy-based cross-task mutual learning for 3d scene generation,

    B. Li, X. Jin, J. Wang, Y . Shi, Y . Sun, X. Wang, Z. Ma, B. Xie, C. Ma, X. Yanget al., “Occscene: Semantic occupancy-based cross-task mutual learning for 3d scene generation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  56. [56]

    Opensplat3d: Open-vocabulary 3d instance segmentation using gaussian splatting,

    J. Piekenbrinck, C. Schmidt, A. Hermans, N. Vaskevicius, T. Linder, and B. Leibe, “Opensplat3d: Open-vocabulary 3d instance segmentation using gaussian splatting,” inProceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops, 2025

  57. [57]

    From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation

    B. Li, S. Yang, B. Peng, X. Guo, E. Zhang, Y . Tao, J. Duan, D. Xu, Q. Dou, X. Jinet al., “From articulated kinematics to routed visual control for action-conditioned surgical video generation,”arXiv preprint arXiv:2605.08712, 2026

  58. [58]

    Uniscene: Unified occupancy-centric driving scene generation,

    B. Li, J. Guo, H. Liu, Y . Zou, Y . Ding, X. Chen, H. Zhu, F. Tan, C. Zhang, T. Wanget al., “Uniscene: Unified occupancy-centric driving scene generation,” inProceedings of the computer vision and pattern recognition conference, 2025, pp. 11 971–11 981

  59. [59]

    Hierarchical context alignment with disentangled geometric and temporal modeling for semantic occupancy prediction,

    B. Li, J. Deng, Y . Sun, X. Wang, X. Jin, and W. Zeng, “Hierarchical context alignment with disentangled geometric and temporal modeling for semantic occupancy prediction,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  60. [60]

    4d langsplat: 4d language gaussian splatting via multimodal large language models,

    W. Li, R. Zhou, J. Zhou, Y . Song, J. Herter, M. Qin, G. Huang, and H. Pfister, “4d langsplat: 4d language gaussian splatting via multimodal large language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  61. [61]

    Tun3d: Towards real-world scene understanding from unposed images,

    A. Konushin, N. Drozdov, B. Gabdullin, A. Zakharov, A. V orontsova, D. Rukhovich, and M. Kolodiazhnyi, “Tun3d: Towards real-world scene understanding from unposed images,” inIEEE International Conference on Robotics and Automation (ICRA), 2025

  62. [62]

    Jiang, A. Tagliasacchi, M. Pollefeys, and T. Funkhouser, “Openscene: 3d scene understanding with open vocabu- laries,

    S. Peng, K. Genova, C. “. Jiang, A. Tagliasacchi, M. Pollefeys, and T. Funkhouser, “Openscene: 3d scene understanding with open vocabu- laries,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  63. [63]

    Roboocc: Enhancing the geometric and semantic scene understanding for robots,

    Z. Zhang, Q. Zhang, W. Cui, S. Shi, Y . Guo, G. Han, W. Zhao, H. Ren, R. Xu, and J. Tang, “Roboocc: Enhancing the geometric and semantic scene understanding for robots,”arXiv preprint arXiv:2504.14604, 2025

  64. [64]

    Gaussian grouping: Segment and edit anything in 3d scenes,

    M. Ye, M. Danelljan, F. Yu, and L. Ke, “Gaussian grouping: Segment and edit anything in 3d scenes,” inECCV, 2024

  65. [65]

    Ov-nerf: Open-vocabulary neural radiance fields with vision and language foundation models for 3d semantic understanding,

    G. Liao, K. Zhou, Z. Bao, K. Liu, and Q. Li, “Ov-nerf: Open-vocabulary neural radiance fields with vision and language foundation models for 3d semantic understanding,”IEEE Transactions on Circuits and Systems for Video Technology, 2024

  66. [66]

    Fast and efficient: Mask neural fields for 3d scene segmentation,

    Z. Gao, L. Li, L. Jiao, F. Liu, X. Liu, W. Ma, Y . Guo, and S. Yang, “Fast and efficient: Mask neural fields for 3d scene segmentation,”arXiv preprint arXiv:2407.01220, 2024

  67. [67]

    FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views via Compact Semantic Representation

    Q. Tian, X. Tan, J. Ying, X. Wang, Y . Xie, and L. Ma, “Fleg: Feed- forward language embedded gaussian splatting from any views,”arXiv preprint arXiv:2512.17541, 2025

  68. [68]

    Monoscene: Monocular 3d semantic scene completion,

    A.-Q. Cao and R. De Charette, “Monoscene: Monocular 3d semantic scene completion,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  69. [69]

    Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving,

    Y . Wei, L. Zhao, W. Zheng, Z. Zhu, J. Zhou, and J. Lu, “Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving,”2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  70. [70]

    Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving,

    X. Tian, T. Jiang, L. Yun, Y . Mao, H. Yang, Y . Wang, Y . Wang, and H. Zhao, “Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving,” inThirty-seventh Conference on Neural Informa- tion Processing Systems Datasets and Benchmarks Track, 2023

  71. [71]

    Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception,

    X. Wang, Z. Zhu, W. Xu, Y . Zhang, Y . Wei, X. Chi, Y . Ye, D. Du, J. Lu, and X. Wang, “Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception,”2023 IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), 2023

  72. [72]

    Semantickitti: A dataset for semantic scene understand- ing of lidar sequences,

    J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understand- ing of lidar sequences,”2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019

  73. [73]

    Scaling up occupancy-centric driving scene generation: Dataset and method,

    B. Li, X. Jin, H. Zhu, H. Liu, R. Li, J. Guo, K. Cai, C. Ma, Y . Jin, H. Zhaoet al., “Scaling up occupancy-centric driving scene generation: Dataset and method,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  74. [74]

    Tri- perspective view for vision-based 3d semantic occupancy prediction,

    Y .-K. Huang, W. Zheng, Y . Zhang, J. Zhou, and J. Lu, “Tri- perspective view for vision-based 3d semantic occupancy prediction,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2023

  75. [75]

    Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction,

    Y . Zhang, Z. Zhu, and D. Du, “Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction,”IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), 2023

  76. [76]

    Hierarchical temporal context learning for camera-based semantic scene completion,

    B. Li, J. Deng, W. Zhang, Z. Liang, D. Du, X. Jin, and W. Zeng, “Hierarchical temporal context learning for camera-based semantic scene completion,” inEuropean Conference on Computer Vision, 2024

  77. [77]

    V oxformer: Sparse voxel transformer for camera- based 3d semantic scene completion,

    Y . Li, Z. Yu, C. B. Choy, C. Xiao, J. M. ´Alvarez, S. Fidler, C. Feng, and A. Anandkumar, “V oxformer: Sparse voxel transformer for camera- based 3d semantic scene completion,”2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  78. [78]

    Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction,

    Y . Huang, W. Zheng, Y . Zhang, J. Zhou, and J. Lu, “Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction,” European Conference on Computer Vision (ECCV), 2024

  79. [79]

    Gaussianformer-2: Probabilistic gaussian superposition for ef- ficient 3d occupancy prediction,

    Y . Huang, A. Thammatadatrakoon, W. Zheng, Y . Zhang, D. Du, and J. Lu, “Gaussianformer-2: Probabilistic gaussian superposition for ef- ficient 3d occupancy prediction,”IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  80. [80]

    Semantic causality-aware vision-based 3d occupancy predic- tion,

    D. Chen, H. Zheng, Y . Zhou, X. Li, W. Liao, T. He, P. Peng, and J. Shen, “Semantic causality-aware vision-based 3d occupancy predic- tion,”IEEE/CVF International Conference on Computer Vision (ICCV), 2025

Showing first 80 references.