pith. machine review for the scientific record.

arxiv: 2604.13476 · v2 · submitted 2026-04-15 · 💻 cs.RO · cs.CV

Recognition: unknown

RobotPan: A 360° Surround-View Robotic Vision System for Embodied Perception

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:19 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords surround-view perception · 3D Gaussian splatting · robotic vision · feed-forward reconstruction · novel view synthesis · real-time rendering · multi-sensor dataset

The pith

RobotPan turns six surround cameras into compact 3D Gaussians for real-time 360-degree robotic rendering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a surround-view robotic vision system that combines six cameras with LiDAR to deliver full 360-degree coverage without manual switching or motion jitter that causes simulator sickness. It presents RobotPan, a feed-forward framework that lifts calibrated multi-view features into a spherical coordinate system and decodes them into metric-scaled 3D Gaussians using hierarchical spherical voxel priors. These priors assign fine resolution near the robot and coarser resolution farther away to cut redundancy while preserving detail. An online fusion step selectively updates appearance to handle long sequences without letting the Gaussian count grow unbounded in static areas. The result is a compact representation that supports real-time reconstruction, rendering, and streaming on actual robot platforms for navigation, manipulation, and locomotion.
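To make the lifting step concrete, here is a minimal sketch of the robot-centric spherical conversion the summary describes. The axis conventions, function name, and use of NumPy are our illustration of the idea, not the paper's released code:

```python
# Hedged sketch: lift 3D points (already in the robot frame) into the
# spherical coordinates that the voxel priors are defined over.
import numpy as np

def to_robot_spherical(points_xyz: np.ndarray) -> np.ndarray:
    """Convert Nx3 robot-frame points to (radius, azimuth, elevation)."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    r = np.linalg.norm(points_xyz, axis=1)      # metric distance from the robot
    azimuth = np.arctan2(y, x)                  # angle around the robot, [-pi, pi)
    elevation = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))
    return np.stack([r, azimuth, elevation], axis=1)
```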

Core claim

RobotPan lifts multi-view features into a unified spherical coordinate representation and decodes Gaussians using hierarchical spherical voxel priors, allocating fine resolution near the robot and coarser resolution at larger radii to reduce computational redundancy without sacrificing fidelity. To support long sequences, our online fusion updates dynamic content while preventing unbounded growth in static regions by selectively updating appearance.

What carries the argument

Hierarchical spherical voxel priors that allocate resolution by distance from the robot to decode compact metric 3D Gaussians from sparse calibrated views.
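A hedged reading of that allocation in code: cell size grows with radius, so space near the robot is finely voxelized and distant space coarsely. The level boundaries, doubling factor, and base cell sizes below are invented for illustration; the paper does not publish them in the text provided.

```python
# Sketch of distance-dependent voxel allocation over spherical coordinates.
import numpy as np

def voxel_index(spherical: np.ndarray,
                r_bounds=(2.0, 8.0, 32.0),       # assumed level boundaries (m)
                base_cell=(0.05, 0.01, 0.01)):   # assumed (dr, d_az, d_el) at level 0
    """Map (r, azimuth, elevation) rows to integer cells (level, ir, iaz, iel)."""
    r = spherical[:, 0]
    level = np.searchsorted(np.asarray(r_bounds), r)  # 0 = finest, nearest the robot
    scale = 2.0 ** level                              # cell size doubles per level
    dr, daz, de = base_cell[0] * scale, base_cell[1] * scale, base_cell[2] * scale
    ir = np.floor(r / dr).astype(int)
    iaz = np.floor(spherical[:, 1] / daz).astype(int)
    iel = np.floor(spherical[:, 2] / de).astype(int)
    return np.stack([level, ir, iaz, iel], axis=1)
```

Under this scheme the number of cells per unit volume falls off with distance, which is the stated mechanism for cutting redundancy while keeping nearby detail.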

If this is right

  • Achieves competitive quality against prior feed-forward reconstruction and view-synthesis methods.
  • Produces substantially fewer Gaussians than those methods, enabling practical real-time embodied deployment.
  • Delivers full 360-degree visual coverage that meets the geometric and real-time constraints of robotic platforms.
  • Supports navigation, manipulation, and locomotion tasks via the released multi-sensor dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Head-mounted display users in teleoperation could see reduced simulator sickness because jitter is avoided at the source.
  • The compact representation may allow longer continuous operation than methods that accumulate unbounded elements.
  • The spherical prior design could be tested on platforms with different camera counts or LiDAR densities to measure robustness.

Load-bearing premise

The hierarchical spherical voxel priors and selective online fusion can maintain fidelity and prevent unbounded growth in real-world robotic sequences with dynamic content.
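A minimal sketch of what "selectively updating appearance" could mean in practice: dynamic cells get replaced outright, while static cells refresh only their appearance features, so the Gaussian count in static regions never grows. The dict layout and the `is_dynamic` flag are assumptions for illustration.

```python
# Hedged sketch of selective online fusion over voxel cells.
def fuse_frame(scene: dict, frame: dict) -> dict:
    """scene/frame map a voxel cell id to {'gaussians': ..., 'appearance': ...}."""
    for cell_id, obs in frame.items():
        if cell_id not in scene:
            scene[cell_id] = obs                  # first observation: add as-is
        elif obs.get("is_dynamic", False):
            scene[cell_id] = obs                  # dynamic: replace geometry too
        else:
            # static: refresh appearance only; geometry (and count) untouched
            scene[cell_id]["appearance"] = obs["appearance"]
    return scene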

What would settle it

Run the system on an extended sequence containing moving objects and check whether Gaussian count stays below practical limits while visual quality remains competitive with prior feed-forward baselines.
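A sketch of that check as a soak test: stream a long dynamic sequence and track the peak Gaussian count and worst-case render quality. The `model.update`/`model.render` interface and the budget numbers are hypothetical stand-ins; no such API ships with the paper.

```python
# Hedged sketch: bounded-growth and quality soak test over a long sequence.
import numpy as np

def psnr(rendered: np.ndarray, target: np.ndarray) -> float:
    mse = float(np.mean((rendered - target) ** 2))
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)  # images in [0,1]

def soak_test(model, sequence, budget=2_000_000, min_psnr=25.0) -> bool:
    """Pass iff Gaussian count stays under budget and quality never collapses."""
    peak_count, worst = 0, float("inf")
    for frame in sequence:                       # extended run with moving objects
        gaussians = model.update(frame)          # assumed online-fusion step
        peak_count = max(peak_count, len(gaussians))
        worst = min(worst, psnr(model.render(frame), frame["image"]))
    return peak_count <= budget and worst >= min_psnr
```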

Figures

Figures reproduced from arXiv: 2604.13476 by Gang Han, Jiahao Ma, Jian Tang, Miaomiao Liu, Peiran Liu, Pihai Sun, Qiang Zhang, Renjing Xu, Wei Cui, Wen Zhao, Yijie Guo, Zeran Su, Zhang Zhang, Zhiyuan Xu.

Figure 1
Figure 1: Overview of RobotPan, a surround-view robotic vision system for real-time embodied perception, deployed on the Tiangong 3.0 humanoid platform. Our system combines six cameras and LiDAR to provide full 360° visual coverage for embodied robot operation. From calibrated sparse multi-view observations, RobotPan predicts metric-scaled and compact 3D Gaussians, enabling real-time surround-view rendering, nov…
Figure 2
Figure 2: Pipeline of RobotPan. Multi-view images are encoded by a transformer to predict per-view depth and features. The reconstructed 3D points are converted to robot-centric spherical coordinates, voxelized into hierarchical spherical cells, and aggregated into anchor features, which are finally decoded into compact 3D Gaussian parameters for real-time rendering and reconstruction.
Figure 3
Figure 3: Multi-view consistent dynamic region identification via range-image fusion. We segment dynamic regions per view, reconstruct a shared 3D point cloud, project points to spherical coordinates to form a panoramic range image, and fuse multi-view results in the range-image domain to mitigate missed detections, yielding a robust dynamic-region mask for dynamic/static splitting.
Figure 4
Figure 4: Tiangong 3.0 robot head and its sensor layout. Left: front and side views of the robot head. Right: orthographic views of the head-mounted sensing system, showing the arrangement of six RGB cameras and one LiDAR with annotated dimensions and viewing directions.
Figure 5
Figure 5: Sensor layout and data acquisition. Left and middle: side and isometric views of our surround-view rig with six RGB cameras and a 40-beam LiDAR. Right: wearable data collection setup with sensor height matched to the humanoid robot.
Figure 6
Figure 6: Qualitative comparison of feed-forward 3D reconstruction. For each example, we show the input images, the LiDAR point cloud as ground truth, and the reconstructed geometry from different methods. Our method recovers more complete structures with sharper details.
Figure 7
Figure 7: Qualitative comparison of monocular depth prediction. For each example, we show the input image, predicted depth maps from different methods, and zoomed-in regions. Our method produces clearer boundaries and more consistent geometries across diverse scenes.
Figure 8
Figure 8: Qualitative comparison of novel view synthesis results against state-of-the-art methods. The first column shows the selected input reference views, and the remaining columns present the rendered novel views from different methods alongside the ground truth. Across challenging scenes, our method produces sharper structures, more faithful geometry, and fewer visual artifacts than competing approaches.
Figure 9
Figure 9: Qualitative comparison on DL3DV-Benchmarks and RealEstate10K. We show sparse input views and compare novel-view renderings from different methods against the ground truth. Our method preserves finer structures and cleaner boundaries.
read the original abstract

Surround-view perception is increasingly important for robotic navigation and loco-manipulation, especially in human-in-the-loop settings such as teleoperation, data collection, and emergency takeover. However, current robotic visual interfaces are often limited to narrow forward-facing views, or, when multiple on-board cameras are available, require cumbersome manual switching that interrupts the operator's workflow. Both configurations suffer from motion-induced jitter that causes simulator sickness in head-mounted displays. We introduce a surround-view robotic vision system that combines six cameras with LiDAR to provide full 360° visual coverage, while meeting the geometric and real-time constraints of embodied deployment. We further present RobotPan, a feed-forward framework that predicts metric-scaled and compact 3D Gaussians from calibrated sparse-view inputs for real-time rendering, reconstruction, and streaming. RobotPan lifts multi-view features into a unified spherical coordinate representation and decodes Gaussians using hierarchical spherical voxel priors, allocating fine resolution near the robot and coarser resolution at larger radii to reduce computational redundancy without sacrificing fidelity. To support long sequences, our online fusion updates dynamic content while preventing unbounded growth in static regions by selectively updating appearance. Finally, we release a multi-sensor dataset tailored to 360° novel view synthesis and metric 3D reconstruction for robotics, covering navigation, manipulation, and locomotion on real platforms. Experiments show that RobotPan achieves competitive quality against prior feed-forward reconstruction and view-synthesis methods while producing substantially fewer Gaussians, enabling practical real-time embodied deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RobotPan, a feed-forward framework that predicts metric-scaled compact 3D Gaussians from calibrated sparse multi-view camera and LiDAR inputs for 360° surround-view robotic perception. It lifts features into a unified spherical coordinate representation, decodes Gaussians via hierarchical spherical voxel priors that allocate finer resolution near the robot and coarser at distance, and employs selective online fusion to update dynamic content while preventing unbounded growth in static regions. A new multi-sensor dataset for navigation, manipulation, and locomotion is released, with experiments claiming competitive quality against prior feed-forward reconstruction and view-synthesis methods alongside substantially fewer Gaussians for real-time embodied deployment.

Significance. If the quantitative claims are substantiated, the work has clear significance for robotic vision by enabling practical real-time 360° rendering and reconstruction that mitigates motion jitter and supports teleoperation and loco-manipulation without the computational burden of dense representations or manual view switching.

major comments (2)
  1. [Online Fusion] The selective online fusion (described after the hierarchical priors in the method) is load-bearing for the central claim of bounded Gaussian count and real-time viability in long sequences, yet the manuscript provides no explicit criteria, thresholds, or pseudocode for the dynamic-content selection heuristic. This leaves open the risk that slow-moving or lighting-varying elements are misclassified, undermining the 'preventing unbounded growth' assertion.
  2. [Experiments] The experimental claims of 'competitive quality' and 'substantially fewer Gaussians' (abstract and §5) rest on comparisons to prior feed-forward methods, but the provided text references results without detailing specific metrics such as PSNR/SSIM deltas, per-frame Gaussian counts, or runtime tables; this weakens verification of the practical deployment advantage.
minor comments (2)
  1. [Abstract] The abstract would benefit from approximate numerical values (e.g., 'X% fewer Gaussians' or 'Y ms/frame') to make the 'substantially fewer' claim more concrete.
  2. [Method] Clarify the exact definition of 'hierarchical resolution allocation thresholds' in the voxel prior description to avoid ambiguity in reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will implement to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Online Fusion] The selective online fusion (described after the hierarchical priors in the method) is load-bearing for the central claim of bounded Gaussian count and real-time viability in long sequences, yet the manuscript provides no explicit criteria, thresholds, or pseudocode for the dynamic-content selection heuristic. This leaves open the risk that slow-moving or lighting-varying elements are misclassified, undermining the 'preventing unbounded growth' assertion.

    Authors: We agree that the selective online fusion requires more explicit documentation to substantiate the bounded Gaussian count claim. In the revised manuscript we will add the precise selection criteria (feature-difference and depth-consistency thresholds), the decision logic for classifying dynamic versus static content, and pseudocode for the fusion step. These additions will also clarify safeguards against misclassification of slow-moving objects or lighting changes. revision: yes

  2. Referee: [Experiments] The experimental claims of 'competitive quality' and 'substantially fewer Gaussians' (abstract and §5) rest on comparisons to prior feed-forward methods, but the provided text references results without detailing specific metrics such as PSNR/SSIM deltas, per-frame Gaussian counts, or runtime tables; this weakens verification of the practical deployment advantage.

    Authors: We acknowledge that the current presentation of quantitative results could be more explicit. In the revised manuscript we will include expanded tables reporting PSNR, SSIM, LPIPS, per-frame Gaussian counts, and runtime measurements with direct comparisons to baselines, thereby making the claimed advantages in quality and efficiency fully verifiable. revision: yes
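To ground the first response above, here is a purely illustrative sketch of what feature-difference and depth-consistency criteria could look like. The thresholds, array shapes, and function name are assumptions; the paper promises, but has not yet published, the real selection logic.

```python
# Hedged sketch of a dynamic/static selection heuristic with two cues.
import numpy as np

def classify_dynamic(feat_prev: np.ndarray, feat_cur: np.ndarray,
                     depth_prev: np.ndarray, depth_cur: np.ndarray,
                     tau_feat: float = 0.3, tau_depth: float = 0.1) -> np.ndarray:
    """Per-cell boolean mask: True where content is treated as dynamic."""
    feat_diff = np.linalg.norm(feat_cur - feat_prev, axis=-1)  # appearance change
    rel_depth = np.abs(depth_cur - depth_prev) / np.maximum(depth_prev, 1e-6)
    # Require BOTH cues, so lighting-only changes (features move, depth does
    # not) stay static and update appearance rather than spawning Gaussians.
    return (feat_diff > tau_feat) & (rel_depth > tau_depth)
```

Requiring both cues is one possible safeguard against the referee's misclassification worry; whether the authors adopt anything like it is an open question until the revision lands.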

Circularity Check

0 steps flagged

No circularity: derivation rests on independent architectural choices

full rationale

The paper presents RobotPan as a feed-forward network that lifts multi-view features into spherical coordinates and decodes Gaussians via hierarchical voxel priors plus selective online fusion. These are explicit design decisions allocating resolution and controlling growth; no equation or claim equates a prediction to its own fitted inputs by construction. No self-citations appear in the provided text, and the fewer-Gaussians outcome is stated as an empirical result of the priors and fusion heuristic rather than a definitional identity. The framework's claims therefore stand or fall against external benchmarks rather than by circular construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claims rest on newly introduced architectural components without external validation or proofs in the provided abstract; the dataset release provides some grounding but the method details are high-level.

free parameters (1)
  • hierarchical resolution allocation thresholds
    Parameters controlling fine resolution near the robot versus coarser at larger radii are introduced to reduce redundancy but their specific values or fitting process are not detailed.
axioms (1)
  • domain assumption: Calibrated sparse-view inputs from six cameras suffice for metric-scaled 3D reconstruction in robotic environments
    Invoked in the lifting to unified spherical representation and Gaussian decoding.
invented entities (2)
  • hierarchical spherical voxel priors · no independent evidence
    purpose: To allocate resolution based on distance from the robot for efficient Gaussian decoding
    Newly proposed component in the framework with no independent evidence provided.
  • selective online fusion for dynamic content · no independent evidence
    purpose: To update appearance in long sequences while preventing unbounded growth in static regions
    New mechanism introduced to support real-time streaming without specified external validation.

pith-pipeline@v0.9.0 · 5641 in / 1463 out tokens · 69763 ms · 2026-05-10T13:19:56.388161+00:00 · methodology

discussion (0)

