
arxiv: 2606.22971 · v1 · pith:J6QC3NAQnew · submitted 2026-06-22 · 💻 cs.RO · cs.CV

Humanoid-OmniOcc: Stereo-Based Full-View Occupancy Dataset for Embodied AI

Pith reviewed 2026-06-26 08:26 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords occupancy prediction · humanoid robots · stereo vision · sim-to-real transfer · embodied AI · panoramic perception · 3D scene understanding · indoor navigation

The pith

A stereo panoramic dataset built via Real2Sim2Real lets humanoid robots predict full-view occupancy more accurately than monocular methods and transfers to real captures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Humanoid-OmniOcc, a dataset with over 155K samples drawn from 15 simulated indoor scenes and 5 real environments, constructed so that real sensor specifications determine the simulation used to generate training labels. It introduces a surround stereo model that lifts 2D images to 3D occupancy using depth priors obtained from stereo pairs rather than single views. This combination targets the mismatch between existing vehicle-centric occupancy data and the needs of humanoid robots that must perceive in all directions inside buildings. If the central claim holds, training on the simulated portion produces models whose accuracy remains high when tested directly on real-world recordings, closing a practical gap for safe navigation and interaction.

Core claim

The central claim is that a closed-loop Real2Sim2Real pipeline, in which real camera specifications drive physically accurate simulation to produce large-scale labeled panoramic stereo data, enables a stereo-guided occupancy model to outperform monocular baselines while generalizing to both unseen simulated scenes and real-world captures.

What carries the argument

The Humanoid Surround Stereo-guided Occupancy model that uses stereo depth priors to perform accurate 2D-to-3D lifting, supported by the Real2Sim2Real dataset construction process.
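
As a rough illustration of what 2D-to-3D lifting from a stereo depth prior involves, the sketch below converts a disparity to metric depth, back-projects the pixel through a pinhole model, and marks the containing voxel as occupied. It is a purely geometric toy under assumed conventions (rectified stereo pair, one focal length in pixels, a voxel grid aligned with the camera frame). All function names and parameters are hypothetical, and it is not the learned model described in the paper.

```python
import math
from typing import Iterable, Set, Tuple

Voxel = Tuple[int, int, int]


def disparity_to_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth along the optical axis for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to give a finite depth")
    return focal_px * baseline_m / disparity_px


def backproject(u: float, v: float, depth_m: float,
                focal_px: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole back-projection of pixel (u, v) at a known depth into camera coordinates."""
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)


def lift_to_voxels(pixels: Iterable[Tuple[float, float, float]],
                   focal_px: float, baseline_m: float,
                   cx: float, cy: float, voxel_size_m: float) -> Set[Voxel]:
    """Lift (u, v, disparity) samples to the set of occupied voxel indices.

    Samples with non-positive disparity (e.g. matching failures) are skipped.
    """
    occupied: Set[Voxel] = set()
    for u, v, disparity in pixels:
        if disparity <= 0:
            continue
        depth = disparity_to_depth(disparity, focal_px, baseline_m)
        x, y, z = backproject(u, v, depth, focal_px, cx, cy)
        occupied.add((math.floor(x / voxel_size_m),
                      math.floor(y / voxel_size_m),
                      math.floor(z / voxel_size_m)))
    return occupied
```

With an assumed focal length of 500 px and a baseline of 0.125 m, a disparity of 25 px corresponds to a depth of 2.5 m. A learned model would replace the hard voxel assignment with features distributed along the ray.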

If this is right

  • Stereo inputs produce higher occupancy accuracy than monocular inputs across the evaluated scenes.
  • Models trained on the simulated data maintain performance on previously unseen simulated indoor environments.
  • The same models retain usable accuracy when evaluated on real-world stereo captures.
  • The Real2Sim2Real loop supports repeated cycles of simulation improvement and model retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sensor-driven simulation approach could be applied to other robot morphologies that require wide-field perception.
  • Full-view occupancy maps may lower collision rates during close-range manipulation tasks inside homes or offices.
  • The dataset supplies a controlled testbed for measuring how much stereo information reduces depth ambiguity relative to monocular methods.
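
The depth-ambiguity point can be made concrete with the textbook triangulation relation for a rectified stereo pair. This is standard geometry rather than anything taken from the paper, and the symbols are assumptions introduced here: $f$ is the focal length in pixels, $B$ the baseline, $d$ the disparity, and $Z$ the depth.

```latex
Z = \frac{f B}{d}
\quad\Longrightarrow\quad
\frac{\partial Z}{\partial d} = -\frac{f B}{d^{2}} = -\frac{Z^{2}}{f B}
\quad\Longrightarrow\quad
|\delta Z| \approx \frac{Z^{2}}{f B}\,|\delta d|
```

For a fixed matching error $|\delta d|$, the depth error grows quadratically with range, so stereo is most informative at the short ranges typical of indoor scenes, whereas a single view fixes metric scale only through learned priors.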

Load-bearing premise

The simulation accurately reproduces real sensor behavior and physical environments so models trained inside it perform well on actual robot captures.

What would settle it

A large accuracy drop on the real-world test captures compared with the simulated test scenes after training on the simulated data.

Figures

Figures reproduced from arXiv: 2606.22971 by Bohao Zhang, Chenwei Huang, Cong Yang, Qin Zou, Ruilin Wang, Shiyuan Chen, Wei Sui, Xianda Guo, Yiqun Duan.

Figure 1: Illustration of the proposed Humanoid-OmniOcc dataset. Left: Six representative scenes rendered in high photorealistic quality, covering diverse spatial layouts and material textures. Right: Visualization of one scene with four stereo RGB pairs (Front, Rear, Left, Right), their corresponding depth maps, and voxelized occupancy labels.
Figure 2: Comparison of different 3D perception paradigms. Top: Monocular-based model. Middle: Multi-sensor fusion model. Bottom: Our proposed stereo-based model. (Excerpt from the adjacent text: We introduce Humanoid-OmniOcc, a stereo-based panoramic occupancy dataset tailored for humanoid perception, designed around a Real2Sim2Real closed-loop paradigm. Built on NVIDIA Isaac Sim, a head-like rig of four synchronized stereo cameras provides full-s…)
Figure 3: The data collection setup, featuring the …
Figure 4: The pipeline of our proposed HS2Occ framework.
Figure 5: Qualitative comparisons on the test set (first row) and real-world scenes (rows 2–3). The first column shows the left images from the four stereo pairs (front, back, left, and right).
Figure 6: Performance scaling of HS2Occ with training and testing data.
Original abstract

Occupancy prediction at voxel-level granularity is essential for safe robotic navigation and interaction in complex environments. Existing occupancy datasets, however, are predominantly designed for autonomous driving with vehicle-centric biases (forward-facing cameras, far-field geometry, and static road priors), limiting their applicability to embodied humanoid perception. We present Humanoid-OmniOcc, a large-scale panoramic stereo-based occupancy dataset tailored for humanoid robots. The dataset encompasses 15 diverse simulated indoor scenes and 5 real-world environments, yielding over 155K samples with broad scene and style diversity. Importantly, the dataset is designed around a Real2Sim2Real closed-loop paradigm: real sensor specifications drive physically accurate simulation, simulation produces large-scale annotated training data, and models trained in simulation are directly evaluated on real-world captures, enabling iterative refinement of the sim-to-real pipeline. We further propose the Humanoid Surround Stereo-guided Occupancy model (Humanoid-OmniOcc) that exploits robust depth priors for accurate 2D-to-3D lifting. Extensive experiments show that Humanoid-OmniOcc consistently outperforms monocular baselines and generalizes well to both unseen simulated test scenes and real-world environments, validating the effectiveness of the Real2Sim2Real design. Code and data will be available upon acceptance at https://d-robotics-ai-lab.github.io/humanoid-omniocc.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims to introduce Humanoid-OmniOcc, a large-scale panoramic stereo-based occupancy dataset for humanoid robots comprising 15 simulated indoor scenes and 5 real-world environments with over 155K samples. It follows a Real2Sim2Real closed-loop paradigm in which real sensor specifications drive physically accurate simulation to generate annotated training data, with models trained in simulation then evaluated directly on real captures. The authors also propose the Humanoid Surround Stereo-guided Occupancy model that exploits robust depth priors for 2D-to-3D lifting. Extensive experiments are reported to show consistent outperformance over monocular baselines together with good generalization to unseen simulated test scenes and real-world environments, thereby validating the Real2Sim2Real design.

Significance. If the quantitative results and sim-to-real transfer claims hold, the work would supply a valuable humanoid-centric occupancy resource that addresses the forward-facing, far-field, and road-prior biases of existing autonomous-driving datasets. The closed-loop paradigm and the explicit commitment to release code and data would further strengthen reproducibility and enable iterative refinement of sim-to-real pipelines for embodied perception.

major comments (2)
  1. [Experiments] Experiments section: The central claim that the Real2Sim2Real design is validated by successful generalization to real-world captures rests on the unverified assumption that simulated stereo observations (noise, calibration errors, matching failures) statistically match the real sensor. No quantitative fidelity checks—such as depth-distribution KL divergence, disparity-error histograms, or calibration-residual comparisons—are reported between the simulated and real data in the five environments.
  2. [Abstract and §4] Abstract and §4: The statement that Humanoid-OmniOcc 'consistently outperforms monocular baselines' is presented without any numerical results, error bars, specific metrics (e.g., IoU, mIoU), baseline implementations, or statistical significance tests, preventing assessment of whether the empirical support is load-bearing for the generalization claim.
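
One of the fidelity checks named in the first comment, a KL divergence between simulated and real depth distributions, could look roughly like the sketch below. The bin range, bin count, and smoothing constant `eps` are assumptions chosen for illustration, and the function names are hypothetical. The paper reports no such check, so no values are implied.

```python
import math
from typing import List, Sequence


def depth_histogram(depths_m: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalized histogram of depths over [lo, hi); out-of-range values are ignored."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for z in depths_m:
        if lo <= z < hi:
            # min() guards against rounding that would push a value into bin index `bins`
            counts[min(int((z - lo) / width), bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no depth values fall inside the histogram range")
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-9) -> float:
    """KL(p || q) in nats, with additive smoothing so empty bins do not produce infinities."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    zp, zq = sum(p_s), sum(q_s)
    return sum((a / zp) * math.log((a / zp) / (b / zq)) for a, b in zip(p_s, q_s))
```

Passing histograms of real and simulated depths for the same environment would give a single scalar per environment. Values near zero indicate closely matched depth statistics, but what counts as "close enough" would still need to be calibrated against downstream accuracy.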

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim that the Real2Sim2Real design is validated by successful generalization to real-world captures rests on the unverified assumption that simulated stereo observations (noise, calibration errors, matching failures) statistically match the real sensor. No quantitative fidelity checks—such as depth-distribution KL divergence, disparity-error histograms, or calibration-residual comparisons—are reported between the simulated and real data in the five environments.

    Authors: We agree that the manuscript does not report quantitative fidelity checks between simulated and real stereo observations. To address this, we will incorporate depth-distribution KL divergence, disparity-error histograms, and calibration-residual comparisons for the five real environments in a revised Experiments section. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4: The statement that Humanoid-OmniOcc 'consistently outperforms monocular baselines' is presented without any numerical results, error bars, specific metrics (e.g., IoU, mIoU), baseline implementations, or statistical significance tests, preventing assessment of whether the empirical support is load-bearing for the generalization claim.

    Authors: We agree that the abstract summarizes results at a high level without numbers. We will revise §4 to explicitly present numerical results (including IoU and mIoU with error bars), baseline implementation details, and any statistical significance tests, and will update the abstract to reference key quantitative findings. revision: yes
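
For readers unfamiliar with the metrics requested here, the following is a minimal sketch of IoU and mIoU over flattened voxel labels. It assumes label 0 means free space and that mIoU averages only over semantic classes present in the prediction or the ground truth. These are common conventions, but the paper's exact protocol is not stated in the text above, and the function names are hypothetical.

```python
from typing import List, Sequence

FREE = 0  # assumed label for empty voxels


def occupancy_iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Class-agnostic IoU between occupied (non-free) voxels."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of voxels")
    inter = sum(1 for p, g in zip(pred, gt) if p != FREE and g != FREE)
    union = sum(1 for p, g in zip(pred, gt) if p != FREE or g != FREE)
    # Convention chosen here: two entirely free grids count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pred: Sequence[int], gt: Sequence[int], num_classes: int) -> float:
    """Mean per-class IoU over semantic classes 1..num_classes-1 that occur at least once."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of voxels")
    ious: List[float] = []
    for c in range(1, num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    if not ious:
        raise ValueError("no semantic class occurs in pred or gt")
    return sum(ious) / len(ious)
```

Reporting these per scene, with the spread across scenes, would supply the error bars promised in the response.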

Circularity Check

0 steps flagged

No circularity; empirical dataset and evaluation are self-contained

Full rationale

The paper introduces a dataset via Real2Sim2Real design and evaluates a proposed model through direct experiments on real-world captures, showing outperformance versus monocular baselines. No equations, parameter fits, or derivations are present in the provided text. Claims rest on empirical measurements rather than reducing by construction to inputs, self-citations, or renamed known results. The sim-to-real transfer is an empirical assumption tested by real-data performance, not a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no specific free parameters, axioms, or invented entities identifiable without full text. The Real2Sim2Real paradigm relies on standard assumptions in simulation-to-real transfer but none are detailed here.

pith-pipeline@v0.9.1-grok · 5817 in / 1318 out tokens · 31142 ms · 2026-06-26T08:26:49.219222+00:00 · methodology

