SpatialBench: Is Your Spatial Foundation Model an All-Round Player?
Pith reviewed 2026-06-29 17:46 UTC · model grok-4.3
The pith
Spatial foundation models are not yet all-round players that generalize across tasks, viewpoints, domains, densities, and hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpatialBench shows that current spatial foundation models cannot generalize robustly across the tested conditions; full-context attention maximizes accuracy while bounded-memory strategies enable long-sequence handling, and domain alignment plus high data quality outperform simple scaling in embodied and egocentric settings.
What carries the argument
SpatialBench, a cross-paradigm benchmark of 19 datasets and 546 scenes with deterministic sampling across 5 domains, 6 paradigms, 5 task suites, and 4 input densities.
If this is right
- Full-context attention should be used when accuracy is the priority.
- Bounded-memory attention mechanisms can scale models to longer sequences.
- Training data should prioritize domain alignment and quality over volume.
- Embodied and egocentric tasks require specialized handling beyond standard scaling.
- The released DA-Next-5M dataset and DA-Next model provide a stronger starting point for future spatial representation learning.
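The trade-off between the first two points can be made concrete with a little arithmetic. The sketch below counts the query-key pairs scored under full-context attention and under a causal sliding window. The sliding window is only one illustrative bounded-memory strategy, chosen here as an assumption; the source does not say which strategies the paper evaluates. The function names are hypothetical.

```python
def full_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored when every token attends to every token."""
    return num_tokens * num_tokens


def windowed_attention_pairs(num_tokens: int, window: int) -> int:
    """Query-key pairs when each token attends only to itself and the
    (window - 1) tokens before it.

    Assumption: a causal sliding window stands in for "bounded memory";
    it is not taken from the paper.
    """
    pairs = 0
    for position in range(num_tokens):
        # Early tokens have fewer than `window` predecessors available.
        pairs += min(position + 1, window)
    return pairs
```

Under this assumption the full-context cost grows quadratically with sequence length, while the windowed cost grows roughly linearly once the sequence is longer than the window. That is consistent with the claim that bounded memory helps with long sequences at the expense of context.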
Where Pith is reading between the lines
- Spatial models may need separate pathways for egocentric versus third-person data rather than unified scaling.
- Hardware-specific constraints could be tested directly on physical robots to confirm the benchmark's density and memory findings.
- The emphasis on domain alignment suggests synthetic pre-training alone may remain insufficient without targeted real-world fine-tuning.
- Future benchmarks could add explicit viewpoint randomization schedules to measure generalization more precisely than the current fixed sampling.
Load-bearing premise
The 19 datasets, 5 domains, 6 paradigms, and 5 task suites in SpatialBench sufficiently represent real-world spatial generalization challenges.
What would settle it
Discovery of one model that ranks at or near the top on every task suite under all four input densities and across all five domains would falsify the central claim.
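A minimal sketch of how such a check could be run over a results table follows. It assumes scores are stored per model and per condition, where a condition is a (task suite, input density, domain) triple and higher is better. The function name, the data layout, and the `top_k` threshold for "at or near the top" are illustrative assumptions, not part of SpatialBench.

```python
from typing import Dict, List, Tuple

# Hypothetical condition key: (task suite, input density, domain).
Condition = Tuple[str, str, str]


def all_round_models(
    scores: Dict[str, Dict[Condition, float]], top_k: int = 1
) -> List[str]:
    """Return the models that rank within top_k in every condition.

    Assumes higher scores are better. A model with no score for some
    condition cannot qualify. Ties are broken by model name.
    """
    models = sorted(scores)
    conditions = set()
    for per_condition in scores.values():
        conditions.update(per_condition)

    survivors = set(models)
    for condition in conditions:
        ranked = sorted(
            (m for m in models if condition in scores[m]),
            key=lambda m: scores[m][condition],
            reverse=True,
        )
        # Keep only models that are in the top_k for this condition.
        survivors &= set(ranked[:top_k])
    return sorted(survivors)
```

A non-empty result on the full benchmark grid would be the kind of counterexample described above. An empty result at small `top_k` is consistent with the paper's claim.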
Original abstract
While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round players capable of generalizing robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and specific hardware constraints? Answering this overarching question requires a holistic assessment, yet current models are mainly evaluated on specific domains for which they were specifically designed or trained. Such evaluations are intrinsically limited by narrow paradigm coverage, limited scene domains, and arbitrary frame sampling, making it fundamentally difficult to assess their true generalization capabilities. To address this gap, we present SpatialBench, a cross-paradigm, domain-diverse benchmark for spatial foundation models with deterministic sampling. SpatialBench features unprecedented scale and rigorous deterministic design, comprising 19 datasets and 546 scenes across 5 diverse spatial domains. It comprehensively evaluates 41 models across 6 paradigms on 5 task suites under 4 different input density settings. Our extensive evaluation reveals that current models are not yet all-round players, and uncovers crucial insights for future advancement. Specifically, we demonstrate that full-context attention maximizes accuracy while bounded-memory strategies unlock long-sequence scalability. Moreover, our empirical evaluations in challenging embodied and egocentric tasks demonstrate that strict domain alignment and high data quality are far more critical to performance than simple dataset scaling. Furthermore, to address the largest data gap identified in our analysis, we go beyond evaluation by introducing a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next, pushing the boundaries of spatial representation learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SpatialBench, a cross-paradigm benchmark comprising 19 datasets, 546 scenes across 5 spatial domains, 6 paradigms, and 5 task suites, with deterministic sampling and evaluations of 41 models under 4 input-density settings. It concludes that current spatial foundation models are not all-round players capable of robust generalization across tasks, viewpoints, domains, densities, and hardware constraints. The work also releases DA-Next-5M (a 5M-scale dataset) and the DA-Next baseline model to address the largest identified data gap, while reporting empirical findings on full-context attention versus bounded-memory strategies and the primacy of domain alignment over dataset scaling.
Significance. If the benchmark's coverage is shown to be representative, the work supplies a large-scale, multi-paradigm evaluation that could usefully quantify current limitations in spatial foundation models and highlight actionable directions (attention mechanisms, data quality). The release of DA-Next-5M and a competitive baseline constitutes a concrete resource contribution. The deterministic-sampling design and scale (546 scenes) are strengths that distinguish it from prior narrower evaluations.
major comments (2)
- [§3 (Benchmark Construction) and §1 (Introduction)] The central claim that 'current models are not yet all-round players' (abstract, §1, conclusion) rests on SpatialBench being a sufficiently complete proxy for arbitrary viewpoints, shifting domains, varying densities, and hardware constraints. The manuscript describes the 19 datasets and 5 domains but does not supply an explicit coverage analysis, selection criteria, or bias audit demonstrating that the chosen scenes exhaustively sample the relevant variation space; without this, failures on SpatialBench do not necessarily entail the broader negative generalization claim.
- [Table 2 and §5.3] Table 2 and the embodied/egocentric results (§5.3): the reported performance gaps are used to argue that 'strict domain alignment and high data quality are far more critical than simple dataset scaling.' However, the paper does not report an ablation that isolates data quality from domain overlap or controls for model capacity, leaving the causal interpretation under-supported.
minor comments (2)
- [§3.2] The deterministic sampling procedure is described in prose but would benefit from an explicit algorithm box or pseudocode to allow exact reproduction.
- [Figure 4] Figure 4 (input-density ablations) uses inconsistent y-axis scaling across subplots, making visual comparison of relative drops difficult.
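To illustrate what the pseudocode requested in the first minor comment might look like, here is a minimal sketch of one deterministic frame-sampling rule: evenly spaced indices computed with integer arithmetic, so the same scene length and frame budget always give the same frames. This is an assumed scheme for illustration only. The paper's actual procedure is described in its §3.2 and is not reproduced here, and the function name is hypothetical.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick evenly spaced frame indices with no randomness.

    Assumption: an even stride over the sequence. The first and last
    frames are always included when at least two samples are requested.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Not enough frames to subsample: use all of them.
        return list(range(num_frames))
    if num_samples == 1:
        return [0]
    # Integer arithmetic avoids platform-dependent float rounding.
    last = num_frames - 1
    return [i * last // (num_samples - 1) for i in range(num_samples)]
```

For example, `sample_frame_indices(10, 4)` returns `[0, 3, 6, 9]`. Calling the function once per input-density setting would give a fixed frame set for each density.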
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing clarifications and indicating planned revisions where appropriate.
- Referee: [§3 (Benchmark Construction) and §1 (Introduction)] The central claim that 'current models are not yet all-round players' (abstract, §1, conclusion) rests on SpatialBench being a sufficiently complete proxy for arbitrary viewpoints, shifting domains, varying densities, and hardware constraints. The manuscript describes the 19 datasets and 5 domains but does not supply an explicit coverage analysis, selection criteria, or bias audit demonstrating that the chosen scenes exhaustively sample the relevant variation space; without this, failures on SpatialBench do not necessarily entail the broader negative generalization claim.
  Authors: We acknowledge that an explicit coverage analysis and bias audit are not provided in the current manuscript. The selection of the 19 datasets across 5 domains was guided by the goal of covering diverse spatial paradigms and real-world scenarios, as described in §3, with deterministic sampling to mitigate arbitrary frame selection issues. While we do not assert that SpatialBench exhaustively samples the entire variation space, the consistent underperformance across multiple dimensions supports the claim that models are not all-round players in the contexts evaluated. To strengthen this, we will add a subsection in §3 detailing the selection criteria, domain coverage rationale, and a brief discussion of potential limitations in representativeness.
  Revision: yes
- Referee: [Table 2 and §5.3] Table 2 and the embodied/egocentric results (§5.3): the reported performance gaps are used to argue that 'strict domain alignment and high data quality are far more critical than simple dataset scaling.' However, the paper does not report an ablation that isolates data quality from domain overlap or controls for model capacity, leaving the causal interpretation under-supported.
  Authors: The referee correctly notes the absence of controlled ablations. Our arguments in §5.3 are based on comparative evaluations of existing models with differing training data characteristics and scales. The results indicate that models benefiting from strict domain alignment and high-quality data outperform those relying primarily on scale, even under challenging embodied and egocentric settings. We agree this is observational evidence rather than causal proof from isolated ablations. We will revise the text in §5.3 to explicitly state the correlational nature of these findings and note that future work could include controlled experiments varying data quality while holding other factors constant.
  Revision: yes
Circularity Check
Empirical benchmark evaluation with no circular derivations
The paper introduces SpatialBench as a new cross-paradigm benchmark comprising 19 datasets and evaluates 41 existing models across tasks, domains, and input settings. All claims rest on direct empirical results from these evaluations rather than any derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or self-referential definitions appear in the provided text; the conclusion that models are not all-round players follows from observed performance gaps on the benchmark, which is externally constructed and not reduced to its own inputs by construction. This is a standard self-contained benchmark study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 19 datasets and 5 domains adequately represent the space of spatial tasks and generalization challenges.
invented entities (3)
- SpatialBench: no independent evidence
- DA-Next-5M: no independent evidence
- DA-Next: no independent evidence