arxiv: 2605.27367 · v2 · pith:XKIDAMGZnew · submitted 2026-05-26 · 💻 cs.CV

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

Pith reviewed 2026-06-29 17:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial foundation models · benchmark · generalization · 3D vision · embodied AI · egocentric vision · multi-task evaluation · attention mechanisms

The pith

Spatial foundation models are not yet all-round players that generalize across tasks, viewpoints, domains, densities, and hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SpatialBench to test whether spatial foundation models can handle diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and hardware constraints. It evaluates 41 models from 6 paradigms on 5 task suites built from 19 datasets and 546 scenes across 5 domains, using deterministic sampling and four input-density settings. No model performs robustly everywhere. Additional tests indicate that full-context attention improves accuracy, that bounded-memory methods help with long sequences, and that strict domain alignment plus data quality matter more than raw dataset size. To close the largest data gap identified, the authors release the DA-Next-5M dataset and the DA-Next baseline model.

Core claim

SpatialBench shows that current spatial foundation models cannot generalize robustly across the tested conditions; full-context attention maximizes accuracy while bounded-memory strategies enable long-sequence handling, and domain alignment plus high data quality outperform simple scaling in embodied and egocentric settings.

What carries the argument

SpatialBench, a cross-paradigm benchmark of 19 datasets and 546 scenes with deterministic sampling across 5 domains, 6 paradigms, 5 task suites, and 4 input densities.

If this is right

  • Full-context attention should be used when accuracy is the priority.
  • Bounded-memory attention mechanisms can scale models to longer sequences.
  • Training data should prioritize domain alignment and quality over volume.
  • Embodied and egocentric tasks require specialized handling beyond standard scaling.
  • The released DA-Next-5M dataset and DA-Next model provide a stronger starting point for future spatial representation learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Spatial models may need separate pathways for egocentric versus third-person data rather than unified scaling.
  • Hardware-specific constraints could be tested directly on physical robots to confirm the benchmark's density and memory findings.
  • The emphasis on domain alignment suggests synthetic pre-training alone may remain insufficient without targeted real-world fine-tuning.
  • Future benchmarks could add explicit viewpoint randomization schedules to measure generalization more precisely than the current fixed sampling.

Load-bearing premise

The 19 datasets, 5 domains, 6 paradigms, and 5 task suites in SpatialBench sufficiently represent real-world spatial generalization challenges.

What would settle it

Discovery of one model that ranks at or near the top on every task suite under all four input densities and across all five domains would falsify the central claim.
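
The falsification test above can be made concrete with a small check. The sketch below is illustrative only: the score table layout, the `tolerance` threshold for "at or near the top", and the function name `all_round_models` are assumptions, not anything defined by the paper or by SpatialBench. It assumes positive, higher-is-better scores.

```python
from typing import Dict, List, Tuple

# Hypothetical cell key: (task suite, input density, domain).
Cell = Tuple[str, int, str]


def all_round_models(
    scores: Dict[str, Dict[Cell, float]], tolerance: float = 0.05
) -> List[str]:
    """Return models whose score is within `tolerance` (relative) of the
    best score in every evaluated cell. Missing cells count as failures."""
    # Collect every cell that at least one model was evaluated on.
    cells = set()
    for per_model in scores.values():
        cells.update(per_model)

    # Best observed score per cell (higher is better by assumption).
    best = {c: max(s[c] for s in scores.values() if c in s) for c in cells}

    winners = []
    for model, per_model in scores.items():
        near_top_everywhere = all(
            c in per_model and per_model[c] >= (1.0 - tolerance) * best[c]
            for c in cells
        )
        if near_top_everywhere:
            winners.append(model)
    return sorted(winners)
```

An empty result on the full grid of task suites, densities, and domains would be consistent with the paper's claim; a non-empty result would count against it, subject to how "near the top" is defined.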

Original abstract

While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round players capable of generalizing robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and specific hardware constraints? Answering this overarching question requires a holistic assessment, yet current models are mainly evaluated on specific domains for which they were specifically designed or trained. Such evaluations are intrinsically limited by narrow paradigm coverage, limited scene domains, and arbitrary frame sampling, making it fundamentally difficult to assess their true generalization capabilities. To address this gap, we present SpatialBench, a cross-paradigm, domain-diverse benchmark for spatial foundation models with deterministic sampling. SpatialBench features unprecedented scale and rigorous deterministic design, comprising 19 datasets and 546 scenes across 5 diverse spatial domains. It comprehensively evaluates 41 models across 6 paradigms on 5 task suites under 4 different input density settings. Our extensive evaluation reveals that current models are not yet all-round players, and uncovers crucial insights for future advancement. Specifically, we demonstrate that full-context attention maximizes accuracy while bounded-memory strategies unlock long-sequence scalability. Moreover, our empirical evaluations in challenging embodied and egocentric tasks demonstrate that strict domain alignment and high data quality are far more critical to performance than simple dataset scaling. Furthermore, to address the largest data gap identified in our analysis, we go beyond evaluation by introducing a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next, pushing the boundaries of spatial representation learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SpatialBench, a cross-paradigm benchmark comprising 19 datasets, 546 scenes across 5 spatial domains, 6 paradigms, and 5 task suites, with deterministic sampling and evaluations of 41 models under 4 input-density settings. It concludes that current spatial foundation models are not all-round players capable of robust generalization across tasks, viewpoints, domains, densities, and hardware constraints. The work also releases DA-Next-5M (a 5M-scale dataset) and the DA-Next baseline model to address the largest identified data gap, while reporting empirical findings on full-context attention versus bounded-memory strategies and the primacy of domain alignment over dataset scaling.

Significance. If the benchmark's coverage is shown to be representative, the work supplies a large-scale, multi-paradigm evaluation that could usefully quantify current limitations in spatial foundation models and highlight actionable directions (attention mechanisms, data quality). The release of DA-Next-5M and a competitive baseline constitutes a concrete resource contribution. The deterministic-sampling design and scale (546 scenes) are strengths that distinguish it from prior narrower evaluations.

major comments (2)
  1. [§3 (Benchmark Construction) and §1 (Introduction)] The central claim that 'current models are not yet all-round players' (abstract, §1, conclusion) rests on SpatialBench being a sufficiently complete proxy for arbitrary viewpoints, shifting domains, varying densities, and hardware constraints. The manuscript describes the 19 datasets and 5 domains but does not supply an explicit coverage analysis, selection criteria, or bias audit demonstrating that the chosen scenes exhaustively sample the relevant variation space; without this, failures on SpatialBench do not necessarily entail the broader negative generalization claim.
  2. [Table 2 and §5.3] The performance gaps reported in Table 2 and in the embodied/egocentric results (§5.3) are used to argue that 'strict domain alignment and high data quality are far more critical than simple dataset scaling.' However, the paper does not report an ablation that isolates data quality from domain overlap or controls for model capacity, leaving the causal interpretation under-supported.
minor comments (2)
  1. [§3.2] The deterministic sampling procedure is described in prose but would benefit from an explicit algorithm box or pseudocode to allow exact reproduction.
  2. [Figure 4] Figure 4 (input-density ablations) uses inconsistent y-axis scaling across subplots, making visual comparison of relative drops difficult.
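
As an illustration of the kind of pseudocode the first minor comment asks for, here is a minimal sketch of one possible deterministic frame sampler. It is not the paper's procedure, which is described only in prose and is not reproduced here. The evenly spaced rule and the names `sample_frame_indices`, `num_frames`, and `num_views` are assumptions made for this example.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_views: int) -> List[int]:
    """Pick `num_views` evenly spaced frame indices from a sequence of
    `num_frames` frames, using integer arithmetic only (no randomness)."""
    if num_frames <= 0 or num_views <= 0:
        raise ValueError("num_frames and num_views must be positive")
    if num_views >= num_frames:
        # Not enough frames to subsample: use every frame.
        return list(range(num_frames))
    if num_views == 1:
        return [0]
    # Always include the first and last frame; floor-divide the rest.
    return [(i * (num_frames - 1)) // (num_views - 1) for i in range(num_views)]
```

Because the output depends only on the two integer inputs, any two runs (or two research groups) select the same frames for a given scene and density setting, which is the property an algorithm box would need to guarantee.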

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing clarifications and indicating planned revisions where appropriate.

  1. Referee: [§3 (Benchmark Construction) and §1 (Introduction)] The central claim that 'current models are not yet all-round players' (abstract, §1, conclusion) rests on SpatialBench being a sufficiently complete proxy for arbitrary viewpoints, shifting domains, varying densities, and hardware constraints. The manuscript describes the 19 datasets and 5 domains but does not supply an explicit coverage analysis, selection criteria, or bias audit demonstrating that the chosen scenes exhaustively sample the relevant variation space; without this, failures on SpatialBench do not necessarily entail the broader negative generalization claim.

    Authors: We acknowledge that an explicit coverage analysis and bias audit are not provided in the current manuscript. The selection of the 19 datasets across 5 domains was guided by the goal of covering diverse spatial paradigms and real-world scenarios, as described in §3, with deterministic sampling to mitigate arbitrary frame selection issues. While we do not assert that SpatialBench exhaustively samples the entire variation space, the consistent underperformance across multiple dimensions supports the claim that models are not all-round players in the contexts evaluated. To strengthen this, we will add a subsection in §3 detailing the selection criteria, domain coverage rationale, and a brief discussion of potential limitations in representativeness. revision: yes

  2. Referee: [Table 2 and §5.3] The performance gaps reported in Table 2 and in the embodied/egocentric results (§5.3) are used to argue that 'strict domain alignment and high data quality are far more critical than simple dataset scaling.' However, the paper does not report an ablation that isolates data quality from domain overlap or controls for model capacity, leaving the causal interpretation under-supported.

    Authors: The referee correctly notes the absence of controlled ablations. Our arguments in §5.3 are based on comparative evaluations of existing models with differing training data characteristics and scales. The results indicate that models benefiting from strict domain alignment and high-quality data outperform those relying primarily on scale, even under challenging embodied and egocentric settings. We agree this is observational evidence rather than causal proof from isolated ablations. We will revise the text in §5.3 to explicitly state the correlational nature of these findings and note that future work could include controlled experiments varying data quality while holding other factors constant. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no circular derivations

The paper introduces SpatialBench as a new cross-paradigm benchmark comprising 19 datasets and evaluates 41 existing models across tasks, domains, and input settings. All claims rest on direct empirical results from these evaluations rather than any derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or self-referential definitions appear in the provided text; the conclusion that models are not all-round players follows from observed performance gaps on the benchmark, which is externally constructed and not reduced to its own inputs by construction. This is a standard self-contained benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim that models are not all-round players rests on the domain assumption that the chosen datasets adequately sample the space of spatial tasks; the paper also introduces three new artifacts (benchmark, dataset, model) with no independent evidence outside this work.

axioms (1)
  • domain assumption: The 19 datasets and 5 domains adequately represent the space of spatial tasks and generalization challenges.
    This premise is required to extrapolate from benchmark results to the claim that current models are not all-round players.
invented entities (3)
  • SpatialBench (no independent evidence)
    purpose: Cross-paradigm benchmark for spatial foundation models
    New benchmark introduced by the paper.
  • DA-Next-5M (no independent evidence)
    purpose: Large-scale dataset to address identified data gap
    New dataset introduced by the paper.
  • DA-Next (no independent evidence)
    purpose: Strong baseline model for spatial representation learning
    New model introduced by the paper.

pith-pipeline@v0.9.1-grok · 5853 in / 1480 out tokens · 54145 ms · 2026-06-29T17:46:41.243111+00:00 · methodology
