pith. machine review for the scientific record. sign in

arxiv: 2109.08238 · v1 · submitted 2021-09-16 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

Santhosh K. Ramakrishnan , Aaron Gokaslan , Erik Wijmans , Oleksandr Maksymets , Alex Clegg , John Turner , Eric Undersander , Wojciech Galuba , Andrew Westbury , Angel X. Chang , Manolis Savva , Yili Zhao , Dhruv Batra

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords HM3D3D datasetEmbodied AIPointGoal navigation3D reconstructionindoor environmentsHabitat simulatordataset scale
0
0 comments X

The pith

HM3D dataset of 1000 real indoor 3D scenes produces PointGoal navigation agents that achieve top performance on HM3D, Gibson, and MP3D evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Habitat-Matterport 3D dataset containing 1000 building-scale textured 3D mesh reconstructions of real-world indoor spaces such as residences and stores. HM3D provides 112.5k square meters of navigable area, 1.4 to 3.7 times larger than prior building-scale sets, along with 20 to 85 percent higher visual fidelity in rendered images and 34 to 91 percent fewer reconstruction artifacts. Agents trained for PointGoal navigation on HM3D reach the highest success rates whether tested on HM3D itself or transferred to Gibson and MP3D, establishing the dataset as pareto optimal. No other training set supports the same cross-benchmark dominance, and HM3D agents reach 100 percent success on the Gibson test split.

Core claim

HM3D is pareto optimal in the sense that agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100 percent performance on Gibson-test dataset, suggesting that it might be time to retire that episode dataset.

What carries the argument

The HM3D collection of 1000 textured 3D mesh reconstructions of diverse real indoor spaces that supplies greater scale, completeness, and visual fidelity for embodied agent training.

If this is right

  • Embodied AI training pipelines can shift to HM3D as the primary source of environments because it yields superior agents on every tested benchmark.
  • Smaller datasets such as Gibson may reach saturation and become unnecessary for evaluation once agents achieve 100 percent success.
  • Increased scene diversity and fidelity in training data directly improves generalization of navigation policies across different indoor layouts.
  • Research on more complex embodied tasks can now leverage the larger navigable area and higher visual quality without immediate performance plateaus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future dataset construction for embodied AI should prioritize physical scale and surface completeness over other design choices to achieve cross-benchmark dominance.
  • The high visual fidelity may shorten the sim-to-real transfer gap when policies trained in HM3D are deployed on physical robots.
  • Benchmark suites could evolve to include cross-training evaluations as a standard test of dataset quality.
  • Larger environments open the possibility of studying long-horizon tasks that require agents to traverse multiple floors or visit distant rooms.

Load-bearing premise

The performance advantage of HM3D-trained agents arises primarily from the dataset's larger scale, reconstruction completeness, and visual fidelity rather than differences in training procedures or evaluation protocols.

What would settle it

Train identical PointGoal navigation agents on HM3D and on Gibson using the exact same procedure, then measure whether the HM3D-trained agents fail to exceed the Gibson-trained agents when both are evaluated on Gibson and MP3D test sets.

read the original abstract

We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each scene in the dataset consists of a textured 3D mesh reconstruction of interiors such as multi-floor residences, stores, and other private indoor spaces. HM3D surpasses existing datasets available for academic research in terms of physical scale, completeness of the reconstruction, and visual fidelity. HM3D contains 112.5k m^2 of navigable space, which is 1.4 - 3.7x larger than other building-scale datasets such as MP3D and Gibson. When compared to existing photorealistic 3D datasets such as Replica, MP3D, Gibson, and ScanNet, images rendered from HM3D have 20 - 85% higher visual fidelity w.r.t. counterpart images captured with real cameras, and HM3D meshes have 34 - 91% fewer artifacts due to incomplete surface reconstruction. The increased scale, fidelity, and diversity of HM3D directly impacts the performance of embodied AI agents trained using it. In fact, we find that HM3D is `pareto optimal' in the following sense -- agents trained to perform PointGoal navigation on HM3D achieve the highest performance regardless of whether they are evaluated on HM3D, Gibson, or MP3D. No similar claim can be made about training on other datasets. HM3D-trained PointNav agents achieve 100% performance on Gibson-test dataset, suggesting that it might be time to retire that episode dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents the Habitat-Matterport 3D (HM3D) dataset of 1,000 building-scale 3D reconstructions from diverse real-world indoor locations. It claims HM3D surpasses prior datasets (MP3D, Gibson, Replica, ScanNet) in physical scale (112.5k m² navigable space, 1.4-3.7× larger), visual fidelity (20-85% higher w.r.t. real-camera images), and reconstruction completeness (34-91% fewer artifacts). The central empirical result is that PointGoal navigation agents trained on HM3D achieve the highest performance regardless of evaluation on HM3D, Gibson, or MP3D test sets, including 100% success on Gibson-test, making HM3D 'pareto optimal' with no analogous claim possible for other datasets.

Significance. If the performance gains hold under matched training conditions, HM3D supplies a substantially larger and higher-fidelity resource that could become the default training and evaluation environment for embodied AI, enabling more robust policies and potentially retiring smaller benchmarks such as Gibson. The direct cross-dataset comparisons and quantitative fidelity metrics constitute a concrete contribution that strengthens the empirical foundation of the field.

major comments (1)
  1. [PointGoal navigation experiments] PointGoal navigation experiments (abstract and results): the pareto-optimality claim requires explicit confirmation that training protocols were identical across HM3D, Gibson, and MP3D. Details on matched episode counts, total steps, sampling strategy, and hyperparameters are needed, because HM3D's 1.4-3.7× larger navigable area implies substantially more unique episodes; without this, superior transfer performance could arise from greater data volume rather than scale, completeness, or fidelity.
minor comments (1)
  1. [Abstract] Abstract: the statement '100% performance on Gibson-test dataset' should specify the exact metric (success rate, SPL, etc.) and any evaluation conditions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [PointGoal navigation experiments] PointGoal navigation experiments (abstract and results): the pareto-optimality claim requires explicit confirmation that training protocols were identical across HM3D, Gibson, and MP3D. Details on matched episode counts, total steps, sampling strategy, and hyperparameters are needed, because HM3D's 1.4-3.7× larger navigable area implies substantially more unique episodes; without this, superior transfer performance could arise from greater data volume rather than scale, completeness, or fidelity.

    Authors: We thank the referee for this observation. The training protocols were identical across HM3D, Gibson, and MP3D: the same hyperparameters were used for all runs, the same total number of training steps was performed, and episodes were sampled uniformly at random from the training scenes of each dataset. To ensure a fair comparison given the differing navigable areas, we matched the number of training episodes across datasets by subsampling from the larger ones (HM3D and Gibson) to equal the episode count available from the smallest dataset. This controlled for data volume, so that performance differences can be attributed to scale, fidelity, and completeness. The manuscript describes the shared experimental setup in Section 4, but we agree it would benefit from greater explicitness. We will revise the paper to add a dedicated paragraph and summary table confirming the matched episode counts, steps, sampling strategy, and hyperparameters. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical dataset paper with direct experimental comparisons

full rationale

The paper introduces the HM3D dataset and supports its 'pareto optimal' claim via reported PointNav training and cross-evaluation results on HM3D, Gibson, and MP3D. No equations, parameter fits, or derivations appear in the provided text. The performance claim is an empirical observation from agent training runs, not a reduction to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. Training protocol details are not shown to collapse into the dataset properties by construction. This is a standard dataset contribution whose central assertions rest on external experimental outcomes rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset paper relying on established 3D scanning and reconstruction methods without introducing new free parameters, axioms, or entities.

pith-pipeline@v0.9.0 · 5672 in / 1080 out tokens · 51606 ms · 2026-05-14T18:18:41.752925+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    SYNCR benchmark shows leading MLLMs reach only 52.5% average accuracy on cross-video reasoning tasks against an 89.5% human baseline, with major weaknesses in physical and spatial reasoning.

  2. InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

    cs.CV 2026-04 unverdicted novelty 7.0

    InHabit generates 78K photorealistic 3D human-scene interaction samples across 800 scenes by rendering scenes, using foundation models to propose actions and insert humans, then optimizing to SMPL-X bodies, improving ...

  3. Semantic Area Graph Reasoning for Multi-Robot Language-Guided Search

    cs.RO 2026-04 unverdicted novelty 7.0

    SAGR builds a semantic area graph from occupancy maps so LLMs can assign rooms to robots for language-guided search, staying competitive with standard exploration while improving semantic target finding by up to 18.8%...

  4. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  5. UniDAC: Universal Metric Depth Estimation for Any Camera

    cs.CV 2026-03 unverdicted novelty 7.0

    UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.

  6. When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

    cs.AI 2026-05 conditional novelty 6.0

    LongAct benchmark reveals top VLMs reach only 59% goal completion and 16% full success on long-horizon household tasks, while HoloMind agent improves results via DAG planner, multimodal spatial memory, episodic memory...

  7. Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

    cs.RO 2026-05 unverdicted novelty 6.0

    SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.

  8. Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation

    cs.RO 2026-05 unverdicted novelty 6.0

    PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.

  9. SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 6.0

    SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.

  10. OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    OVAL introduces an open-vocabulary memory model with structured descriptors and multi-value frontier scoring to enable efficient lifelong object goal navigation in unseen settings.

  11. Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

    cs.RO 2026-04 unverdicted novelty 6.0

    Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.

  12. FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    FSUNav's dual brain-inspired modules achieve state-of-the-art zero-shot goal navigation across heterogeneous robots with improved speed, safety, and generalization.

  13. Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grou...

  14. ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation

    cs.RO 2026-03 conditional novelty 6.0

    ReMemNav improves zero-shot object navigation success and efficiency by integrating episodic memory and rethinking with VLMs, achieving SR/SPL gains of 1.7%/7.0% on HM3D v0.1, 18.2%/11.1% on HM3D v0.2, and 8.7%/7.9% on MP3D.

  15. Learning Material-Aware Hamiltonian Risk Fields for Safe Navigation

    cs.LG 2026-05 unverdicted novelty 5.0

    A learned context-energy term in port-Hamiltonian policies creates selective risk navigation that activates evasive forces only when safer paths are available.

  16. TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation

    cs.CV 2026-05 unverdicted novelty 5.0

    TrajRAG uses a topological-polar trajectory representation and hierarchical retrieval to accumulate and reuse geometric-semantic navigation experiences, improving zero-shot ObjectNav on MP3D and HM3D benchmarks.

  17. UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

    cs.CV 2026-04 unverdicted novelty 5.0

    UpstreamQA disentangles video reasoning by using LRMs for explicit upstream object identification and scene context before downstream LMM VideoQA, improving performance and interpretability on OpenEQA and NExTQA in so...

  18. Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents

    cs.CV 2026-04 unverdicted novelty 5.0

    ABot-Explorer unifies online exploration and hierarchical semantic memory construction via VLM-distilled navigational affordances for improved embodied navigation efficiency.

  19. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 5.0

    Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

  20. Think before Go: Hierarchical Reasoning for Image-goal Navigation

    cs.RO 2026-04 unverdicted novelty 5.0

    HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.

  21. HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HOG-Layout enables text-driven hierarchical 3D scene generation, optimization, and real-time editing using LLMs, VLMs, RAG for semantic consistency, and an optimization module for physical plausibility.

  22. IGV-RRT: Prior-Real-Time Observation Fusion for Active Object Search in Changing Environments

    cs.RO 2026-03 unverdicted novelty 5.0

    IGV-RRT improves object goal navigation in dynamic indoor environments by combining uncertainty-aware priors from 3D scene graphs with online VLM observations in a real-time tree planner.

  23. A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

    cs.RO 2026-04 unverdicted novelty 4.0

    A modular VLN architecture builds a cognitive memory graph, decomposes it for VLM reasoning, and solves a weighted traveling repairman problem for context-aware exploration to achieve real-time performance and higher ...

  24. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 24 Pith papers · 6 internal anchors

  1. [1]

    SceneNN: A scene meshes dataset with annotations

    Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A scene meshes dataset with annotations. In 2016 Fourth International Conference on 3D Vision (3DV), pages 92–101. IEEE, 2016. 2, 3

  2. [2]

    ScanNet: Richly-annotated 3D reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017. 2, 3, 5

  3. [3]

    Joint 2D-3D-Semantic Data for Indoor Scene Understanding

    Iro Armeni, Sasha Sax, Amir R Zamir, and Silvio Savarese. Joint 2D-3D-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017. 2, 3

  4. [4]

    Matterport3D: Learning from RGB-D data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. Fifth International Conference on 3D Vision (3DV), 2017. 2, 3, 5, 13

  5. [5]

    Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese

    Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018. 2, 4, 5, 13

  6. [6]

    Habitat: A Platform for Embodied AI Research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9339–9347, 2019. 2, 3, 4, 5, 6, 7

  7. [7]

    On Evaluation of Embodied Navigation Agents

    Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018. 2, 7

  8. [8]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-Thor: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474, 2017. 3

  9. [9]

    Chalet: Cornell house agent learning environment

    Claudia Yan, Dipendra Misra, Andrew Bennnett, Aaron Walsman, Yonatan Bisk, and Yoav Artzi. Chalet: Cornell house agent learning environment. arXiv preprint arXiv:1801.07357, 2018. 3

  10. [10]

    VirtualHome: Simulating household activities via programs

    Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. VirtualHome: Simulating household activities via programs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8494–8502, 2018

  11. [11]

    Habitat 2.0: Training home assistants to rearrange their habitat

    Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. arXiv preprint arXiv:2106.14405, 2021. 3

  12. [12]

    Semantic scene completion from a single depth image

    Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1746–1754, 2017

  13. [13]

    2021.doi: 10.48550/arXiv.2011.09127

    Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Cao Li, Zengqi Xun, Chengyue Sun, Yiyun Fei, Yu Zheng, Ying Li, et al. 3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics. arXiv preprint arXiv:2011.09127, 2020. 3

  14. [14]

    RoboTHOR: An open simulation-to-real embodied AI platform

    Matt Deitke, Winson Han, Alvaro Herrasti, Aniruddha Kembhavi, Eric Kolve, Roozbeh Mottaghi, Jordi Salvador, Dustin Schwenk, Eli VanderBilt, Matthew Wallingford, Luca Weihs, Mark Yatskar, and Ali Farhadi. RoboTHOR: An open simulation-to-real embodied AI platform. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 31...

  15. [15]

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019. 3, 5

  16. [16]

    Rescan: Inductive instance segmentation for indoor RGBD scans

    Maciej Halber, Yifei Shi, Kai Xu, and Thomas Funkhouser. Rescan: Inductive instance segmentation for indoor RGBD scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 2541–2550, 2019

  17. [17]

    RIO: 3D object instance re-localization in changing indoor environments

    Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. RIO: 3D object instance re-localization in changing indoor environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019. 3

  18. [18]

    3D semantic parsing of large-scale indoor spaces

    Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016. 3, 13 10

  19. [19]

    iGibson, a simulation environment for interactive tasks in large realistic scenes

    Bokui Shen, Fei Xia, Chengshu Li, Roberto Martın-Martın, Linxi Fan, Guanzhi Wang, Shyamal Buch, Claudia D’Arpino, Sanjana Srivastava, Lyne P Tchapmi, Kent Vainio, Li Fei-Fei, and Silvio Savarese. iGibson, a simulation environment for interactive tasks in large realistic scenes. arXiv preprint, 2020. 3, 6

  20. [20]

    ARKitScenes-a diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data, 2021

    Afshin Dehghan, Gilad Baruch, Zhuoyuan Chen, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, and Elad Shulman. ARKitScenes-a diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data, 2021. URL https://openreview.net/ pdf?id=tjZjv_qh_CE. 3

  21. [21]

    https://www.nii.ac.jp/dsc/idr/lifull/

    LIFULL HOME. https://www.nii.ac.jp/dsc/idr/lifull/. 3

  22. [22]

    Cubicasa5k: A dataset and an improved multi-task model for floorplan image analysis

    Ahti Kalervo, Juha Ylioinas, Markus Häikiö, Antti Karhu, and Juho Kannala. Cubicasa5k: A dataset and an improved multi-task model for floorplan image analysis. In Scandinavian Conference on Image Analysis, pages 28–40. Springer, 2019

  23. [23]

    Data-driven interior plan generation for residential buildings

    Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, Yu-Hao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings. ACM Transactions on Graphics (TOG), 38(6):1–12, 2019. 3

  24. [24]

    Zillow indoor dataset: Annotated floor plans with 360deg panoramas and 3d room layouts

    Steve Cruz, Will Hutchcroft, Yuguang Li, Naji Khosravan, Ivaylo Boyadzhiev, and Sing Bing Kang. Zillow indoor dataset: Annotated floor plans with 360deg panoramas and 3d room layouts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2133–2143, 2021. 3

  25. [25]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017. 6

  26. [26]

    Demystifying MMD GANs

    Mikołaj Bi´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018. 6

  27. [27]

    Cognitive mapping and planning for visual navigation

    Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017. 6

  28. [28]

    Semi-parametric topological memory for navigation

    Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In International Conference on Learning Representations, 2018

  29. [29]

    DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames

    Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames. InInternational Conference on Learning Representations (ICLR), 2020. 7, 8, 13, 19

  30. [30]

    Neural topological slam for visual navigation

    Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, and Saurabh Gupta. Neural topological slam for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12875–12884, 2020. 6

  31. [31]

    Robot navigation in constrained pedestrian environments using reinforcement learning

    Claudia Pérez-D’Arpino, Can Liu, Patrick Goebel, Roberto Martín-Martín, and Silvio Savarese. Robot navigation in constrained pedestrian environments using reinforcement learning. arXiv preprint arXiv:2010.08600, 2020. 7

  32. [32]

    Occupancy anticipation for efficient exploration and navigation

    Santhosh K Ramakrishnan, Ziad Al-Halah, and Kristen Grauman. Occupancy anticipation for efficient exploration and navigation. In European Conference on Computer Vision, pages 400–418. Springer, 2020

  33. [33]

    Differentiable slam-net: Learning particle slam for visual navigation

    Peter Karkus, Shaojun Cai, and David Hsu. Differentiable slam-net: Learning particle slam for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2815–2825, 2021. 7

  34. [34]

    Objectnav revisited: On evaluation of embodied agents navigating to objects.arXiv preprint arXiv:2006.13171, 2020

    Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171, 2020. 7

  35. [35]

    Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding

    Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954, 2020

  36. [36]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018. 7

  37. [37]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 7, 13

  38. [38]

    LSTM can solve hard long time lag problems

    Sepp Hochreiter and Jürgen Schmidhuber. LSTM can solve hard long time lag problems. Advances in neural information processing systems, pages 473–479, 1997. 7

  39. [39]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 9

  40. [40]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

  41. [41]

    Revisiting unreasonable effectiveness of data in deep learning era

    Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision , pages 843–852, 2017. 11

  42. [42]

    Billion-scale semi-supervised learning for image classification

    I Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019. 9 6 Acknowledgements We thank all the volunteers who contributed to the dataset curation effort: Harsh Agrawal, Sashank Gondala, Rishabh Jain, Shawn Jiang, Yash Kant, Noah Maestre, ...

  43. [43]

    We calculate the normalized histogram of geodesic distances between the start and goal locations for each episode in the train and val splits (independently)

    EMD (train, val) measures the dissimilarity between episodes in the train and val splits. We calculate the normalized histogram of geodesic distances between the start and goal locations for each episode in the train and val splits (independently). We then measure the distribution shift between the train and val episodes. This is done by computing the Ear...

  44. [44]

    This is calculated as the mean of KID (Gibson real) and KID (MP3D real) from Table 5(b) in the main paper

    KID (mean) is a measure of visual fidelity of images rendered from each dataset. This is calculated as the mean of KID (Gibson real) and KID (MP3D real) from Table 5(b) in the main paper

  45. [45]

    % defects

    % defects is a measure of reconstruction completeness for the 3D scans. For each dataset, this is calculated as the mean of “% defects" values from Figure 4 in the main paper

  46. [46]

    It is computed as the overall navigable area in the training scans for each dataset

    Navigable area (m2) measures the dataset size. It is computed as the overall navigable area in the training scans for each dataset. We compute the above metrics for all the train datasets6. For a given PointNav val set, we measure the Pearson’s correlation between each of the above metrics for a train dataset and the navigation SPL achieved by agents trai...