arxiv: 2509.10813 · v4 · submitted 2025-09-13 · 💻 cs.CV · cs.RO

InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

Pith reviewed 2026-05-18 17:08 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords: indoor scene dataset · simulatable scenes · embodied AI · realistic layouts · scene generation · point-goal navigation · 3D object collections · data processing pipeline

The pith

InternScenes integrates real scans, procedurally generated scenes, and designer-created scenes into roughly 40,000 simulatable indoor environments with realistic layouts that retain small objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current indoor scene datasets fall short for embodied AI because they are limited in scale or diversity, omit small objects, and contain unresolved object collisions. It addresses this by combining three scene sources into roughly 40,000 scenes containing 1.96 million objects across 15 scene types and 288 object classes, with an average of 41.5 objects per region. A processing pipeline turns scanned rooms into simulatable replicas, adds interactive objects, and uses physical simulation to resolve collisions. Benchmarks on scene layout generation and point-goal navigation indicate that these richer scenes pose harder problems while making large-scale training feasible for both tasks.

Core claim

InternScenes is a large-scale simulatable indoor scene dataset built by merging real-world scans, procedurally generated scenes, and designer-created scenes to produce approximately 40,000 diverse environments containing 1.96M 3D objects, 15 common scene types, and 288 object classes. The dataset preserves massive numbers of small items, yielding realistic layouts with an average of 41.5 objects per region. A dedicated processing pipeline creates real-to-sim replicas, inserts interactive objects, and eliminates collisions through physical simulation, thereby supporting training at scale for embodied AI tasks such as scene layout generation and point-goal navigation.
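As a rough consistency check on the headline numbers, the sketch below derives the implied per-scene object count. It assumes exactly 40,000 scenes (the paper says "approximately") and, for the region estimate, that every object belongs to exactly one region. Neither assumption comes from the paper, so the results are a sanity check rather than dataset statistics.

```python
# Back-of-the-envelope check on the reported dataset statistics.
# Assumptions (not from the paper): exactly 40,000 scenes, and every
# object is assigned to exactly one region.
TOTAL_OBJECTS = 1_960_000      # "1.96M 3D objects"
NUM_SCENES = 40_000            # "approximately 40,000" scenes
OBJECTS_PER_REGION = 41.5      # reported average per region

objects_per_scene = TOTAL_OBJECTS / NUM_SCENES
implied_regions = TOTAL_OBJECTS / OBJECTS_PER_REGION
regions_per_scene = implied_regions / NUM_SCENES

print(f"objects per scene  ~ {objects_per_scene:.1f}")
print(f"implied regions    ~ {implied_regions:,.0f}")
print(f"regions per scene  ~ {regions_per_scene:.2f}")
```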

What carries the argument

The data processing pipeline that converts real scans into simulatable replicas, adds interactive objects, and clears collisions with physical simulations while retaining scene diversity.
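This page does not describe how the physical simulation is configured, so the following is only a toy stand-in for the idea of clearing collisions by moving objects. It treats each object as an axis-aligned rectangle in the bird's-eye-view plane and pushes overlapping pairs apart along the axis of least penetration. The `Box` type, `penetration`, and `resolve` are hypothetical names, and this relaxation is not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned footprint of an object in the bird's-eye-view plane."""
    cx: float  # centre x (m)
    cy: float  # centre y (m)
    w: float   # extent along x (m)
    h: float   # extent along y (m)


def penetration(a: Box, b: Box) -> tuple[float, float]:
    """Overlap depth along x and y; both > 0 means the boxes collide."""
    ox = (a.w + b.w) / 2 - abs(a.cx - b.cx)
    oy = (a.h + b.h) / 2 - abs(a.cy - b.cy)
    return ox, oy


def resolve(boxes: list[Box], max_iters: int = 100, eps: float = 1e-9) -> int:
    """Push colliding pairs apart; returns the number of sweeps that moved boxes.

    Toy relaxation only: no gravity, support, rotation, or friction.
    """
    for sweep in range(max_iters):
        moved = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                ox, oy = penetration(a, b)
                if ox <= eps or oy <= eps:
                    continue  # separated or merely touching
                moved = True
                if ox < oy:   # cheaper to separate along x
                    shift = ox / 2 if a.cx <= b.cx else -ox / 2
                    a.cx -= shift
                    b.cx += shift
                else:         # cheaper to separate along y
                    shift = oy / 2 if a.cy <= b.cy else -oy / 2
                    a.cy -= shift
                    b.cy += shift
        if not moved:
            return sweep
    return max_iters


# Hypothetical scene: two 1 m x 1 m footprints overlapping by 0.5 m along x.
scene = [Box(0.0, 0.0, 1.0, 1.0), Box(0.5, 0.0, 1.0, 1.0)]
sweeps = resolve(scene)
```

Even in this toy version, clearing the overlap moves both objects, which is why the premise below (that resolution does not distort layouts) needs empirical support.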

If this is right

  • Scene layout generation and point-goal navigation tasks become feasible at larger scale because the dataset supplies complex yet collision-free environments.
  • Training on these scenes exposes models to new difficulties arising from dense small-object layouts that earlier datasets omitted.
  • The combination of real scans and generated content allows direct comparison of performance across different scene origins within the same benchmark suite.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could support additional embodied tasks such as object manipulation or multi-agent interaction once interactive objects are more fully utilized.
  • Similar merging pipelines might be applied to outdoor or dynamic environments to create comparable large-scale resources.
  • Open-sourcing the data and benchmarks invites direct replication studies that measure how much the added small objects and collision resolution improve downstream robotic transfer.

Load-bearing premise

The processing steps that turn real scans into simulatable versions, add interactive objects, and remove collisions succeed without creating new artifacts or lowering scene variety.

What would settle it

Models trained on InternScenes show no measurable gain in success rate or efficiency for point-goal navigation or layout generation when tested against models trained only on prior smaller datasets.
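Success and efficiency in point-goal navigation are commonly summarised by success rate and SPL (Success weighted by Path Length). This page does not state which metrics the paper reports, so the sketch below is an assumption about how such a comparison could be scored. The `Episode` type and the sample episodes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool          # did the agent stop within the goal radius?
    shortest_path: float   # geodesic start-to-goal distance (m), assumed > 0
    agent_path: float      # length of the path the agent actually took (m)


def success_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: list[Episode]) -> float:
    """SPL = mean over episodes of S_i * l_i / max(p_i, l_i)."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Hypothetical episodes, for illustration only.
episodes = [
    Episode(True, 5.0, 5.0),    # success along the shortest path
    Episode(True, 4.0, 8.0),    # success, but twice the shortest path
    Episode(False, 6.0, 3.0),   # failure contributes zero
    Episode(True, 2.0, 2.5),
]
print(f"SR={success_rate(episodes):.3f}  SPL={spl(episodes):.3f}")
```

Comparing these two numbers for a model trained on InternScenes against one trained only on earlier datasets, on the same held-out scenes, is the kind of test the statement above calls for.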

Figures

Figures reproduced from arXiv: 2509.10813 by Bo Dai, Hanqing Wang, Jiangmiao Pang, Jingli Lin, Li Luo, Peizhou Cao, Tai Wang, Weipeng Zhong, Wenzhe Cai, Xudong Xu, Yichen Jin, Zhaoyang Lyu.

Figure 1: InternScenes is a large-scale, simulatable indoor scene dataset with diverse layouts and ...
Figure 2: Pipeline for retrieving synthetic scenes from real scan scenes
Figure 3: Pipeline for annotating and processing raw scenes to extract precise layout information.
Figure 4: Examples from InternScenes-Real2Sim. Each scene shows its BEV map as well as one ...
Figure 5: Examples from InternScenes-Gen. The BEV map and one isometric view are shown.
Figure 6: Examples from InternScenes-Synthetic. The BEV map and one isometric view are shown.
Figure 7: Region statistics. Our dataset includes 15 common scene categories, such as the resting room and the living room. We also show the distribution of region areas
Figure 8: Distribution of objects across 288 categories. We list the 30 categories with the highest ...
Figure 9: Object bounding boxes volume statistics.
Figure 10: Distribution of object density (number of objects per m ...
Figure 11: Distribution of 100 object categories conditioned on 15 different types
Figure 12: Examples of symmetrical L-shaped couches.
Figure 13: Examples of BEV maps and rendered images of their corresponding sampling points
Figure 14: Instance annotation interface UI
Figure 15: Inspection results of scene 4
Figure 16: Inspection results of scene 9
Figure 17: Examples of regions generated by baseline models trained on a simplified version of the ...
Figure 18: Examples of regions generated by baseline models trained on the full version of the ...
Figure 19: Scenes for the navigation evaluation
Original abstract

The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce InternScenes, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes by integrating three disparate scene sources, real-world scans, procedurally generated scenes, and designer-created scenes, including 1.96M 3D objects and covering 15 common scene types and 288 object classes. We particularly preserve massive small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions by physical simulations. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up the model training for both tasks, making the generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InternScenes, a large-scale simulatable indoor scene dataset of approximately 40,000 scenes constructed by integrating real-world scans, procedurally generated scenes, and designer-created scenes. It contains 1.96M 3D objects across 15 scene types and 288 classes, with an average of 41.5 objects per region to preserve small items and achieve realistic, complex layouts. A data processing pipeline creates real-to-sim replicas, adds interactive objects, and resolves collisions through physical simulations. The dataset is evaluated on scene layout generation and point-goal navigation benchmarks, which demonstrate new challenges from the complex layouts and support scaling model training.

Significance. If the pipeline produces scenes that remain both simulatable and faithful to realistic layouts, InternScenes would address key limitations in existing datasets (scale, diversity, small-object density, and collision-free simulatability) and provide a useful resource for Embodied AI research. The open-sourcing commitment and the two benchmark tasks that expose scaling challenges are positive contributions.

major comments (2)
  1. [Abstract / Data Processing Pipeline] Abstract and pipeline description: the claim that physical simulation resolves collisions while preserving realistic layouts lacks any quantitative pre-/post-simulation metrics (overlap volume, centroid displacement, layout entropy, or small-object distribution statistics). Without these, it is impossible to verify that the dynamics-driven adjustments do not systematically alter placements in dense scenes (average 41.5 objects per region).
  2. [Abstract] Abstract: no error metrics, fidelity scores, or before-after comparisons are reported for the real-to-sim replica creation step, leaving the central claim of simulatability and realism without empirical grounding.
minor comments (2)
  1. [Abstract] The abstract states coverage of 15 scene types and 288 object classes but does not clarify how these categories were defined or validated against standard taxonomies.
  2. Figure and table captions should explicitly state whether statistics (e.g., object counts, region averages) are computed before or after the collision-resolution step.
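The pre-/post-simulation metrics requested in major comment 1 are simple to compute once object boxes are available. The following is a minimal sketch, assuming each object is summarised by an axis-aligned 3D bounding box and that objects keep the same index before and after simulation. The paper's actual object representation is not described on this page, and all names and sample boxes here are hypothetical.

```python
import math
from itertools import combinations

# An axis-aligned box is (min_xyz, max_xyz), each a 3-tuple in metres.
AABB = tuple[tuple[float, float, float], tuple[float, float, float]]


def overlap_volume(a: AABB, b: AABB) -> float:
    """Volume of the intersection of two axis-aligned boxes (0 if disjoint)."""
    vol = 1.0
    for axis in range(3):
        lo = max(a[0][axis], b[0][axis])
        hi = min(a[1][axis], b[1][axis])
        if hi <= lo:
            return 0.0
        vol *= hi - lo
    return vol


def total_overlap(boxes: list[AABB]) -> float:
    """Sum of pairwise intersection volumes in one scene."""
    return sum(overlap_volume(a, b) for a, b in combinations(boxes, 2))


def centroid(box: AABB) -> tuple[float, ...]:
    return tuple((lo + hi) / 2 for lo, hi in zip(box[0], box[1]))


def mean_centroid_displacement(before: list[AABB], after: list[AABB]) -> float:
    """Average distance each object moved; lists must be index-aligned."""
    dists = [math.dist(centroid(b), centroid(a)) for b, a in zip(before, after)]
    return sum(dists) / len(dists)


# Hypothetical two-object scene: the second box is moved 0.5 m along x.
before = [((0, 0, 0), (1, 1, 1)), ((0.5, 0, 0), (1.5, 1, 1))]
after = [((0, 0, 0), (1, 1, 1)), ((1.0, 0, 0), (2.0, 1, 1))]

print(total_overlap(before), total_overlap(after))
print(mean_centroid_displacement(before, after))
```

Reporting both numbers together matters: overlap dropping to zero says collisions were resolved, while the displacement says how far the layout drifted to get there.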

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the presentation of the data processing pipeline.

Point-by-point responses
  1. Referee: [Abstract / Data Processing Pipeline] Abstract and pipeline description: the claim that physical simulation resolves collisions while preserving realistic layouts lacks any quantitative pre-/post-simulation metrics (overlap volume, centroid displacement, layout entropy, or small-object distribution statistics). Without these, it is impossible to verify that the dynamics-driven adjustments do not systematically alter placements in dense scenes (average 41.5 objects per region).

    Authors: We agree that quantitative pre-/post-simulation metrics would provide stronger empirical support for the claim that physical simulation resolves collisions without systematically distorting realistic layouts. The current manuscript describes the simulation-based collision resolution process and reports the final average of 41.5 objects per region, but does not include the suggested before-and-after statistics. In the revised manuscript we will add a dedicated analysis subsection with metrics including average overlap volume reduction, mean centroid displacement, layout entropy change, and small-object count distribution before versus after simulation. These will be computed on a representative subset of scenes and reported in the data processing pipeline section. revision: yes

  2. Referee: [Abstract] Abstract: no error metrics, fidelity scores, or before-after comparisons are reported for the real-to-sim replica creation step, leaving the central claim of simulatability and realism without empirical grounding.

    Authors: We acknowledge that the abstract and the high-level pipeline description do not report explicit error metrics or fidelity scores for the real-to-sim replica creation. The full manuscript details the replica generation procedure (including mesh cleaning, texture mapping, and physics-ready asset conversion), yet lacks the quantitative before-after comparisons suggested. We will revise the abstract to briefly reference the fidelity evaluation and add a new results subsection with geometric error (e.g., Chamfer distance), visual similarity scores, and collision-free success rates for the replica creation step on a held-out set of real scans. revision: yes
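Chamfer distance, named in the response above as a candidate geometric error, compares two point sets through nearest-neighbour distances in both directions. Conventions vary (squared or unsquared distances, sums or means), so the sketch below fixes one choice, the sum of the two one-sided means of unsquared distances, as an assumption rather than as what the authors would report. It is brute force, O(n·m), and meant for illustration only; the sample points are hypothetical.

```python
import math

Point = tuple[float, float, float]


def one_sided(src: list[Point], dst: list[Point]) -> float:
    """Mean distance from each point in src to its nearest neighbour in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: list[Point], b: list[Point]) -> float:
    """Symmetric Chamfer distance: sum of the two one-sided means."""
    return one_sided(a, b) + one_sided(b, a)


# Hypothetical point samples from a real scan and its simulatable replica.
scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
replica = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)]

print(chamfer_distance(scan, replica))
```

For meshes with many sampled points, a spatial index such as a k-d tree would replace the inner `min` loop, but the definition stays the same.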

Circularity Check

0 steps flagged

No circularity: dataset curation paper with no derivation chain

full rationale

This paper presents a new indoor scene dataset constructed by integrating three external scene sources, applying a data processing pipeline for real-to-sim conversion and collision resolution via physical simulation, and preserving small objects for realism. No equations, fitted parameters, predictions of derived quantities, uniqueness theorems, or ansatzes appear in the abstract or described contributions. The central claims concern empirical scale, diversity, and simulatability of the resulting 40k scenes rather than any reduction of outputs to inputs by construction or self-citation. The work is therefore self-contained as a data release with benchmark applications, warranting a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the effectiveness of an unspecified data processing pipeline that converts scans, adds interactivity, and runs physical simulations; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: Physical simulation can reliably detect and resolve object collisions in complex indoor layouts without altering scene semantics.
    Invoked in the abstract description of the pipeline that resolves object collisions.

pith-pipeline@v0.9.0 · 5820 in / 1241 out tokens · 33027 ms · 2026-05-18T17:08:30.413107+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Orchestrating Spatial Semantics via a Zone-Graph Paradigm for Intricate Indoor Scene Generation

    cs.RO 2026-05 unverdicted novelty 7.0

    ZoneMaestro introduces a zone-graph orchestration approach with a new dataset and alternating optimization strategy to generate intricate indoor scenes that maintain both semantic intent and geometric validity.

  2. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  3. Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Pair2Scene generates complex 3D scenes beyond training data by recursively applying a learned model of local support and functional object-pair relations inside hierarchies, using collision-aware rejection sampling fo...

  4. Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Pair2Scene generates complex 3D scenes beyond training data by training a network on local object-pair placement rules and applying them recursively with collision-aware sampling.
