PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis

Jinrang Jia; Yifeng Shi; Yijiang Hu; Zhenjia Li

arxiv: 2605.17916 · v2 · pith:7MSJECKAnew · submitted 2026-05-18 · 💻 cs.CV

PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis

Jinrang Jia , Zhenjia Li , Yijiang Hu , Yifeng Shi This is my paper

Pith reviewed 2026-05-20 11:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords panorama synthesiswhole-house generationspatial consistencygenerative model3D Gaussian SplattingVR tourautoregressive generationfloorplan to 3D

0 comments

The pith

PanoWorld generates consistent whole-house panoramas by decoupling shell-based geometry from visual memory cache.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PanoWorld introduces a generative spatial world model for creating coherent 360-degree panoramas across an entire house from a floorplan and style reference. It generates panoramas autoregressively, one node at a time, to match real VR navigation patterns. The system uses a simple 3D shell derived from the floorplan to guide global geometry and a dynamic cache of 3D Gaussian Splats to store and render visual memory. This separation allows high-detail 2D image quality to be kept while ensuring that layouts, materials, and connections remain consistent when moving between rooms. Such a method could make it practical to produce photorealistic virtual tours without the high cost of full 3D modeling or the inconsistencies of pure image generators.

Core claim

PanoWorld treats whole-house synthesis as autoregressive generation of node-based 360-degree panoramas. It uses a floorplan-derived 3D shell as a global geometric proxy and a dynamic 3D Gaussian Splatting cache as renderable spatial memory. A feed-forward panoramic LRM lifts panoramas into local 3DGS updates, Room-aware Group Attention suppresses cross-room interference, and topology-aware progressive caching fuses updates without full reconstruction. By decoupling shell-based geometry guidance from cache-rendered visual memory, the model preserves high-frequency 2D synthesis quality while improving cross-node layout and material consistency.

What carries the argument

Decoupling of a floorplan-derived 3D shell for geometry guidance from a dynamic 3D Gaussian Splatting cache for visual memory, combined with room-aware attention and progressive caching.

Load-bearing premise

The floorplan-derived 3D shell provides a sufficient global geometric proxy to maintain spatial coherence across multiple rooms without requiring full metric 3D reconstruction or additional depth sensors.

What would settle it

Generating panoramas for a complex multi-room floorplan and measuring if adjacent room views show matching layouts, door positions, and material properties when the viewpoint shifts.

Figures

Figures reproduced from arXiv: 2605.17916 by Jinrang Jia, Yifeng Shi, Yijiang Hu, Zhenjia Li.

**Figure 1.** Figure 1: Teaser of PanoWorld. Given a floorplan and a style reference, PanoWorld synthesizes a node-based whole-house panorama tour. A floorplan-derived geometric proxy anchors the global structure, while a dynamic 3DGS cache progressively expands along the navigation path and provides renderable spatial memory. The generated panoramas preserve photorealistic detail and cross-room consistency, e.g., doorway geometr… view at source ↗

**Figure 2.** Figure 2: Room-aware panoramic LRM. Grouped attention allows dense intra-room interaction and restricted cross-room communication only through topological boundaries. αk the opacity, and ck the color feature. Each panorama is encoded with an equirectangular image encoder, and the decoder maps fused tokens to Gaussian parameters in the global coordinate frame. 3.4.1. Panoramic Position Encoding We adapt the Plucker… view at source ↗

**Figure 3.** Figure 3: Progressive 3DGS caching. PanoWorld updates spatial memory through local topology-aware increments instead of fullhistory reconstruction. guidance rather than merely producing a plausible standalone reconstruction. 3.5. Topology-Aware Progressive 3DGS Caching A naive autoregressive system could rerun the LRM on all previously generated panoramas after every new node. This quickly becomes impractical: mem… view at source ↗

**Figure 4.** Figure 4: Cross-room memory filtering. Shell depth removes cache pixels that lie behind the first visible room surface and would otherwise introduce large erroneous textures. renders the current cache into the target pose and obtains a visual memory image Vt = RCt−1 (vt). The generator then predicts It = Φ(Gt, Vt, Ip(t)), (11) where Gt is the geometric proxy and Ip(t) is a nearby generated panorama. The nearby pano… view at source ↗

**Figure 5.** Figure 5: Qualitative comparison on whole-house panorama synthesis. We compare PanoWorld with representative adapted baselines on multi-node panorama generation [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: PanoWorld qualitative results under different target styles. PanoWorld preserves cross-room geometry and material identity while generating furnished panoramas under different target styles. compare against MVP [16], Adapt-Splat [43], and World- Mirror 2.0 [14] under 8-panorama and 12-panorama input [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Whole-house LRM reconstruction visualization. The comparison shows room-level panorama renderings for different reconstruction methods. settings. PanoWorld obtains the best reconstruction quality in both input settings, demonstrating its advantage in metric-scale multi-room whole-house reconstruction. The 12-panorama setting is slightly lower than the 8-panorama setting for PanoWorld because the addition… view at source ↗

**Figure 8.** Figure 8: Effect of panoramic position encoding. Removing circular panoramic encoding causes left-right inconsistency and seam artifacts in generated panoramas. 7. Visualization without Panoramic Position Encoding We include qualitative failure cases for the variant without Panoramic Position Encoding. Without circular horizontal encoding, the generator treats the left and right panorama boundaries as distant image … view at source ↗

**Figure 10.** Figure 10: Seedream-4.5-Edit adaptation pipeline. We convert the shell image into a line drawing and use a simplified prompt to improve spatial-structure following. protocol in two steps. First, we convert the shell image into a line drawing to emphasize the spatial structure. Second, we use a simplified fixed prompt for panoramic rendering: Please render this first panoramic line drawing into a panoramic rendering,… view at source ↗

**Figure 9.** Figure 9: Nano Banana 2 adaptation pipeline. We use Gemini3.1-flash-image-preview with a geometry-control image, a style reference, and a fixed descriptive prompt. space is anchored by a large rectangular woven rug in light grey with geometric pattern that defines the seating area while adding subtle texture. Ceiling lighting includes minimalist round fixtures that provide overall illumination without visual clut… view at source ↗

**Figure 11.** Figure 11: Training data visualization. Examples from 3D-FRONT and RealSee3D with panoramas, depth or shell-proxy images, and room-level BEV maps showing room partitions, doorways, sampled camera nodes, and local room groups [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗

**Figure 12.** Figure 12: Cross-node consistency evaluation regions. We manually select co-visible 1m × 1m regions on planar shell surfaces, densely sample 3D points, project them into multiple panorama nodes, and compute PSNR over corresponding pixels [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗

read the original abstract

Generating a consistent whole-house VR tour from a floorplan and style reference requires both photorealistic panoramas and cross-view spatial coherence. Pure 2D generators produce appealing single panoramas but re-imagine geometry and materials when the viewpoint changes, whereas monolithic 3D generation becomes expensive and loses fine texture at multi-room scale. We introduce PanoWorld, a generative spatial world model that treats whole-house synthesis as autoregressive generation of node-based 360-degree panoramas, matching the discrete navigation used by real VR tour products. PanoWorld uses a floorplan-derived 3D shell as a global geometric proxy and a dynamic 3D Gaussian Splatting cache as renderable spatial memory. A feed-forward panoramic LRM designed for metric-scale multi-room 360-degree inputs lifts generated panoramas into local 3DGS updates, while Room-aware Group Attention suppresses cross-room feature interference. A topology-aware progressive caching strategy fuses these local updates without repeatedly reconstructing the full history. By decoupling shell-based geometry guidance from cache-rendered visual memory, PanoWorld preserves high-frequency 2D synthesis quality while improving cross-node layout and material consistency. The project link is https://jjrcn.github.io/PanoWorld-project-home/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PanoWorld stitches together a floorplan 3D shell, dynamic 3DGS cache, panoramic LRM, and room-aware attention into an autoregressive pipeline for house-scale panoramas, but the abstract supplies no numbers to show the consistency actually improves.

read the letter

The main takeaway is that this paper describes a concrete system for generating consistent 360-degree panoramas across an entire house from a floorplan and style reference. It frames the task as autoregressive node generation rather than a single monolithic 3D model, which matches how real VR tours work. The architecture uses a floorplan-derived 3D shell as global geometry guidance, a dynamic 3D Gaussian Splatting cache as visual memory, a feed-forward panoramic LRM to lift each new panorama into local updates, Room-aware Group Attention to limit cross-room interference, and topology-aware progressive caching to fuse those updates without full history reconstruction. The decoupling of shell geometry from cache visuals is presented as the way to keep high-frequency 2D quality while gaining layout and material coherence across nodes. That combination of pieces is new enough to be worth noting, even if the individual components draw from recent 3DGS and LRM work. The paper also clearly states the practical problem: pure 2D generators drift on geometry and materials when the viewpoint moves, while full 3D generation loses texture detail at house scale. Those observations are fair and set up the design choices directly. The soft spot is the complete absence of quantitative evidence in the abstract—no error metrics, no baseline comparisons, no ablations on the attention module or caching strategy. Claims about improved consistency therefore rest on the design rationale alone. The stress-test point about the 3D shell is also reasonable to raise: standard floorplans are usually simple extruded polygons, so they can miss ceiling heights, sloped surfaces, or non-Manhattan walls. Any mismatch could propagate through the autoregressive steps and the cache. If the full methods section shows how the shell construction handles those cases, that would address it; otherwise the central assumption stays untested. This paper is for researchers working on generative models for indoor VR and scene synthesis. A reader who needs a practical architecture for scalable, consistent panorama generation would get concrete ideas from it. It deserves a serious referee because the problem is real and the proposed pipeline is a thoughtful assembly of existing tools aimed at a clear application. I would send it out for review, with the expectation that experiments and ablations get added or strengthened.

Referee Report

3 major / 2 minor

Summary. The paper introduces PanoWorld, a generative spatial world model for synthesizing consistent whole-house 360-degree panoramas from a floorplan and style reference. It frames the task as autoregressive node-based panorama generation, using a floorplan-derived 3D shell as global geometric proxy, a dynamic 3D Gaussian Splatting cache as renderable spatial memory, a feed-forward panoramic LRM for local metric-scale updates, Room-aware Group Attention to suppress cross-room interference, and a topology-aware progressive caching strategy to fuse updates without full history reconstruction. The central contribution is decoupling shell-based geometry guidance from cache-rendered visual memory to preserve high-frequency 2D synthesis quality while improving cross-node layout and material consistency.

Significance. If the central claims hold under quantitative validation, PanoWorld offers a promising scalable alternative to monolithic 3D generation or pure 2D synthesis for VR tour content, balancing photorealism with spatial coherence at house scale. The decoupling strategy and use of 3DGS cache with LRM updates represent a thoughtful architectural choice that could influence future work on multi-view generative consistency; the introduction of Room-aware Group Attention and topology-aware caching are specific technical contributions worth further exploration if supported by evidence.

major comments (3)

Abstract and Experiments: The manuscript describes the architecture and intended benefits but supplies no quantitative results, ablation studies, error metrics, or baseline comparisons. This absence is load-bearing for the central claim of improved cross-node consistency, as the benefits of the proposed decoupling and attention mechanisms remain unverified.
Methods (floorplan-derived 3D shell): The assumption that a standard floorplan-derived 3D shell (typically 2D polygons extruded to uniform height) provides a sufficient global geometric proxy for cross-room coherence is not accompanied by validation or discussion of limitations. This is critical because mismatches in non-Manhattan layouts, multi-height rooms, sloped roofs, or varying wall/door heights could propagate through the autoregressive generation and topology-aware cache, undermining the consistency gains.
Methods (Room-aware Group Attention and LRM updates): The claim that local LRM updates combined with Room-aware Group Attention maintain layout and material coherence without full metric reconstruction lacks concrete analysis of how these components interact with the proxy shell under partial mismatches, which is necessary to establish the decoupling's effectiveness.

minor comments (2)

Abstract: The project link is given, but the text would benefit from a brief statement on the scope of the evaluation (e.g., number of houses or room types tested) to set reader expectations.
Notation and terminology: The novel terms 'Room-aware Group Attention' and 'topology-aware progressive caching strategy' are introduced without immediate contrast to standard multi-head attention or FIFO caching; a short clarifying paragraph or diagram in the methods would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that quantitative validation and expanded methodological analysis are important to substantiate the claims. We address each major comment below and will incorporate the suggested revisions in the next version of the manuscript.

read point-by-point responses

Referee: Abstract and Experiments: The manuscript describes the architecture and intended benefits but supplies no quantitative results, ablation studies, error metrics, or baseline comparisons. This absence is load-bearing for the central claim of improved cross-node consistency, as the benefits of the proposed decoupling and attention mechanisms remain unverified.

Authors: We acknowledge that the current manuscript does not include quantitative results, ablations, or baseline comparisons, which limits the strength of the consistency claims. The initial submission prioritized describing the novel architecture and its components. In the revision we will add a dedicated Experiments section containing quantitative metrics for cross-node layout and material consistency, ablation studies on the Room-aware Group Attention and dynamic 3DGS cache, and comparisons against 2D autoregressive panorama generators as well as monolithic 3D scene synthesis baselines. These additions will directly support the central claims. revision: yes
Referee: Methods (floorplan-derived 3D shell): The assumption that a standard floorplan-derived 3D shell (typically 2D polygons extruded to uniform height) provides a sufficient global geometric proxy for cross-room coherence is not accompanied by validation or discussion of limitations. This is critical because mismatches in non-Manhattan layouts, multi-height rooms, sloped roofs, or varying wall/door heights could propagate through the autoregressive generation and topology-aware cache, undermining the consistency gains.

Authors: We agree that the limitations of the floorplan-derived 3D shell require explicit discussion. The shell is used as a lightweight global proxy rather than an exact reconstruction. In the revised Methods section we will add a subsection addressing its assumptions and potential shortcomings for non-Manhattan layouts, multi-height rooms, and sloped structures. We will also explain how the local LRM updates and progressive 3DGS caching limit error propagation by focusing on visual and local geometric corrections. Additional qualitative examples on diverse floorplan types will be included. revision: yes
Referee: Methods (Room-aware Group Attention and LRM updates): The claim that local LRM updates combined with Room-aware Group Attention maintain layout and material coherence without full metric reconstruction lacks concrete analysis of how these components interact with the proxy shell under partial mismatches, which is necessary to establish the decoupling's effectiveness.

Authors: We recognize the need for more detailed analysis of component interactions. In the revision we will expand the Methods section with additional diagrams and explanations of the information flow between the Room-aware Group Attention, LRM updates, and the 3D shell proxy. This will include discussion of how attention suppresses cross-room interference and how topology-aware caching integrates local updates under partial geometric mismatches. The expanded analysis will better demonstrate the robustness of the decoupling strategy. revision: yes

Circularity Check

0 steps flagged

No circularity: method integrates external components without self-referential reduction

full rationale

The paper describes an architectural pipeline that combines a floorplan-derived 3D shell (external geometric proxy), dynamic 3D Gaussian Splatting cache, feed-forward panoramic LRM, Room-aware Group Attention, and topology-aware caching. These are presented as standard integrations of existing techniques (3DGS, LRM) applied to autoregressive node generation, with no equations or steps where a claimed prediction or consistency metric is fitted to itself or defined circularly in terms of the target output. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming reduces the central claim to prior author work by construction. The derivation remains self-contained, relying on design choices whose validity can be assessed against external benchmarks rather than internal parameter fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The approach rests on domain assumptions about the utility of floorplan geometry and the effectiveness of the introduced attention and caching mechanisms; no free parameters or invented physical entities are specified in the abstract.

axioms (1)

domain assumption Floorplan-derived 3D shell serves as adequate global geometric proxy for multi-room coherence
Invoked when describing the global geometric proxy that guides generation.

invented entities (2)

Room-aware Group Attention no independent evidence
purpose: Suppress cross-room feature interference during panoramic LRM processing
New attention mechanism introduced to handle multi-room inputs.
topology-aware progressive caching strategy no independent evidence
purpose: Fuse local 3DGS updates without repeated full-history reconstruction
New caching approach for managing spatial memory across nodes.

pith-pipeline@v0.9.0 · 5758 in / 1422 out tokens · 55043 ms · 2026-05-20T11:31:33.605808+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PanoWorld uses a floorplan-derived 3D shell as a global geometric proxy and a dynamic 3D Gaussian Splatting cache as renderable spatial memory... Room-aware Group Attention suppresses cross-room feature interference.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery theorem unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

topology-aware progressive 3DGS caching... autoregressive generation of node-based 360-degree panoramas

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

[1]

pixelsplat: 3d gaus- sian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Li, Andrea Sun, Jonathon Luiten, Gordon Wetzstein, and Leonidas Smith. pixelsplat: 3d gaus- sian splats from image pairs for scalable generalizable 3d reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25828–25838, 2024. 3

work page 2024
[2]

Dreamhome-pano: Design-aware and conflict-free panoramic interior generation, 2026

Lulu Chen, Yijiang Hu, Yuanqing Liu, Yulong Li, and Yue Yang. Dreamhome-pano: Design-aware and conflict-free panoramic interior generation, 2026. 6, 7

work page 2026
[3]

Text2light: Zero-shot text-driven hdr panorama generation.ACM Trans- actions on Graphics (TOG), 41(6):1–16, 2022

Zhaoxi Chen, Guangcong Wang, and Ziwei Liu. Text2light: Zero-shot text-driven hdr panorama generation.ACM Trans- actions on Graphics (TOG), 41(6):1–16, 2022. 2

work page 2022
[4]

Graph-to-3d: End-to-end generation and ma- nipulation of 3d scenes using scene graphs

Helisa Dhamo, Fabian Bobrovsky, Nassir Navab, and Fed- erico Tombari. Graph-to-3d: End-to-end generation and ma- nipulation of 3d scenes using scene graphs. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 16352–16361, 2021. 3

work page 2021
[5]

Diffusion360: Seamless 360 degree panoramic image generation based on diffusion models, 2023

Mengyang Feng, Jinlin Liu, Miaomiao Cui, and Xuansong Xie. Diffusion360: Seamless 360 degree panoramic image generation based on diffusion models, 2023. 2

work page 2023
[6]

Scenescape: Text-driven consistent scene gener- ation

Rafail Fridman, Amit Carmeli, Tali Dekel, and Tomer Michaeli. Scenescape: Text-driven consistent scene gener- ation. InSIGGRAPH Asia 2023 Conference Papers, pages 1–10, 2023. 2

work page 2023
[7]

3d-front: 3d furnished rooms with layouts and semantics, 2021

Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Jiaming Wang Cao Li, Zengqi Xun, Chengyue Sun, Rongfei Jia, Bin- qiang Zhao, and Hao Zhang. 3d-front: 3d furnished rooms with layouts and semantics, 2021. 6, 3

work page 2021
[8]

Nano banana pro, 2025

Google. Nano banana pro, 2025. 6, 7, 1

work page 2025
[9]

Gs-lrm: Large reconstruc- tion model for 3d gaussian splatting

Zhenxing He, Zhisheng Wang, Yuhui Kuang, Min Zhao, Menglei Wang, Hao Chen, Fujun Luan, Thomas M ¨uller, Ji- aqi Wang, Chunhua Shen, et al. Gs-lrm: Large reconstruc- tion model for 3d gaussian splatting. InEuropean Confer- ence on Computer Vision (ECCV). Springer, 2024. 3

work page 2024
[10]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Pro- cessing, pages 7514–7528, Online and Punta Cana, Domini- can Republic, 2021. Association for Computational Linguis- tics. 7

work page 2021
[11]

Text2room: Extracting textured 3d meshes from 2d text-to-image models

Lukas H ¨ollein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting textured 3d meshes from 2d text-to-image models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7909–7920, 2023. 2

work page 2023
[12]

Lrm: Large reconstruction model for single image to 3d

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. InICLR, 2024. 2

work page 2024
[13]

Mixed diffusion for 3d indoor scene synthesis.arXiv preprint arXiv:2405.21066, 2024

Siyi Hu, Diego Martin Arroyo, Stephanie Debats, Fabian Manhardt, Luca Carlone, and Federico Tombari. Mixed diffusion for 3d indoor scene synthesis.arXiv preprint arXiv:2405.21066, 2024. 3

work page arXiv 2024
[14]

Hy-world 2.0: A multi-modal world model for reconstructing, generating, and simulating 3d worlds

Team HY-World. Hy-world 2.0: A multi-modal world model for reconstructing, generating, and simulating 3d worlds. arXiv preprint, 2026. 8

work page 2026
[15]

You only gaussian once: Controllable 3d gaussian splatting for ultra-densely sampled scenes, 2026

Jinrang Jia, Zhenjia Li, and Yifeng Shi. You only gaussian once: Controllable 3d gaussian splatting for ultra-densely sampled scenes, 2026. 2

work page 2026
[16]

Multi- view pyramid transformer: Look coarser to see broader

Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, and Eunbyung Park. Multi- view pyramid transformer: Look coarser to see broader. arXiv preprint arXiv:2512.07806, 2025. 4, 8

work page arXiv 2025
[17]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023. 2

work page 2023
[18]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023. 6

work page 2023
[19]

Panogen: Text-conditioned panoramic environment generation for vision-and-language navigation

Jialu Li and Mohit Bansal. Panogen: Text-conditioned panoramic environment generation for vision-and-language navigation. InNeurIPS, 2023. 2

work page 2023
[20]

Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model

Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. In ICLR, 2024. 3

work page 2024
[21]

Realsee3d: A large- scale multi-view rgb-d dataset of indoor scenes (version 1.0),

Linyuan Li, Yan Wu, Xi Li, Lingli Wang, Tong Rao, Jie Zhou, Cihui Pan, and Xinchen Hui. Realsee3d: A large- scale multi-view rgb-d dataset of indoor scenes (version 1.0),

work page
[22]

M-lrm: Multi-view large re- construction model.arXiv preprint arXiv:2406.07648, 2024

Mengfei Li, Xiaoxiao Long, Yixun Liang, Weiyu Li, Yuan Liu, Peng Li, Yatian Wang, Xingqun Qi, Wei Xue, Wenhan Luo, Qifeng Liu, and Yike Guo. M-lrm: Multi-view large re- construction model.arXiv preprint arXiv:2406.07648, 2024. 3

work page arXiv 2024
[23]

Cameras as relative positional encod- ing

Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encod- ing. InAdvances in Neural Information Processing Systems,

work page
[24]

Depth any panoramas: A foundation model for panoramic depth estimation

Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, and Lu Qi. Depth any panoramas: A foundation model for panoramic depth estimation. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2026. 6

work page 2026
[25]

Omniroam: World wandering via long-horizon panoramic video genera- tion.SIGGRAPH, 2026

Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold-Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, Zifan Shi, and Yiwei Hu. Omniroam: World wandering via long-horizon panoramic video genera- tion.SIGGRAPH, 2026. 7, 2

work page 2026
[26]

Hpsv3: Towards wide-spectrum human preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15086–15095, 2025. 7

work page 2025
[27]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InEuropean Conference on Computer Vision (ECCV), pages 405–421. Springer, 2020. 2

work page 2020
[28]

What makes for text to 360-degree panorama gener- ation with stable diffusion? InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

Jinhong Ni, Chang-Bin Zhang, Qiang Zhang, and Jing Zhang. What makes for text to 360-degree panorama gener- ation with stable diffusion? InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 2

work page 2025
[29]

Atiss: Autoregres- sive transformers for indoor scene synthesis

Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. Atiss: Autoregres- sive transformers for indoor scene synthesis. InNeurIPS,

work page
[30]

Pano2room: Novel view synthesis from a single indoor panorama

Guo Pu, Yiming Zhao, and Zhouhui Lian. Pano2room: Novel view synthesis from a single indoor panorama. In SIGGRAPH Asia 2024 Conference Papers, New York, NY , USA, 2024. Association for Computing Machinery. 7

work page 2024
[31]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 1

work page 2022
[32]

Housediffusion: Vector floorplan generation via a diffusion model

Amin Shabani, Sepideh Hosseini, and Yasutaka Furukawa. Housediffusion: Vector floorplan generation via a diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5466–5475, 2023. 3

work page 2023
[33]

Lgm: Large multi-view gaus- sian model for high-resolution 3d content creation

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaus- sian model for high-resolution 3d content creation. InECCV,

work page
[34]

Diffuscene: Denoising dif- fusion probabilistic models for generative indoor scene syn- thesis

Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Jus- tus Thies, and Matthias Nießner. Diffuscene: Denoising dif- fusion probabilistic models for generative indoor scene syn- thesis. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024
[35]

Mvdiffusion: Enabling holistic multi- view image generation with correspondence-aware diffusion

Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. Mvdiffusion: Enabling holistic multi- view image generation with correspondence-aware diffusion. InNeurIPS, 2023. 2

work page 2023
[36]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, et al. Seedream 4.0: Toward next- generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025. 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

TripoSR: Fast 3D Object Reconstruction from a Single Image

Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image.arXiv preprint arXiv:2403.02151, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Plan2scene: Convert- ing floorplans to 3d scenes

Madhawa Vidanapathirana, Qirui Wu, Yasutaka Furukawa, Angel X Chang, and Manolis Savva. Plan2scene: Convert- ing floorplans to 3d scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10733–10742, 2021. 2, 3

work page 2021
[39]

Moge-2: Accurate monocular geometry with metric scale and sharp details

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. InAdvances in Neural Information Processing Systems, 2025. 6

work page 2025
[40]

Sceneformer: Indoor scene generation with transformers

Xin-Yang Wang, Yu-An Yeh, Che-Wei Tang, Anton Rob- bins, and Yu-Chiang Frank Wang. Sceneformer: Indoor scene generation with transformers. In2021 International Conference on 3D Vision (3DV), pages 106–115. IEEE,

work page
[41]

Qwen-image technical report,

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, De- qing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...

work page
[42]

Pan- odiffusion: 360-degree panorama outpainting via diffusion

Tianhao Wu, Chuanxia Zheng, and Tat-Jen Cham. Pan- odiffusion: 360-degree panorama outpainting via diffusion. arXiv preprint arXiv:2307.03177, 2023. 2

work page arXiv 2023
[43]

Adapt- splat: Adapting vision foundation models for feed-forward 3d gaussian splatting, 2026

Mingwei Xing, Xinliang Wang, and Yifeng Shi. Adapt- splat: Adapting vision foundation models for feed-forward 3d gaussian splatting, 2026. 8

work page 2026
[44]

Taming stable diffusion for text to 360 degree panorama im- age generation

Cheng Zhang, Qianyi Wu, Camilo Cruz Gambardella, Xi- aoshui Huang, Dinh Phung, Wanli Ouyang, and Jianfei Cai. Taming stable diffusion for text to 360 degree panorama im- age generation. InCVPR, pages 6347–6357, 2024. 2

work page 2024
[45]

Pansplat: 4k panorama synthesis with feed-forward gaussian splatting

Cheng Zhang, Haofei Xu, Qianyi Wu, Camilo Cruz Gam- bardella, Dinh Phung, and Jianfei Cai. Pansplat: 4k panorama synthesis with feed-forward gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, 2025. 2

work page 2025
[46]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 586–595, 2018. 7 PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis Supplemen...

work page 2018
[47]

Without circular horizontal encoding, the generator treats the left and right panorama boundaries as distant image regions rather than adjacent rays

Visualization without Panoramic Position Encoding We include qualitative failure cases for the variant without Panoramic Position Encoding. Without circular horizontal encoding, the generator treats the left and right panorama boundaries as distant image regions rather than adjacent rays. This often produces inconsistent structures or textures across the ...

work page
[48]

Baseline Adaptation Details 8.1. Pano2room Pano2room shares a broadly similar pipeline with our method, relying on monocular depth estimation to ob- tain a point cloud that is subsequently converted into a mesh, followed by iterative refinement through a render- then-estimate loop to progressively extend the scene to more distant regions. However, Pano2ro...

work page
[49]

Figure 11 shows representative examples from 3D-FRONT

Training Data Visualization We further visualize the training data used by PanoWorld. Figure 11 shows representative examples from 3D-FRONT

work page
[50]

The BEV maps illustrate the floorplan topology, room par- titions, doorway connectivity, sampled camera nodes, and local room groups used to construct the training views

and RealSee3D [21], including rendered panoramas, depth or shell-proxy images, and room-level BEV maps. The BEV maps illustrate the floorplan topology, room par- titions, doorway connectivity, sampled camera nodes, and local room groups used to construct the training views. These visualizations clarify the difference between syn- thetic CAD-derived scenes...

work page
[51]

Additional Experimental Details We include implementation details that are useful for repro- ducing the evaluation but too specific for the main paper, including panorama resolution, node sampling rules, and overlap-mask construction for cross-view PSNR. 10.1. Cross-Node Consistency Evaluation We evaluate cross-node consistency on manually selected co-vis...

work page

[1] [1]

pixelsplat: 3d gaus- sian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Li, Andrea Sun, Jonathon Luiten, Gordon Wetzstein, and Leonidas Smith. pixelsplat: 3d gaus- sian splats from image pairs for scalable generalizable 3d reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25828–25838, 2024. 3

work page 2024

[2] [2]

Dreamhome-pano: Design-aware and conflict-free panoramic interior generation, 2026

Lulu Chen, Yijiang Hu, Yuanqing Liu, Yulong Li, and Yue Yang. Dreamhome-pano: Design-aware and conflict-free panoramic interior generation, 2026. 6, 7

work page 2026

[3] [3]

Text2light: Zero-shot text-driven hdr panorama generation.ACM Trans- actions on Graphics (TOG), 41(6):1–16, 2022

Zhaoxi Chen, Guangcong Wang, and Ziwei Liu. Text2light: Zero-shot text-driven hdr panorama generation.ACM Trans- actions on Graphics (TOG), 41(6):1–16, 2022. 2

work page 2022

[4] [4]

Graph-to-3d: End-to-end generation and ma- nipulation of 3d scenes using scene graphs

Helisa Dhamo, Fabian Bobrovsky, Nassir Navab, and Fed- erico Tombari. Graph-to-3d: End-to-end generation and ma- nipulation of 3d scenes using scene graphs. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 16352–16361, 2021. 3

work page 2021

[5] [5]

Diffusion360: Seamless 360 degree panoramic image generation based on diffusion models, 2023

Mengyang Feng, Jinlin Liu, Miaomiao Cui, and Xuansong Xie. Diffusion360: Seamless 360 degree panoramic image generation based on diffusion models, 2023. 2

work page 2023

[6] [6]

Scenescape: Text-driven consistent scene gener- ation

Rafail Fridman, Amit Carmeli, Tali Dekel, and Tomer Michaeli. Scenescape: Text-driven consistent scene gener- ation. InSIGGRAPH Asia 2023 Conference Papers, pages 1–10, 2023. 2

work page 2023

[7] [7]

3d-front: 3d furnished rooms with layouts and semantics, 2021

Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Jiaming Wang Cao Li, Zengqi Xun, Chengyue Sun, Rongfei Jia, Bin- qiang Zhao, and Hao Zhang. 3d-front: 3d furnished rooms with layouts and semantics, 2021. 6, 3

work page 2021

[8] [8]

Nano banana pro, 2025

Google. Nano banana pro, 2025. 6, 7, 1

work page 2025

[9] [9]

Gs-lrm: Large reconstruc- tion model for 3d gaussian splatting

Zhenxing He, Zhisheng Wang, Yuhui Kuang, Min Zhao, Menglei Wang, Hao Chen, Fujun Luan, Thomas M ¨uller, Ji- aqi Wang, Chunhua Shen, et al. Gs-lrm: Large reconstruc- tion model for 3d gaussian splatting. InEuropean Confer- ence on Computer Vision (ECCV). Springer, 2024. 3

work page 2024

[10] [10]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Pro- cessing, pages 7514–7528, Online and Punta Cana, Domini- can Republic, 2021. Association for Computational Linguis- tics. 7

work page 2021

[11] [11]

Text2room: Extracting textured 3d meshes from 2d text-to-image models

Lukas H ¨ollein, Ang Cao, Andrew Owens, Justin Johnson, and Matthias Nießner. Text2room: Extracting textured 3d meshes from 2d text-to-image models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7909–7920, 2023. 2

work page 2023

[12] [12]

Lrm: Large reconstruction model for single image to 3d

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. InICLR, 2024. 2

work page 2024

[13] [13]

Mixed diffusion for 3d indoor scene synthesis.arXiv preprint arXiv:2405.21066, 2024

Siyi Hu, Diego Martin Arroyo, Stephanie Debats, Fabian Manhardt, Luca Carlone, and Federico Tombari. Mixed diffusion for 3d indoor scene synthesis.arXiv preprint arXiv:2405.21066, 2024. 3

work page arXiv 2024

[14] [14]

Hy-world 2.0: A multi-modal world model for reconstructing, generating, and simulating 3d worlds

Team HY-World. Hy-world 2.0: A multi-modal world model for reconstructing, generating, and simulating 3d worlds. arXiv preprint, 2026. 8

work page 2026

[15] [15]

You only gaussian once: Controllable 3d gaussian splatting for ultra-densely sampled scenes, 2026

Jinrang Jia, Zhenjia Li, and Yifeng Shi. You only gaussian once: Controllable 3d gaussian splatting for ultra-densely sampled scenes, 2026. 2

work page 2026

[16] [16]

Multi- view pyramid transformer: Look coarser to see broader

Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, and Eunbyung Park. Multi- view pyramid transformer: Look coarser to see broader. arXiv preprint arXiv:2512.07806, 2025. 4, 8

work page arXiv 2025

[17] [17]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023. 2

work page 2023

[18] [18]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023. 6

work page 2023

[19] [19]

Panogen: Text-conditioned panoramic environment generation for vision-and-language navigation

Jialu Li and Mohit Bansal. Panogen: Text-conditioned panoramic environment generation for vision-and-language navigation. InNeurIPS, 2023. 2

work page 2023

[20] [20]

Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model

Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. In ICLR, 2024. 3

work page 2024

[21] [21]

Realsee3d: A large- scale multi-view rgb-d dataset of indoor scenes (version 1.0),

Linyuan Li, Yan Wu, Xi Li, Lingli Wang, Tong Rao, Jie Zhou, Cihui Pan, and Xinchen Hui. Realsee3d: A large- scale multi-view rgb-d dataset of indoor scenes (version 1.0),

work page

[22] [22]

M-lrm: Multi-view large re- construction model.arXiv preprint arXiv:2406.07648, 2024

Mengfei Li, Xiaoxiao Long, Yixun Liang, Weiyu Li, Yuan Liu, Peng Li, Yatian Wang, Xingqun Qi, Wei Xue, Wenhan Luo, Qifeng Liu, and Yike Guo. M-lrm: Multi-view large re- construction model.arXiv preprint arXiv:2406.07648, 2024. 3

work page arXiv 2024

[23] [23]

Cameras as relative positional encod- ing

Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, and Angjoo Kanazawa. Cameras as relative positional encod- ing. InAdvances in Neural Information Processing Systems,

work page

[24] [24]

Depth any panoramas: A foundation model for panoramic depth estimation

Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, and Lu Qi. Depth any panoramas: A foundation model for panoramic depth estimation. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2026. 6

work page 2026

[25] [25]

Omniroam: World wandering via long-horizon panoramic video genera- tion.SIGGRAPH, 2026

Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold-Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, Zifan Shi, and Yiwei Hu. Omniroam: World wandering via long-horizon panoramic video genera- tion.SIGGRAPH, 2026. 7, 2

work page 2026

[26] [26]

Hpsv3: Towards wide-spectrum human preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15086–15095, 2025. 7

work page 2025

[27] [27]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InEuropean Conference on Computer Vision (ECCV), pages 405–421. Springer, 2020. 2

work page 2020

[28] [28]

What makes for text to 360-degree panorama gener- ation with stable diffusion? InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

Jinhong Ni, Chang-Bin Zhang, Qiang Zhang, and Jing Zhang. What makes for text to 360-degree panorama gener- ation with stable diffusion? InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 2

work page 2025

[29] [29]

Atiss: Autoregres- sive transformers for indoor scene synthesis

Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler. Atiss: Autoregres- sive transformers for indoor scene synthesis. InNeurIPS,

work page

[30] [30]

Pano2room: Novel view synthesis from a single indoor panorama

Guo Pu, Yiming Zhao, and Zhouhui Lian. Pano2room: Novel view synthesis from a single indoor panorama. In SIGGRAPH Asia 2024 Conference Papers, New York, NY , USA, 2024. Association for Computing Machinery. 7

work page 2024

[31] [31]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 1

work page 2022

[32] [32]

Housediffusion: Vector floorplan generation via a diffusion model

Amin Shabani, Sepideh Hosseini, and Yasutaka Furukawa. Housediffusion: Vector floorplan generation via a diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5466–5475, 2023. 3

work page 2023

[33] [33]

Lgm: Large multi-view gaus- sian model for high-resolution 3d content creation

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaus- sian model for high-resolution 3d content creation. InECCV,

work page

[34] [34]

Diffuscene: Denoising dif- fusion probabilistic models for generative indoor scene syn- thesis

Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Jus- tus Thies, and Matthias Nießner. Diffuscene: Denoising dif- fusion probabilistic models for generative indoor scene syn- thesis. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024

[35] [35]

Mvdiffusion: Enabling holistic multi- view image generation with correspondence-aware diffusion

Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. Mvdiffusion: Enabling holistic multi- view image generation with correspondence-aware diffusion. InNeurIPS, 2023. 2

work page 2023

[36] [36]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, et al. Seedream 4.0: Toward next- generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025. 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

TripoSR: Fast 3D Object Reconstruction from a Single Image

Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image.arXiv preprint arXiv:2403.02151, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

Plan2scene: Convert- ing floorplans to 3d scenes

Madhawa Vidanapathirana, Qirui Wu, Yasutaka Furukawa, Angel X Chang, and Manolis Savva. Plan2scene: Convert- ing floorplans to 3d scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10733–10742, 2021. 2, 3

work page 2021

[39] [39]

Moge-2: Accurate monocular geometry with metric scale and sharp details

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. InAdvances in Neural Information Processing Systems, 2025. 6

work page 2025

[40] [40]

Sceneformer: Indoor scene generation with transformers

Xin-Yang Wang, Yu-An Yeh, Che-Wei Tang, Anton Rob- bins, and Yu-Chiang Frank Wang. Sceneformer: Indoor scene generation with transformers. In2021 International Conference on 3D Vision (3DV), pages 106–115. IEEE,

work page

[41] [41]

Qwen-image technical report,

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, De- qing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...

work page

[42] [42]

Pan- odiffusion: 360-degree panorama outpainting via diffusion

Tianhao Wu, Chuanxia Zheng, and Tat-Jen Cham. Pan- odiffusion: 360-degree panorama outpainting via diffusion. arXiv preprint arXiv:2307.03177, 2023. 2

work page arXiv 2023

[43] [43]

Adapt- splat: Adapting vision foundation models for feed-forward 3d gaussian splatting, 2026

Mingwei Xing, Xinliang Wang, and Yifeng Shi. Adapt- splat: Adapting vision foundation models for feed-forward 3d gaussian splatting, 2026. 8

work page 2026

[44] [44]

Taming stable diffusion for text to 360 degree panorama im- age generation

Cheng Zhang, Qianyi Wu, Camilo Cruz Gambardella, Xi- aoshui Huang, Dinh Phung, Wanli Ouyang, and Jianfei Cai. Taming stable diffusion for text to 360 degree panorama im- age generation. InCVPR, pages 6347–6357, 2024. 2

work page 2024

[45] [45]

Pansplat: 4k panorama synthesis with feed-forward gaussian splatting

Cheng Zhang, Haofei Xu, Qianyi Wu, Camilo Cruz Gam- bardella, Dinh Phung, and Jianfei Cai. Pansplat: 4k panorama synthesis with feed-forward gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, 2025. 2

work page 2025

[46] [46]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 586–595, 2018. 7 PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis Supplemen...

work page 2018

[47] [47]

Without circular horizontal encoding, the generator treats the left and right panorama boundaries as distant image regions rather than adjacent rays

Visualization without Panoramic Position Encoding We include qualitative failure cases for the variant without Panoramic Position Encoding. Without circular horizontal encoding, the generator treats the left and right panorama boundaries as distant image regions rather than adjacent rays. This often produces inconsistent structures or textures across the ...

work page

[48] [48]

Baseline Adaptation Details 8.1. Pano2room Pano2room shares a broadly similar pipeline with our method, relying on monocular depth estimation to ob- tain a point cloud that is subsequently converted into a mesh, followed by iterative refinement through a render- then-estimate loop to progressively extend the scene to more distant regions. However, Pano2ro...

work page

[49] [49]

Figure 11 shows representative examples from 3D-FRONT

Training Data Visualization We further visualize the training data used by PanoWorld. Figure 11 shows representative examples from 3D-FRONT

work page

[50] [50]

The BEV maps illustrate the floorplan topology, room par- titions, doorway connectivity, sampled camera nodes, and local room groups used to construct the training views

and RealSee3D [21], including rendered panoramas, depth or shell-proxy images, and room-level BEV maps. The BEV maps illustrate the floorplan topology, room par- titions, doorway connectivity, sampled camera nodes, and local room groups used to construct the training views. These visualizations clarify the difference between syn- thetic CAD-derived scenes...

work page

[51] [51]

Additional Experimental Details We include implementation details that are useful for repro- ducing the evaluation but too specific for the main paper, including panorama resolution, node sampling rules, and overlap-mask construction for cross-view PSNR. 10.1. Cross-Node Consistency Evaluation We evaluate cross-node consistency on manually selected co-vis...

work page