pith. machine review for the scientific record. sign in

arxiv: 2605.14135 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords panoramic completionnovel view synthesis3D Gaussian splattingattention steeringindoor scenesdiffusion modelssparse view reconstruction
0
0 comments X

The pith

PanoPlane achieves high-fidelity indoor novel view synthesis from sparse inputs by using plane-aware panoramic completion to supervise 3D Gaussian Splatting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PanoPlane shows that panoramic 360-degree image completion can reconstruct closed indoor room geometry more effectively than methods limited to narrow perspective views. The approach introduces a training-free technique called Layout Anchored Attention Steering that directs the diffusion model's focus to detected planar surfaces, allowing grounded extrapolation into unobserved areas instead of free-form generation. This produces consistent panoramic scenes that serve as training data for 3D Gaussian Splatting, resulting in accurate novel view rendering from only three to nine input images. Readers should care because it lowers the barrier for creating detailed 3D models of real rooms with minimal photography.

Core claim

The paper claims that by anchoring the attention maps in a pre-trained diffusion model to the planar layouts detected in observed panoramic regions, the generated completions maintain geometric consistency across the full 360 degrees, which in turn allows 3D Gaussian Splatting to synthesize high-quality novel views without any fine-tuning of the generative model, as validated by superior PSNR, SSIM, and LPIPS metrics on Replica, ScanNet++, and Matterport3D.

What carries the argument

Layout Anchored Attention Steering, which steers attention within the diffusion model's internal representation toward the scene's detected planar surfaces at inference time to enforce geometric consistency in extrapolations.

If this is right

  • Supports accurate novel-view synthesis with as few as three input views.
  • Delivers up to 17.8 percent PSNR gains over prior state-of-the-art methods on indoor benchmarks.
  • Requires no training or fine-tuning of the underlying diffusion model.
  • Enables reconstruction of closed room geometry through full panoramic completion.
  • Grounds the generative process in observed planar structures to minimize artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique might generalize to other generative tasks where geometric priors can guide inference-time behavior.
  • It points toward hybrid pipelines that combine classical geometry detection with modern diffusion models for 3D vision.
  • Future work could test the method on dynamic scenes or with moving objects to assess robustness.

Load-bearing premise

Steering attention in the diffusion model toward detected planar surfaces will reliably produce geometrically consistent extrapolations in unobserved regions without artifacts or inconsistencies.

What would settle it

A test scene where the synthesized novel views show systematic geometric errors, such as warped walls or mismatched floor planes that contradict the input measurements, would disprove the claim.

Figures

Figures reproduced from arXiv: 2605.14135 by Adil Qureshi, Dinesh Manocha, Dongki Jung, Jaehoon Choi.

Figure 1
Figure 1. Figure 1: PanoPlane takes sparse input views of an indoor scene, renders a panoramic image with [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of PanoPlane. From sparse input views, we recover layout planes, render partial panoramas with holes, assign hole tokens to planes via ray-plane intersection and boundary based assignment, and steer DiT attention to produce geometrically grounded completions. The completed panoramas are converted into cubemap supervision for refined 2DGS reconstruction. to model the 0 ◦/360◦ longitude seam as a co… view at source ↗
Figure 3
Figure 3. Figure 3: Panoramic Completion. (a) A partially observed equirectangular panorama rendered from the initial 2DGS, with unobserved regions shown in black. (b) Completion with our layout-anchored steering: unobserved regions are filled as geometrically consistent extensions of the surrounding walls, floor, and ceiling. (c) Naive panoramic inpainting without steering produces hallucinated surfaces that are geometricall… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison on Matterport3D, Replica and ScanNet++ Each row shows a challenging scene with large unobserved regions. Perspective-based methods (3DGS, SparseGS, FSGS, InstantSplat) artifacts in wide baseline views. Generative-prior methods partially recover scenes but still struggle with color bleeding and over-smooth textures. PanoPlane (ours) recovers geometrically consistent surfaces across al… view at source ↗
Figure 5
Figure 5. Figure 5: Analysis PSNR vs steering strength λ on Replica. Performance peaks at λ = 0.4; stronger steering over￾constrains the model, produc￾ing flat textureless comple￾tions. Steering Strength [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Failure case: When the rendered 2DGS panorama contains significant artifacts, DiT360 fails to produce coherent inpainting and layout-anchored steering worsens the result by reinforcing incorrect plane assignments. While PanoPlane requires no per-scene fine-tuning or training of the diffusion model, it inherits the limitations of the underlying foundation model. DiT360 [9] is trained on large-scale panorami… view at source ↗
Figure 7
Figure 7. Figure 7: Additional Qualitative Results on Replica, Matterport3D and Scannet++ [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Additional qualitative results of layout-anchored steering: Without steering (Naive), DiT360 inpaints unobserved regions according to its learned distribution, often hallucinating incorrect geometry such as receding corridors or warped surfaces. With our layout-anchored steering (Ours), the same regions are completed as flat extensions of the detected walls, ceilings, and floors, producing geometrically co… view at source ↗
read the original abstract

We present PanoPlane, an approach for high-fidelity sparse-view indoor novel view synthesis that reconstructs closed room geometry via panoramic scene completion. Unlike perspective-based methods that generate training views from limited fields of view, PanoPlane leverages $360^{\circ}$ panoramic completion to condition the generative process on the full spatial layout. We propose Layout Anchored Attention Steering, a training-free mechanism that steers attention within the diffusion model's internal representation toward scene's detected planar surfaces at inference time. By directing each unobserved region's attention toward geometrically consistent observed content, our method replaces unconstrained hallucination with grounded surface extrapolation. The resulting panoramic completions provide supervision for 3D Gaussian Splatting, enabling accurate novel-view synthesis across unobserved regions from as few as three input views. Experiments on Replica, ScanNet++, and Matterport3D demonstrate state-of-the-art novel view synthesis quality across 3, 6, and 9 input views, achieving up to $+17.8\%$ improvement in PSNR over the current state-of-the-art baseline without any training or fine-tuning of the diffusion model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PanoPlane for sparse-view indoor novel view synthesis via 360° panoramic scene completion. It proposes Layout Anchored Attention Steering, a training-free mechanism that directs a diffusion model's attention toward detected planar surfaces at inference time to produce grounded extrapolations, which then supervise 3D Gaussian Splatting. Experiments on Replica, ScanNet++, and Matterport3D claim state-of-the-art results for 3/6/9 input views, with up to +17.8% PSNR gains over baselines without diffusion fine-tuning.

Significance. If the geometric consistency of the steered completions holds, the approach would be significant for training-free sparse-view indoor reconstruction: it leverages full panoramic layout context and existing diffusion models to reduce hallucination in unobserved regions, offering a practical way to improve 3DGS supervision from very few views.

major comments (2)
  1. [Experiments] Experiments section: only novel-view PSNR/SSIM/LPIPS are reported; no direct metrics (multi-view depth consistency, plane-normal error, or reprojection error on completed regions) are provided to verify that Layout Anchored Attention Steering produces 3D-geometrically consistent extrapolations rather than plausible 2D textures.
  2. [Method] Method section (Layout Anchored Attention Steering): the mechanism is described at a high level but lacks precise implementation details on attention-layer selection, steering weights, and integration with plane detection, which are load-bearing for reproducing the claimed geometric grounding.
minor comments (1)
  1. [Abstract] Abstract: the '+17.8% PSNR' claim does not specify the exact baseline method or input-view count at which the maximum gain occurs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of geometric consistency and reproducibility.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: only novel-view PSNR/SSIM/LPIPS are reported; no direct metrics (multi-view depth consistency, plane-normal error, or reprojection error on completed regions) are provided to verify that Layout Anchored Attention Steering produces 3D-geometrically consistent extrapolations rather than plausible 2D textures.

    Authors: We agree that direct geometric metrics on the completed regions would provide stronger validation of the 3D consistency achieved by Layout Anchored Attention Steering. In the revised manuscript, we will add quantitative evaluations including multi-view depth consistency (measured via reprojection error across held-out views) and plane-normal error on extrapolated surfaces, using ground-truth geometry from Replica and ScanNet++ where available. These will be reported alongside the existing NVS metrics to directly address the concern. revision: yes

  2. Referee: [Method] Method section (Layout Anchored Attention Steering): the mechanism is described at a high level but lacks precise implementation details on attention-layer selection, steering weights, and integration with plane detection, which are load-bearing for reproducing the claimed geometric grounding.

    Authors: We acknowledge that additional implementation specifics are needed for full reproducibility. The revised Method section will detail the exact attention layers targeted (specifically the mid-level cross-attention blocks in the diffusion U-Net), the steering weight schedule (fixed at 1.2 for planar regions with linear decay), and the precise integration pipeline with the plane detection module, including pseudocode and hyperparameter values used in all experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: method applies external pre-trained diffusion model with new inference-time steering and validates on independent benchmarks

full rationale

The derivation chain begins with standard plane detection on sparse input panoramas, applies a training-free attention-steering rule (Layout Anchored Attention Steering) inside an off-the-shelf diffusion model, produces completed panoramas, and feeds those images as supervision to 3D Gaussian Splatting. All quantitative claims are empirical comparisons (PSNR/SSIM/LPIPS on Replica, ScanNet++, Matterport3D) against external baselines; no parameter is fitted to the target metric and then re-labeled as a prediction, no uniqueness theorem is imported from the authors' prior work, and no equation reduces to a self-definition. The approach therefore remains self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach builds on standard diffusion models and plane detection without detailing additional assumptions.

pith-pipeline@v0.9.0 · 5506 in / 1014 out tokens · 35818 ms · 2026-05-15T05:00:27.260109+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 5 internal anchors

  1. [1]

    Self-rectifying diffusion sampling with perturbed-attention guidance

    Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, and Seungryong Kim. Self-rectifying diffusion sampling with perturbed-attention guidance. InEuropean Conference on Computer Vision (ECCV), pages 1–17. Springer, 2024. doi: 10.1007/978-3-031-73464-9_1

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  3. [3]

    Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570, 2023

  4. [4]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017

  5. [5]

    Quantifying and alleviating co-adaptation in sparse-view 3d gaussian splatting

    Kangjie Chen, Yingji Zhong, Zhihao Li, Jiaqi Lin, Youyu Chen, Minghan Qin, and Haoqian Wang. Quantifying and alleviating co-adaptation in sparse-view 3d gaussian splatting. In Advances in Neural Information Processing Systems, 2025

  6. [6]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. InAdvances in Neural Information Processing Systems, volume 35, pages 16344–16359, 2022

  7. [7]

    MASt3r-sfm: a fully-integrated solution for unconstrained structure-from- motion

    Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3r-sfm: a fully-integrated solution for unconstrained structure-from- motion. InInternational Conference on 3D Vision 2025, 2025. URL https://openreview. net/forum?id=5uw1GRBFoT

  8. [8]

    arXiv preprint arXiv:2403.20309 (2024)

    Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, et al. Instantsplat: Sparse-view gaussian splatting in seconds.arXiv preprint arXiv:2403.20309, 2024

  9. [9]

    Dit360: High-fidelity panoramic image generation via hybrid training.arXiv preprint arXiv:2510.11712, 2025

    Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, and Lu Qi. Dit360: High-fidelity panoramic image generation via hybrid training.arXiv preprint arXiv:2510.11712, 2025

  10. [10]

    Fischler and Robert C

    Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, 1981. doi: 10.1145/358669.358692

  11. [11]

    Seitz, and Richard Szeliski

    Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Manhattan-world stereo. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1422–1429, 2009. doi: 10.1109/CVPRW.2009.5206867

  12. [12]

    Matcha gaussians: Atlas of charts for high-quality geometry and photorealism from sparse views

    Antoine Guedon, Tomoki Ichikawa, Kohei Yamashita, and Ko Nishino. Matcha gaussians: Atlas of charts for high-quality geometry and photorealism from sparse views. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6001–6011, June 2025

  13. [13]

    Igl-nav: Incremental 3d gaussian localization for image-goal navigation

    Wenxuan Guo, Xiuwei Xu, Hang Yin, Ziwei Wang, Jianjiang Feng, Jie Zhou, and Jiwen Lu. Igl-nav: Incremental 3d gaussian localization for image-goal navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6808–6817, October 2025. 10

  14. [14]

    Prompt-to-prompt image editing with cross attention control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. InInternational Conference on Learning Representations (ICLR), 2023

  15. [15]

    Local path opti- mization in the latent space using learned distance gradient

    Kohei Honda, Takeshi Ishita, Yasuhiro Yoshimura, and Ryo Yonetani. Gsplatvnm: Point- of-view synthesis for visual navigation models using gaussian splatting. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 20869–20876, 2025. doi: 10.1109/IROS60139.2025.11246997

  16. [16]

    Improving sample quality of diffusion models using self-attention guidance

    Susung Hong, Gyuseong Lee, Wooseok Jang, and Seungryong Kim. Improving sample quality of diffusion models using self-attention guidance. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7462–7471, 2023

  17. [17]

    2d gaussian splatting for geometrically accurate radiance fields

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. InSIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024. doi: 10.1145/3641519.3657428

  18. [18]

    GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

    Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, and Yu-Lun Liu. Gamo: Geometry-aware multi-view diffusion outpainting for sparse-view 3d reconstruction.arXiv preprint arXiv:2512.25073, 2025

  19. [19]

    Hy-world 2.0: A multi-modal world model for reconstructing, generating, and simulating 3d worlds.arXiv preprint, 2026

    Team HY-World. Hy-world 2.0: A multi-modal world model for reconstructing, generating, and simulating 3d worlds.arXiv preprint, 2026

  20. [20]

    Comapgs: Covisibility map-based gaussian splatting for sparse novel view synthesis

    Youngkyoon Jang and Eduardo Pérez-Pellitero. Comapgs: Covisibility map-based gaussian splatting for sparse novel view synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26779–26788, June 2025

  21. [21]

    Ac- tiveGS: Active Scene Reconstruction Using Gaussian Splatting.IEEE Robotics and Automation Letters, 10(5):4866–4873, 2025

    Liren Jin, Xingguang Zhong, Yue Pan, Jens Behley, Cyrill Stachniss, and Marija Popovi´c. Ac- tiveGS: Active Scene Reconstruction Using Gaussian Splatting.IEEE Robotics and Automation Letters, 10(5):4866–4873, 2025. doi: 10.1109/LRA.2025.3555149

  22. [22]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

  23. [23]

    2023 , url =

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023. doi: 10.1109/ICCV51070.2023.00371

  24. [24]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

  25. [25]

    Lee, Martial Hebert, and Takeo Kanade

    David C. Lee, Martial Hebert, and Takeo Kanade. Geometric reasoning for single image structure recovery. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2136–2143, 2009. doi: 10.1109/CVPR.2009.5206872

  26. [26]

    Gaussnav: Gaussian splatting for visual navigation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5): 4108–4121, 2025

    Xiaohan Lei, Min Wang, Wengang Zhou, and Houqiang Li. Gaussnav: Gaussian splatting for visual navigation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5): 4108–4121, 2025. doi: 10.1109/TPAMI.2025.3538496

  27. [27]

    Dual-channel attention guidance for training-free image editing control in diffusion transformers, 2026

    Guandong Li and Mengxia Ye. Dual-channel attention guidance for training-free image editing control in diffusion transformers, 2026. URLhttps://arxiv.org/abs/2602.18022

  28. [28]

    Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization

    Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20775–20785, 2024. 11

  29. [29]

    Freecontrol: Efficient, training-free structural control via one-step attention extraction

    Jiang Lin, Xinyu Chen, Song Wu, Zhiqiu Zhang, Jizhi Zhang, Ye Wang, Qiang Tang, Qian Wang, Jian Yang, and Zili Yi. Freecontrol: Efficient, training-free structural control via one-step attention extraction. InAdvances in Neural Information Processing Systems, 2025

  30. [30]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations (ICLR), 2023. URLhttps://openreview.net/forum?id=PqvMRDCJT9t

  31. [31]

    PlaneRCNN: 3d plane detection and reconstruction from a single image

    Chen Liu, Kihwan Kim, Jinwei Gu, Yasutaka Furukawa, and Jan Kautz. PlaneRCNN: 3d plane detection and reconstruction from a single image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4450–4459, 2019. doi: 10.1109/CVPR.2019.00458

  32. [32]

    Reconx: Reconstruct any scene from sparse views with video diffusion model

    Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model. IEEE Transactions on Image Processing, 35:2305–2319, 2026. doi: 10.1109/TIP.2026.3666733

  33. [33]

    Deceptive- nerf/3dgs: Diffusion-generated pseudo-observations for high-quality sparse-view reconstruction

    Xinhang Liu, Jiaben Chen, Shiu-Hong Kao, Yu-Wing Tai, and Chi-Keung Tang. Deceptive- nerf/3dgs: Diffusion-generated pseudo-observations for high-quality sparse-view reconstruction. InComputer Vision – ECCV 2024, pages 337–355. Springer Nature Switzerland, 2024. doi: 10.1007/978-3-031-72640-8_19

  34. [34]

    Omniroam: World wandering via long-horizon panoramic video generation.SIGGRAPH, 2026

    Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold- Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, Zifan Shi, and Yiwei Hu. Omniroam: World wandering via long-horizon panoramic video generation.SIGGRAPH, 2026

  35. [35]

    You see it, you got it: Learning 3d creation on pose-free videos at scale

    Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, and Xinlong Wang. You see it, you got it: Learning 3d creation on pose-free videos at scale. In IEEE/CVF conference on computer vision and pattern recognition, 2025

  36. [36]

    Splatfields: Neural gaussian splats for sparse 3d and 4d reconstruction

    Marko Mihajlovic, Sergey Prokudin, Siyu Tang, Robert Maier, Federica Bogo, Tony Tung, and Edmond Boyer. Splatfields: Neural gaussian splats for sparse 3d and 4d reconstruction. InCom- puter Vision – ECCV 2024, pages 313–332. Springer, 2024. doi: 10.1007/978-3-031-72627-9_ 18

  37. [37]

    G4splat: Geometry-guided gaussian splatting with generative prior

    Junfeng Ni, Yixin Chen, Zhifei Yang, Yu Liu, Ruijie Lu, Song-Chun Zhu, and Siyuan Huang. G4splat: Geometry-guided gaussian splatting with generative prior. InThe Fourteenth Interna- tional Conference on Learning Representations, 2026

  38. [38]

    MViTv2: Improved Multiscale Vision Transformers for Classification and Detection , isbn =

    Michael Niemeyer, Jonathan T. Barron, Ben Mildenhall, Mehdi S. M. Sajjadi, Andreas Geiger, and Noha Radwan. RegNeRF: Regularizing neural radiance fields for view synthesis from sparse inputs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5470–5480, 2022. doi: 10.1109/CVPR52688.2022.00540

  39. [39]

    Ri3d: Few-shot gaussian splatting with repair and inpainting diffusion priors

    Avinash Paliwal, Xilong Zhou, Wei Ye, Jinhui Xiong, Rakesh Ranjan, and Nima Khademi Kalantari. Ri3d: Few-shot gaussian splatting with repair and inpainting diffusion priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  40. [40]

    Dropgaussian: Structural regularization for sparse-view gaussian splatting

    Hyunwoo Park, Gun Ryu, and Wonjun Kim. Dropgaussian: Structural regularization for sparse-view gaussian splatting. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21600–21609, 2025

  41. [41]

    Ad-gs: Alternating densification for sparse-input 3d gaussian splatting

    Gurutva Patle, Nilay Girgaonkar, Nagabhushan Somraj, and Rajiv Soundararajan. Ad-gs: Alternating densification for sparse-input 3d gaussian splatting. InProceedings of the SIG- GRAPH Asia 2025 Conference Papers. Association for Computing Machinery, 2025. doi: 10.1145/3757377.3763993

  42. [42]

    Semantic image inversion and editing using rectified stochastic differential equations

    L Rout, Y Chen, N Ruiz, C Caramanis, S Shakkottai, and W Chu. Semantic image inversion and editing using rectified stochastic differential equations. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=Hu0FSOSEyS. 12

  43. [43]

    D 2GS: Depth-and-density guided gaussian splatting for stable and accurate sparse-view reconstruction

    Meixi Song, Xin Lin, Dizhe Zhang, Haodong Li, Xiangtai Li, Bo Du, and Lu Qi. D 2GS: Depth-and-density guided gaussian splatting for stable and accurate sparse-view reconstruction. InInternational Conference on Learning Representations, 2026

  44. [44]

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces.arXiv preprint arXiv:1906.05797, 2019

  45. [45]

    HorizonNet: Learning room layout with 1d representation and pano stretch data augmentation

    Cheng Sun, Chi-Wei Hsiao, Min Sun, and Hwann-Tzong Chen. HorizonNet: Learning room layout with 1d representation and pano stretch data augmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1047–1056,

  46. [46]

    doi: 10.1109/CVPR.2019.00114

  47. [47]

    Attentive eraser: Unleashing dif- fusion model’s object removal potential via self-attention redirection guidance

    Wenhao Sun, Xue-Mei Dong, Benlei Cui, and Jingqun Tang. Attentive eraser: Unleashing dif- fusion model’s object removal potential via self-attention redirection guidance. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 20734–20742, 2025. URL https://ojs.aaai.org/index.php/AAAI/article/view/34285

  48. [48]

    Murat Tekalp, and Federico Tombari

    Atakan Topalo˘glu, Kunyi Li, Michael Niemeyer, Nassir Navab, A. Murat Tekalp, and Federico Tombari. Oraclegs: Grounding generative priors for sparse-view gaussian splatting. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 77–87, March 2026

  49. [49]

    Plug-and-play diffusion features for text-driven image-to-image translation

    Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1921–1930, 2023

  50. [50]

    Dn-splatter: Depth and normal priors for gaussian splatting and meshing

    Matias Turkulainen, Xuqian Ren, Iaroslav Melekhov, Otto Seiskari, Esa Rahtu, and Juho Kannala. Dn-splatter: Depth and normal priors for gaussian splatting and meshing. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2421–2431, 2025. doi: 10.1109/W ACV61041.2025.00241

  51. [51]

    2023 , url =

    Guangcong Wang, Zhaoxi Chen, Chen Change Loy, and Ziwei Liu. SparseNeRF: Distilling depth ranking for few-shot novel view synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9065–9076, October 2023. doi: 10.1109/ ICCV51070.2023.00832

  52. [52]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024. doi: 10.1109/CVPR52733. 2024.01956. URL https://openaccess.thecvf.com/content/CVPR2024/html/Wang_ DUSt3R_Geometric_3D_Vis...

  53. [53]

    π3: Permutation-equivariant visual geometry learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-equivariant visual geometry learning. InInternational Conference on Learning Representations (ICLR), 2026

  54. [54]

    InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4): 600–612, 2004. doi: 10.1109/TIP.2003.819861

  55. [55]

    PanoDiffusion: 360-degree panorama outpainting via diffusion

    Tianhao Wu, Chuanxia Zheng, and Tat-Jen Cham. PanoDiffusion: 360-degree panorama outpainting via diffusion. InThe Twelfth International Conference on Learning Representations, 2024

  56. [56]

    Fleet, Marcus A

    Ziyi Wu, Daniel Watson, Andrea Tagliasacchi, David J. Fleet, Marcus A. Brubaker, and Saurabh Saxena. 360Anything: Geometry-free lifting of images and videos to 360°.arXiv, 2026

  57. [57]

    Sparse view synthesis using 3d gaussian splatting.arXiv preprint arXiv:2312.00206, 2025

    Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, and Achuta Kadambi. Sparse view synthesis using 3d gaussian splatting.arXiv preprint arXiv:2312.00206, 2025. doi: 10.48550/arXiv.2312.00206. 13

  58. [58]

    Scannet++: A high- fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high- fidelity dataset of 3d indoor scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  59. [59]

    Fewviewgs: Gaussian splatting with few view matching and multi-stage training

    Ruihong Yin, Vladimir Yugay, Yue Li, Sezer Karaoglu, and Theo Gevers. Fewviewgs: Gaussian splatting with few view matching and multi-stage training. InAdvances in Neural Information Processing Systems, volume 37, 2024

  60. [60]

    Fregs: 3d gaussian splatting with progressive frequency regularization

    Jiahui Zhang, Fangneng Zhan, Muyu Xu, Shijian Lu, and Eric Xing. Fregs: 3d gaussian splatting with progressive frequency regularization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21424–21433, 2024

  61. [61]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023

  62. [62]

    Efros, Eli Shechtman, and Oliver Wang

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018. doi: 10.1109/CVPR.2018.00068

  63. [63]

    3dgsnav: Enhancing vision-language model reasoning for object navigation via active 3d gaussian splatting, 2026

    Wancai Zheng, Hao Chen, Xianlong Lu, Linlin Ou, and Xinyi Yu. 3dgsnav: Enhancing vision-language model reasoning for object navigation via active 3d gaussian splatting, 2026

  64. [64]

    Nexusgs: Sparse view synthesis with epipolar depth priors in 3d gaussian splatting

    Yulong Zheng, Zicheng Jiang, Shengfeng He, Yandu Sun, Junyu Dong, Huaidong Zhang, and Yong Du. Nexusgs: Sparse view synthesis with epipolar depth priors in 3d gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  65. [65]

    Taming video diffusion prior with scene-grounding guidance for 3d gaussian splatting from sparse inputs

    Yingji Zhong, Zhihao Li, Dave Zhenyu Chen, Lanqing Hong, and Dan Xu. Taming video diffusion prior with scene-grounding guidance for 3d gaussian splatting from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6133–6143, 2025

  66. [66]

    Learning to reconstruct 3d manhattan wireframes from a single image

    Yichao Zhou, Haozhi Qi, Yuexiang Zhai, Qi Sun, Zhili Chen, Li-Yi Wei, and Yi Ma. Learning to reconstruct 3d manhattan wireframes from a single image. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7698–7707, 2019. doi: 10.1109/ ICCV .2019.00779

  67. [67]

    Fsgs: Real-time few-shot view synthesis using gaussian splatting

    Zehao Zhu, Zhiwen Fan, Yifan Jiang, and Zhangyang Wang. Fsgs: Real-time few-shot view synthesis using gaussian splatting. InComputer Vision – ECCV 2024, pages 145–163. Springer,

  68. [68]

    14 Appendix In this appendix, we provide additional discussion, experimental results, and technical details: implementation details (Sec

    doi: 10.1007/978-3-031-72933-1_9. 14 Appendix In this appendix, we provide additional discussion, experimental results, and technical details: implementation details (Sec. A), Failure cases (Sec. B), and additional qualitative and quantitative results (Sec. D, C). A Implementation Details General ConfigurationAll experiments are conducted on a NVIDIA A600...

  69. [69]

    What does this region look like?

  70. [70]

    Is this region located on a wall, floor, ceiling or some other surface?

  71. [71]

    wall”, “floor

    Give your final answer as a single word: wall, floor, ceiling, bed, table, shelf, cabinet, window, door, or other. The model generates up to 200 tokens of reasoning. We parse the response by finding thelast occurrence of any label keyword; if the final keyword is “wall”, “floor”, or “ceiling”, the plane is labeledlayout, otherwisenon-layout. Using the las...