arxiv: 2606.13376 · v2 · pith:HQJS6JBUnew · submitted 2026-06-11 · 💻 cs.CV

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Pith reviewed 2026-06-27 07:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords real-time video world modeling · panoramic Gaussian scaffold · single-image scene creation · 3D Gaussian representation · diffusion model distillation · interactive navigation · topology-aware panorama expansion

The pith

MoVerse turns one narrow-view image into a real-time navigable video world at 8 FPS on consumer hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a pipeline that builds an interactively explorable scene from a single limited photograph. It first completes the unobserved surroundings into a gravity-aligned 360-degree panorama using topology-aware diffusion. It then converts that panorama into a persistent 3D Gaussian scaffold through geometry-aware residual prediction. Finally, a distilled causal video renderer converts scaffold views into coherent video frames along any user-specified trajectory. The separation of explicit 3D construction from generative rendering yields both long-range geometric consistency and real-time performance on a single RTX 4090 GPU.

Core claim

MoVerse separates world construction from observation rendering in three steps: it expands the narrow input into a gravity-aligned 360° panorama with topology-aware diffusion, lifts the panorama into a dense 3D Gaussian scaffold via panoramic geometry-aware residual prediction, and translates scaffold renderings into photorealistic video through a Gaussian-conditioned renderer. That renderer is distilled from a bidirectional diffusion teacher into a causal autoregressive student for bounded-latency streaming.

What carries the argument

Panoramic Gaussian scaffold: the dense, directly renderable 3D spatial memory created from the completed panorama that supplies consistent geometry for subsequent video rendering.
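
To make "lifting a panorama into 3D" concrete, here is a minimal sketch of the standard equirectangular unprojection that such a scaffold could start from. It is not the paper's panoramic geometry-aware residual prediction, which is not specified on this page. The axis convention (y up, longitude 0 along +z), the function names, and the availability of a per-pixel depth map are all assumptions made for illustration.

```python
import math

def equirect_pixel_to_ray(u, v, width, height):
    """Map the centre of pixel (u, v) in an equirectangular image to a unit ray.

    Assumed convention (not from the paper): y is up, longitude 0 points
    along +z, and image row 0 is nearest the zenith.
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi    # longitude in [-pi, pi)
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi   # latitude, +pi/2 at top
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)

def lift_panorama(depth):
    """Turn a per-pixel depth map (list of rows) into a list of 3D points.

    Each point could serve as the initial centre of one Gaussian.
    """
    height, width = len(depth), len(depth[0])
    points = []
    for v in range(height):
        for u in range(width):
            dx, dy, dz = equirect_pixel_to_ray(u, v, width, height)
            d = depth[v][u]
            points.append((d * dx, d * dy, d * dz))
    return points
```

Under these assumptions every panorama pixel yields one candidate Gaussian centre, which is one way to read "dense" here; a learned stage would then have to correct positions and fill in the remaining Gaussian attributes.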

If this is right

  • The explicit 3D scaffold supplies long-range consistency that pure generative video models lack.
  • User-specified camera trajectories can be followed controllably while maintaining temporal coherence.
  • Distillation from bidirectional teacher to causal student enables 8 FPS streaming on a single consumer GPU.
  • The pipeline combines the controllability of explicit 3D representations with the perceptual quality of generative video models.
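
The "causal student" and "bounded-latency streaming" points above can be illustrated with a toy loop. This is a sketch under stated assumptions, not the paper's renderer: `student_step` is a hypothetical stand-in for the distilled model, and the fixed-length context window is one assumed way to keep per-frame cost bounded.

```python
from collections import deque

def stream_frames(scaffold_renders, student_step, context_len=4):
    """Yield one output frame per scaffold render, in arrival order.

    `student_step(render, context)` is a hypothetical stand-in for the
    causal student: it sees the current render and past outputs only,
    never future frames.
    """
    context = deque(maxlen=context_len)  # fixed-size history bounds the work per frame
    for render in scaffold_renders:
        frame = student_step(render, tuple(context))
        context.append(frame)  # oldest frame is dropped once the window is full
        yield frame
```

Because each frame is emitted as soon as its render arrives, a viewer never waits for a whole clip to be denoised, which is the property a bidirectional teacher lacks.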

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of scaffold construction from rendering could be applied to short video inputs, which would provide richer initial geometry.
  • Adding simple dynamics on the Gaussian scaffold might allow basic object interactions without retraining the renderer.
  • Further compression of the student model could support deployment on lower-power devices for mobile scene exploration.

Load-bearing premise

The topology-aware diffusion reliably produces a geometrically consistent 360° panorama, free of errors that would propagate into the later panoramic geometry-aware residual prediction.

What would settle it

Visible geometric drift, seams, or view-inconsistent artifacts appearing in the output video when the camera trajectory enters regions far outside the original narrow field of view.

read the original abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents MoVerse, a real-time video world model that constructs an interactively navigable 3D scene from a single narrow-FOV image. It first applies topology-aware diffusion to expand the input into a gravity-aligned 360° panorama, then lifts the panorama into a persistent 3D Gaussian scaffold via panoramic geometry-aware residual prediction. A Gaussian-conditioned video renderer, trained as a bidirectional diffusion teacher and then distilled into a causal autoregressive student, translates scaffold renderings along user-specified trajectories into photorealistic video. The system claims 8 FPS real-time roaming on a single RTX 4090 GPU.

Significance. If the geometric consistency and real-time claims hold, the work offers a practical path to single-image world modeling that combines explicit 3D representations for controllability and long-range consistency with generative video models for perceptual quality. The separation of world construction from rendering and the distillation for bounded-latency streaming are notable design choices that could influence future interactive 3D generation systems.

major comments (2)
  1. [Abstract / pipeline description] The central claim of persistent, controllable 3D geometry from a single image depends on the topology-aware diffusion step producing outputs that are sufficiently geometrically consistent for the subsequent panoramic geometry-aware residual prediction. No quantitative metrics (e.g., depth seam error, gravity alignment error, or propagation to scaffold drift over long trajectories) are reported to validate this assumption, which is load-bearing for the scaffold's ability to support artifact-free roaming.
  2. [Abstract / results claim] The reported 8 FPS real-time performance on RTX 4090 is a key practical result, but the manuscript provides no breakdown of per-component timings (diffusion panorama completion, scaffold construction, renderer inference) or ablation on how distillation affects latency versus quality, making it impossible to assess whether the claimed speed is robust or tied to specific unstated implementation choices.
minor comments (2)
  1. Notation for 'Panoramic Gaussian Scaffold' and 'panoramic geometry-aware residual prediction' should be defined with explicit equations or pseudocode in the methods section to clarify how residuals are computed and applied.
  2. The abstract mentions 'gravity-aligned' panorama but does not specify the alignment mechanism or any failure cases when the input image lacks clear gravity cues.
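
The referee names depth seam error and gravity alignment error without defining them. The sketch below gives one plausible definition of each, purely as an illustration of what such a check could measure; the function names, the y-up world axis, and the choice of mean absolute difference are assumptions, not metrics taken from the paper.

```python
import math

def depth_seam_error(depth):
    """Mean absolute depth gap between the first and last columns.

    In an equirectangular panorama these two columns are neighbours on
    the sphere, so a large gap suggests a geometric seam.
    """
    gaps = [abs(row[0] - row[-1]) for row in depth]
    return sum(gaps) / len(gaps)

def gravity_alignment_error_deg(estimated_up, world_up=(0.0, 1.0, 0.0)):
    """Angle in degrees between an estimated up vector and the world up axis."""
    dot = sum(a * b for a, b in zip(estimated_up, world_up))
    norm = math.sqrt(sum(a * a for a in estimated_up)) * math.sqrt(
        sum(b * b for b in world_up)
    )
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))
```

A fair seam check would compare this gap against the typical gap between adjacent interior columns, since neighbouring columns differ slightly even in a seamless panorama.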

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative validation of geometric consistency and detailed performance analysis. We address both major comments below and will incorporate additional metrics and breakdowns in the revised manuscript to better support the claims.

read point-by-point responses
  1. Referee: [Abstract / pipeline description] The central claim of persistent, controllable 3D geometry from a single image depends on the topology-aware diffusion step producing outputs that are sufficiently geometrically consistent for the subsequent panoramic geometry-aware residual prediction. No quantitative metrics (e.g., depth seam error, gravity alignment error, or propagation to scaffold drift over long trajectories) are reported to validate this assumption, which is load-bearing for the scaffold's ability to support artifact-free roaming.

    Authors: We recognize that explicit quantitative metrics for the geometric consistency of the topology-aware diffusion outputs would provide stronger support for the pipeline's assumptions. While the manuscript validates consistency through downstream visual quality, user studies on roaming, and qualitative panorama/scaffold results, we agree these specific metrics would directly address the concern. In the revised version, we will add depth seam error, gravity alignment error, and scaffold drift measurements over long trajectories, evaluated on a held-out validation set of scenes. revision: yes

  2. Referee: [Abstract / results claim] The reported 8 FPS real-time performance on RTX 4090 is a key practical result, but the manuscript provides no breakdown of per-component timings (diffusion panorama completion, scaffold construction, renderer inference) or ablation on how distillation affects latency versus quality, making it impossible to assess whether the claimed speed is robust or tied to specific unstated implementation choices.

    Authors: We agree that a per-component timing breakdown and distillation ablation are necessary to substantiate the real-time claim and allow assessment of robustness. In the revised manuscript, we will include a table with averaged inference times for panorama diffusion, scaffold construction, and renderer stages on the RTX 4090, along with an ablation comparing the bidirectional teacher and causal student models on both latency and quality metrics. This will clarify the contribution of distillation to achieving bounded-latency performance. revision: yes
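
A per-component timing table of the kind promised above could be produced with a harness like the following. It is a minimal sketch: the stage callables are hypothetical stand-ins for the real components, and on a GPU the asynchronous kernel launches would need a device synchronization before each clock read, which this CPU-only version omits.

```python
import time

def time_stages(stages, repeats=5):
    """Return mean wall-clock milliseconds per named stage.

    `stages` maps a label to a zero-argument callable; the callables are
    hypothetical stand-ins for the real pipeline components.
    """
    means = {}
    for name, fn in stages.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn()
        means[name] = (time.perf_counter() - start) * 1000.0 / repeats
    return means

def frame_budget_ms(fps):
    """Per-frame time budget implied by a target frame rate."""
    return 1000.0 / fps
```

At the claimed 8 FPS the budget is 125 ms per frame. Presumably only the renderer stage must fit inside it, since panorama completion and scaffold construction would run once per scene, but the page does not confirm that split.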

Circularity Check

0 steps flagged

No circularity; pipeline uses external standard components

full rationale

The abstract and described pipeline separate world construction (topology-aware diffusion for 360° panorama, then panoramic geometry-aware residual prediction for Gaussian scaffold) from rendering (Gaussian-conditioned video model with teacher-student distillation). No equations, fitted parameters renamed as predictions, or self-citations are shown that reduce any load-bearing claim to its own inputs by construction. All steps invoke standard external techniques (diffusion models, 3D Gaussian representations) without self-referential definitions or uniqueness theorems imported from the authors' prior work. The real-time performance claim is presented as an empirical outcome on RTX 4090 hardware rather than a tautological derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on domain assumptions about diffusion consistency and Gaussian representability plus one invented entity; no explicit free parameters are named in the abstract.

axioms (1)
  • domain assumption: Topology-aware diffusion can produce geometrically consistent 360° panoramas from narrow-FOV inputs without breaking downstream 3D lifting.
    Invoked in the first stage of world construction.
invented entities (1)
  • Panoramic Gaussian Scaffold (no independent evidence)
    purpose: Persistent dense 3D spatial memory that is directly renderable and supports controllable camera motion.
    New representation introduced to separate world construction from observation rendering.

pith-pipeline@v0.9.1-grok · 5780 in / 1222 out tokens · 24309 ms · 2026-06-27T07:19:43.472111+00:00 · methodology
