MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Haofeng Liu; Jing Li; Jun Liang; Shengfeng He; Yang Zhou; Yuqin Lu; Ziheng Wang

arxiv: 2606.13376 · v2 · pith:HQJS6JBUnew · submitted 2026-06-11 · 💻 cs.CV

Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li

Pith reviewed 2026-06-27 07:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords real-time video world modeling · panoramic Gaussian scaffold · single-image scene creation · 3D Gaussian representation · diffusion model distillation · interactive navigation · topology-aware panorama expansion

The pith

MoVerse turns one narrow-view image into a real-time navigable video world at 8 FPS on consumer hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a pipeline that builds an interactively explorable scene from a single limited photograph. It first completes the unobserved surroundings into a gravity-aligned 360-degree panorama using topology-aware diffusion. It then converts that panorama into a persistent 3D Gaussian scaffold through geometry-aware residual prediction. Finally, a distilled causal video renderer converts scaffold views into coherent video frames along any user-specified trajectory. The separation of explicit 3D construction from generative rendering yields both long-range geometric consistency and real-time performance on a single RTX 4090 GPU.

Core claim

MoVerse separates world construction from observation rendering in three steps. It expands the narrow input into a gravity-aligned 360° panorama with topology-aware diffusion, lifts the panorama into a dense 3D Gaussian scaffold via panoramic geometry-aware residual prediction, and translates scaffold renderings into photorealistic video with a Gaussian-conditioned renderer. That renderer is distilled from a bidirectional diffusion teacher into a causal autoregressive student for bounded-latency streaming.
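The difference between a bidirectional teacher and a causal student can be pictured as a visibility mask over frames. The sketch below is an illustration only: the abstract does not describe the renderer's architecture, so the use of an attention-style mask and the name `frame_mask` are assumptions made here.

```python
import numpy as np

def frame_mask(num_frames: int, causal: bool) -> np.ndarray:
    """Boolean (T, T) mask; entry [t, s] is True when frame t may use frame s."""
    full = np.ones((num_frames, num_frames), dtype=bool)
    # Lower-triangular mask: frame t only sees frames 0..t.
    return np.tril(full) if causal else full

teacher = frame_mask(4, causal=False)  # every frame sees the whole clip
student = frame_mask(4, causal=True)   # frame t sees only frames 0..t

# Frame 0 of the student depends on nothing that comes later,
# so it can be emitted before later frames exist.
print(student.astype(int))
```

Under this picture, a bidirectional model must have the whole clip before any frame is final, while a causal model can stream frames one at a time, which is what makes a per-frame latency bound possible.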

What carries the argument

Panoramic Gaussian scaffold: the dense, directly renderable 3D spatial memory created from the completed panorama that supplies consistent geometry for subsequent video rendering.
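One standard way to turn a panorama plus per-pixel depth into 3D positions is equirectangular unprojection. The sketch below assumes a y-up axis convention, depth stored as distance along each ray, and hypothetical function names. It is not the paper's geometry-aware residual prediction, which the abstract does not specify; it only shows how panorama pixels could seed candidate Gaussian centers.

```python
import numpy as np

def equirect_rays(height: int, width: int) -> np.ndarray:
    """Unit ray directions (H, W, 3) for an equirectangular image, y axis up."""
    # Pixel centers mapped to longitude in (-pi, pi) and latitude in (-pi/2, pi/2).
    u = (np.arange(width) + 0.5) / width
    v = (np.arange(height) + 0.5) / height
    lon = (u - 0.5) * 2.0 * np.pi
    lat = (0.5 - v) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both (H, W)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)

def unproject(depth: np.ndarray) -> np.ndarray:
    """Lift an (H, W) range map to (H*W, 3) points."""
    rays = equirect_rays(*depth.shape)
    return (rays * depth[..., None]).reshape(-1, 3)

# Constant depth of 2 yields points on a sphere of radius 2.
points = unproject(np.full((64, 128), 2.0))
print(points.shape)
```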

If this is right

The explicit 3D scaffold supplies long-range consistency that pure generative video models lack.
User-specified camera trajectories can be followed controllably while maintaining temporal coherence.
Distillation from bidirectional teacher to causal student enables 8 FPS streaming on a single consumer GPU.
The pipeline combines the controllability of explicit 3D representations with the perceptual quality of generative video models.
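The 8 FPS figure translates directly into an average per-frame time budget. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the only number taken from the paper is the 8 FPS rate.

```python
def frame_budget_ms(fps: float) -> float:
    """Return the average time available per frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps

# At the reported 8 FPS, each streamed frame must take 125 ms on average.
budget = frame_budget_ms(8.0)
print(f"{budget:.1f} ms per frame")  # 125.0 ms per frame
```

Everything that runs per frame, scaffold rasterization plus the student's inference, has to fit inside that budget; one-time costs such as panorama completion do not.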

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of scaffold construction from rendering could be applied to short video inputs to initialize richer geometry.
Adding simple dynamics on the Gaussian scaffold might allow basic object interactions without retraining the renderer.
Further compression of the student model could support deployment on lower-power devices for mobile scene exploration.

Load-bearing premise

The topology-aware diffusion reliably produces a geometrically consistent 360° panorama without errors that propagate into the later panoramic geometry-aware residual prediction.

What would settle it

Visible geometric drift, seams, or view-inconsistent artifacts appearing in the output video when the camera trajectory enters regions far outside the original narrow field of view.

Original abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MoVerse chains panoramic diffusion, residual Gaussian lifting, and distillation into a concrete single-image to real-time video pipeline, but the geometric consistency of the first step remains unverified.

The paper's core contribution is a three-stage pipeline. It starts with a narrow-FOV image, expands it to a gravity-aligned 360° panorama via topology-aware diffusion, lifts that into a persistent 3D Gaussian scaffold through panoramic geometry-aware residual prediction, and then renders user-driven camera paths as video using a distilled causal student model.

What is actually new is the specific ordering and integration: using diffusion only for the panorama completion step, then explicit residuals for the scaffold, followed by bidirectional-to-causal distillation to hit bounded latency. Prior Gaussian work and video diffusion papers do not describe this exact sequence.

The design choice to separate world construction from rendering is sensible. It gives the system an explicit spatial memory that should support longer trajectories without the drift common in pure generative video models.

The main soft spot is the load-bearing premise flagged above. Everything downstream depends on the diffusion step producing a panorama whose geometry is accurate enough for the residual prediction to build a usable scaffold. Local depth errors or misalignments there will propagate directly. The abstract states the 8 FPS figure on an RTX 4090 but supplies no supporting numbers on panorama consistency, depth accuracy, or trajectory drift, so the abstract alone cannot show whether the central assumption holds.

No circular definitions or invented entities appear in the claims. The method builds on existing techniques without reducing the result to a fitted parameter.

This is for computer vision researchers working on single-image scene reconstruction and interactive world models. A reader who needs a practical system that mixes explicit 3D with generative rendering would find the pipeline worth examining.

It deserves peer review because the architecture is clearly described and the performance target is concrete enough to test, even though the geometric fidelity question will require detailed results and ablations.

Referee Report

2 major / 2 minor

Summary. The paper presents MoVerse, a real-time video world model that constructs an interactively navigable 3D scene from a single narrow-FOV image. It first applies topology-aware diffusion to expand the input into a gravity-aligned 360° panorama, then lifts the panorama into a persistent 3D Gaussian scaffold via panoramic geometry-aware residual prediction. A Gaussian-conditioned video renderer, trained as a bidirectional diffusion teacher and distilled into a causal autoregressive student, translates scaffold renderings along user-specified trajectories into photorealistic video. The authors report 8 FPS real-time roaming on a single RTX 4090 GPU.

Significance. If the geometric consistency and real-time claims hold, the work offers a practical path to single-image world modeling that combines explicit 3D representations for controllability and long-range consistency with generative video models for perceptual quality. The separation of world construction from rendering and the distillation for bounded-latency streaming are notable design choices that could influence future interactive 3D generation systems.

major comments (2)

[Abstract / pipeline description] The central claim of persistent, controllable 3D geometry from a single image depends on the topology-aware diffusion step producing outputs that are sufficiently geometrically consistent for the subsequent panoramic geometry-aware residual prediction. No quantitative metrics (e.g., depth seam error, gravity alignment error, or propagation to scaffold drift over long trajectories) are reported to validate this assumption, which is load-bearing for the scaffold's ability to support artifact-free roaming.
[Abstract / results claim] The reported 8 FPS real-time performance on RTX 4090 is a key practical result, but the manuscript provides no breakdown of per-component timings (diffusion panorama completion, scaffold construction, renderer inference) or ablation on how distillation affects latency versus quality, making it impossible to assess whether the claimed speed is robust or tied to specific unstated implementation choices.
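The first comment names a "depth seam error" without defining it. One possible definition, offered here as an assumption rather than as the referee's or the authors' metric, is the mean absolute depth jump across the left/right wraparound of an equirectangular depth map. The function name and the synthetic inputs are hypothetical.

```python
import numpy as np

def depth_seam_error(depth: np.ndarray) -> float:
    """Mean absolute depth jump across the wraparound of an (H, W) panorama."""
    # Column 0 and column W-1 are neighbours on the sphere.
    return float(np.mean(np.abs(depth[:, 0] - depth[:, -1])))

# Depth that is periodic in longitude closes the seam almost exactly.
lon = np.linspace(-np.pi, np.pi, 256, endpoint=False)
smooth = np.tile(3.0 + np.cos(lon), (128, 1))

# A linear ramp in longitude does not wrap, so the seam shows a large jump.
broken = np.tile(np.linspace(1.0, 5.0, 256), (128, 1))

print(depth_seam_error(smooth), depth_seam_error(broken))
```

A fuller metric would probably normalize by the typical step between adjacent interior columns, so that scenes with steep but legitimate depth gradients are not penalized.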

minor comments (2)

Notation for 'Panoramic Gaussian Scaffold' and 'panoramic geometry-aware residual prediction' should be defined with explicit equations or pseudocode in the methods section to clarify how residuals are computed and applied.
The abstract mentions 'gravity-aligned' panorama but does not specify the alignment mechanism or any failure cases when the input image lacks clear gravity cues.
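If a reference up direction is available, for example from a dataset with known camera poses, gravity alignment could be scored as the angle between the estimated and reference up vectors. This is a sketch of one plausible metric under that assumption; the function name and the y-up default are hypothetical, and nothing here describes the paper's alignment mechanism.

```python
import numpy as np

def gravity_alignment_error_deg(estimated_up, reference_up=(0.0, 1.0, 0.0)) -> float:
    """Angle in degrees between an estimated up vector and a reference up vector."""
    a = np.asarray(estimated_up, dtype=float)
    b = np.asarray(reference_up, dtype=float)
    cos_angle = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# An up vector tilted by 5 degrees about the z axis.
tilt = np.radians(5.0)
error = gravity_alignment_error_deg([np.sin(tilt), np.cos(tilt), 0.0])
print(round(error, 3))
```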

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative validation of geometric consistency and detailed performance analysis. We address both major comments below and will incorporate additional metrics and breakdowns in the revised manuscript to better support the claims.

Referee: [Abstract / pipeline description] The central claim of persistent, controllable 3D geometry from a single image depends on the topology-aware diffusion step producing outputs that are sufficiently geometrically consistent for the subsequent panoramic geometry-aware residual prediction. No quantitative metrics (e.g., depth seam error, gravity alignment error, or propagation to scaffold drift over long trajectories) are reported to validate this assumption, which is load-bearing for the scaffold's ability to support artifact-free roaming.

Authors: We recognize that explicit quantitative metrics for the geometric consistency of the topology-aware diffusion outputs would provide stronger support for the pipeline's assumptions. While the manuscript validates consistency through downstream visual quality, user studies on roaming, and qualitative panorama/scaffold results, we agree these specific metrics would directly address the concern. In the revised version, we will add depth seam error, gravity alignment error, and scaffold drift measurements over long trajectories, evaluated on a held-out validation set of scenes. revision: yes
Referee: [Abstract / results claim] The reported 8 FPS real-time performance on RTX 4090 is a key practical result, but the manuscript provides no breakdown of per-component timings (diffusion panorama completion, scaffold construction, renderer inference) or ablation on how distillation affects latency versus quality, making it impossible to assess whether the claimed speed is robust or tied to specific unstated implementation choices.

Authors: We agree that a per-component timing breakdown and distillation ablation are necessary to substantiate the real-time claim and allow assessment of robustness. In the revised manuscript, we will include a table with averaged inference times for panorama diffusion, scaffold construction, and renderer stages on the RTX 4090, along with an ablation comparing the bidirectional teacher and causal student models on both latency and quality metrics. This will clarify the contribution of distillation to achieving bounded-latency performance. revision: yes
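A per-component timing table of the kind promised here could be collected with a small wall-clock harness. The sketch below is generic: the `StageTimer` class, the stage name, and the placeholder workload are all hypothetical, and no MoVerse code is assumed.

```python
import time
from collections import defaultdict

class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def run(self, name, fn, *args, **kwargs):
        """Call fn, record its elapsed time under `name`, and return its result."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.totals[name] += time.perf_counter() - start
        self.counts[name] += 1
        return result

    def mean_ms(self):
        """Mean time per call for each stage, in milliseconds."""
        return {k: 1000.0 * self.totals[k] / self.counts[k] for k in self.totals}

timer = StageTimer()
for _ in range(3):
    # Placeholder workload standing in for one renderer step.
    timer.run("renderer_step", lambda: sum(range(10_000)))
print(timer.mean_ms())
```

On a GPU the timed function would also need to block until queued work has finished, for instance by calling `torch.cuda.synchronize()` in PyTorch before the timer stops, or the measurement would only capture kernel launch time.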

Circularity Check

0 steps flagged

No circularity; pipeline uses external standard components

The abstract and described pipeline separate world construction (topology-aware diffusion for 360° panorama, then panoramic geometry-aware residual prediction for Gaussian scaffold) from rendering (Gaussian-conditioned video model with teacher-student distillation). No equations, fitted parameters renamed as predictions, or self-citations are shown that reduce any load-bearing claim to its own inputs by construction. All steps invoke standard external techniques (diffusion models, 3D Gaussian representations) without self-referential definitions or uniqueness theorems imported from the authors' prior work. The real-time performance claim is presented as an empirical outcome on RTX 4090 hardware rather than a tautological derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on domain assumptions about diffusion consistency and Gaussian representability plus one invented entity; no explicit free parameters are named in the abstract.

axioms (1)

Domain assumption: topology-aware diffusion can produce geometrically consistent 360° panoramas from narrow-FOV inputs without breaking downstream 3D lifting.
Invoked in the first stage of world construction.

invented entities (1)

Panoramic Gaussian Scaffold (no independent evidence)
purpose: Persistent dense 3D spatial memory that is directly renderable and supports controllable camera motion.
New representation introduced to separate world construction from observation rendering.


