pith. sign in

arxiv: 2606.24876 · v1 · pith:KVBSA65Unew · submitted 2026-06-23 · 💻 cs.CV

FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Pith reviewed 2026-06-26 00:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D scene generationtriangle splattingvideo diffusion modelsfeedforward decodinggeometric accuracysurface primitivesdifferentiable rendering
0
0 comments X

The pith

Video diffusion latents can be decoded directly into oriented triangle splats for 3D scenes with improved geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video diffusion models implicitly capture multi-view geometry in compressed latents, yet most feedforward decoders still output volumetric Gaussians that lack defined surfaces. The paper demonstrates that these latents can instead be mapped in one pass to explicit triangle splats, which are surface-aligned and closer to usable 3D assets. Two components make the regression stable: a ray-centered rotation parameterization that reduces orientation sensitivity and a product window function that strengthens gradients through differentiable rendering. On benchmarks this produces markedly better geometric accuracy than Gaussian baselines while visual quality stays competitive. A short test-time step then converts the output into opaque, real-time renderable meshes for game engines.

Core claim

Triangle splats can be decoded directly from video diffusion latents for the first time. A ray-centered rotation parameterization handles the high sensitivity of flat primitive orientations, while a novel product window function improves gradient flow during differentiable triangle rendering. When trained under the same conditions as 3DGS and 2DGS variants, the resulting scenes show significantly better geometric accuracy with competitive visual quality, and a lightweight refinement converts the triangle soup into fully opaque assets.

What carries the argument

Ray-centered rotation parameterization for triangle orientation regression, combined with a product window function for stable differentiable rendering of flat primitives.

If this is right

  • Feedforward single-image scene generation produces outputs with measurably higher geometric accuracy than Gaussian splatting baselines.
  • Surface-aligned triangle primitives integrate more directly into standard graphics pipelines and simulation tools than volumetric representations.
  • A lightweight test-time refinement converts the predicted triangles into opaque meshes that support real-time rendering.
  • Systematic comparison under identical training reveals concrete tradeoffs among 3DGS, 2DGS, and triangle splatting representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other latent diffusion backbones if their latents prove similarly rich in geometric cues.
  • Surface primitives may simplify downstream tasks such as collision detection or texture baking that currently require extra post-processing of Gaussian outputs.
  • Testing whether the same parameterization works when the input is a short video clip rather than a single image would clarify how much multi-view information the latents must contain.

Load-bearing premise

The compressed latents of existing video diffusion models already encode sufficient explicit multi-view geometric structure to support direct regression of oriented triangle primitives without additional geometric supervision or iterative refinement during training.

What would settle it

An ablation that trains the identical decoder architecture on the same latents but outputs 3D Gaussians instead of triangles and measures no geometric accuracy improvement, or removal of the rotation parameterization that causes training to fail to converge on flat primitives.

Figures

Figures reproduced from arXiv: 2606.24876 by Christian Rupprecht, Fabian Manhardt, Federico Tombari, Goutam Bhat, Orest Kupyn, Philipp Henzler.

Figure 1
Figure 1. Figure 1: FLAT regress soft triangles directly from video diffusion latent, enabling geometrically [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline: Starting from a single image, we construct a point-cloud-based control video by rendering along the target camera trajectory. The control video and camera embeddings condition a frozen video diffusion model [7]. The scene decoder then fuses denoised video latent with the camera latent and decodes triangle-splat scene representation for novel-view synthesis. optimization. However, extending feedfo… view at source ↗
Figure 3
Figure 3. Figure 3: Window Function: Comparison of sigmoid-based window function [26, 14], max edge distance is used in [25] and ours. FLAT function extends the influence outside the triangle boundary and improves gradient flow by routing to all three vertices. require additional constraints. Thus, feedforward training is particularly sensitive to the choice of parameterization. We predict each triangle relative to a ray-cent… view at source ↗
Figure 4
Figure 4. Figure 4: Geometric Quality: The latent triangle model generates finer, more accurate geometry compared to Gaussian representations that are optimized for visual quality, while still maintaining high rendering fidelity [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pipeline Flexibility: FLAT replaces the standard RGB decoder with a latent scene decoder. Because multiple Wan variants share the same latent space, our scene decoder can be attached to any of these, including image-to-video, text-to-video, video-to-video, interactive, and world-consistent pipelines [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Text-to-3D Scene: Examples obtained by attaching FLAT to a Wan-2.1 text-to-video pipeline. For each scene, we show rendered views together with the corresponding predicted normal map. The examples demonstrate that the same latent scene decoder can convert text-to-video model latents into explicit geometry. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: , where optimization improves both rendered appearance and predicted normals [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Metric Limitations: Gaussians are optimized for PSNR directly due to their inherent smoothness. The triangle model often generates sharper details while achieving lower PSNR [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative Results: More qualitative results covering indoor, outdoor, and object-centric scenes, focusing on surface and visual quality. Each sample consists of input image, novel view and novel view normal map. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Failure Cases: Thin, elongated surfaces, tiny details and reflections remain challenging to model with triangles [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Converted Mesh: Top Row: semi opaque triangles predicted by the model. Bottow Row: opaque game engine compatible mesh generated by lightweight conversion step. The scene render remains high as strong semi-opaque geometrically accurate initial predictions simplify conversion process. E Scene Decoder Architecture Scene decoder matches Wan-2.1 VAE architecture. It utilizes 3D causal convolution (CausalConv3D… view at source ↗
Figure 12
Figure 12. Figure 12: Cross-Platform Rendering: Rendering raw output without any postprocessing or mesh cleanup. The converted solid triangles can be rasterized by any rendering engine across various platforms, supporting high-resolution and high-fps efficient rendering across devices. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 11
Figure 11. Figure 11: The resulting mesh can be effectively rendered on any platform with high efficiency. We [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
read the original abstract

Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and, for the first time, show that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging due to high sensitivity to primitive orientations, oftentimes leading to poor gradient flow. FLAT solves with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy while maintaining competitive visual quality compared to state-of-the-art feedforward baselines. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at https://flat-splat.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces FLAT, a feedforward decoder that maps compressed latents from video diffusion models directly to oriented triangle splats for single-image 3D scene generation. It proposes a ray-centered rotation parameterization and a product window function to mitigate poor gradient flow when regressing flat primitives, claims significantly improved geometric accuracy over 3D/2D Gaussian baselines under identical training, and adds a lightweight test-time refinement step to produce opaque, real-time renderable assets. A systematic comparison of representation tradeoffs (3DGS, 2DGS, triangle splatting) is also presented.

Significance. If the geometric accuracy gains are substantiated by the experiments, the work would advance feedforward scene generation by shifting from volumetric to surface-aligned primitives that are closer to explicit assets usable in simulation and graphics pipelines. The identical-training-setup comparison of representations is a clear strength that enables fair tradeoff analysis.

major comments (2)
  1. [Abstract] Abstract and § on method: The claim that existing video diffusion latents already embed sufficient explicit multi-view geometric structure for single-pass oriented-triangle regression is load-bearing for the 'direct decoding' premise, yet the provided text gives no direct measurement (e.g., latent-to-geometry probe accuracy or ablation that removes geometric cues from the diffusion prior). Decoder-side fixes alone cannot supply absent information.
  2. [Abstract] Abstract: The statement that triangle splats achieve 'significantly better geometric accuracy' while remaining competitive in visual quality requires the quantitative tables and error analysis that are referenced but not visible in the supplied text; without those numbers the central empirical claim cannot be evaluated.
minor comments (1)
  1. The project page URL is given but the abstract contains no numerical results, benchmark names, or ablation tables, making it difficult to assess the scale of the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, with planned revisions where the concerns are valid.

read point-by-point responses
  1. Referee: [Abstract] Abstract and § on method: The claim that existing video diffusion latents already embed sufficient explicit multi-view geometric structure for single-pass oriented-triangle regression is load-bearing for the 'direct decoding' premise, yet the provided text gives no direct measurement (e.g., latent-to-geometry probe accuracy or ablation that removes geometric cues from the diffusion prior). Decoder-side fixes alone cannot supply absent information.

    Authors: We agree that a direct probe or ablation isolating the geometric content of the latents would provide stronger support for the premise. The current evidence is indirect via the decoder's superior geometric metrics relative to Gaussian baselines under identical training. We will add a linear probe experiment on the latents for depth/normal prediction and a control ablation using a non-geometric encoder in the revised Section 4 to quantify the prior's contribution. revision: yes

  2. Referee: [Abstract] Abstract: The statement that triangle splats achieve 'significantly better geometric accuracy' while remaining competitive in visual quality requires the quantitative tables and error analysis that are referenced but not visible in the supplied text; without those numbers the central empirical claim cannot be evaluated.

    Authors: The supporting numbers appear in the full manuscript's Tables 1–3 and Figures 4–6 (Section 4), which report depth error, normal consistency, PSNR/SSIM/LPIPS, and error distributions for all representations under matched training. We will revise the abstract to include one or two key quantitative deltas and ensure explicit table references appear in the method and results sections. revision: partial

Circularity Check

0 steps flagged

No circularity: novel decoder components and empirical comparisons are independent of inputs

full rationale

The paper's core contribution consists of two explicitly introduced decoder innovations (ray-centered rotation parameterization and product window function) that address gradient issues in triangle regression from external video diffusion latents. These are not derived from or equivalent to the target outputs by construction, nor do any equations reduce fitted parameters to predictions. The evaluation compares representations under identical training setups using external priors, with no load-bearing self-citations or ansatzes imported from prior author work. The assumption that latents contain usable geometry is tested empirically rather than presupposed definitionally, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that diffusion latents contain usable geometric structure.

pith-pipeline@v0.9.1-grok · 5841 in / 1089 out tokens · 27029 ms · 2026-06-26T00:09:23.132397+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

77 extracted references · 19 linked inside Pith

  1. [1]

    Onestory: Coherent multi-shot video generation with adaptive memory

    Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, et al. Onestory: Coherent multi-shot video generation with adaptive memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16173–16184, 2026

  2. [2]

    Vggrpo: Towards world-consistent video generation with 4d latent reward.arXiv preprint arXiv:2603.26599, 2026

    Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, and Marta Tintore Gazulla. Vggrpo: Towards world-consistent video generation with 4d latent reward.arXiv preprint arXiv:2603.26599, 2026

  3. [3]

    Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, and Xu- anchi Ren

    Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, and Xu- anchi Ren. Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation. InInternational Conference on Learning Representations (ICLR), 2026

  4. [4]

    Normalcrafter: Learning temporally consistent normals from video diffusion priors

    Yanrui Bin, Wenbo Hu, Haoyuan Wang, Xinya Chen, and Bing Wang. Normalcrafter: Learning temporally consistent normals from video diffusion priors. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339, 2025

  5. [5]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

  6. [6]

    Baking gaussian splatting into diffusion denoiser for fast and scalable single-stage image-to-3d generation and reconstruction

    Yuanhao Cai, He Zhang, Kai Zhang, Yixun Liang, Mengwei Ren, Fujun Luan, Qing Liu, Soo Ye Kim, Jianming Zhang, Zhifei Zhang, et al. Baking gaussian splatting into diffusion denoiser for fast and scalable single-stage image-to-3d generation and reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25062–25072, 2025

  7. [7]

    Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation

    Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, and Yanwei Fu. Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation. InProceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–12, 2025

  8. [8]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024

  9. [9]

    Beyond gaussians: Fast and high-fidelity 3d splatting with linear kernels

    Haodong Chen, Runnan Chen, Qiang Qu, Zhaoqing Wang, Tongliang Liu, Xiaoming Chen, and Yuk Ying Chung. Beyond gaussians: Fast and high-fidelity 3d splatting with linear kernels. arXiv preprint arXiv:2411.12440, 2024. 10

  10. [10]

    Learning to predict 3d objects with an interpolation-based differentiable renderer

    Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. Learning to predict 3d objects with an interpolation-based differentiable renderer. Advances in neural information processing systems, 32, 2019

  11. [11]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InEuropean conference on computer vision, pages 370–386. Springer, 2024

  12. [12]

    Wan-move: Motion-controllable video generation via latent trajectory guidance.Advances in Neural Information Processing Systems, 38:404–432, 2026

    Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Dingdong W ANG, Hong- wei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, et al. Wan-move: Motion-controllable video generation via latent trajectory guidance.Advances in Neural Information Processing Systems, 38:404–432, 2026

  13. [13]

    Self-forcing++: Towards minute-scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025

  14. [14]

    Cvxnet: Learnable convex decomposition

    Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. Cvxnet: Learnable convex decomposition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 31–44, 2020

  15. [15]

    An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

  16. [16]

    Videogpa: Distilling geometry priors for 3d-consistent video generation.arXiv preprint arXiv:2601.23286, 2026

    Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, and Yue Wang. Videogpa: Distilling geometry priors for 3d-consistent video generation.arXiv preprint arXiv:2601.23286, 2026

  17. [17]

    Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

  18. [18]

    Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

  19. [19]

    Radiant foam: Real-time differentiable ray tracing

    Shrisudhan Govindarajan, Daniel Rebain, Kwang Moo Yi, and Andrea Tagliasacchi. Radiant foam: Real-time differentiable ray tracing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4135–4145, 2025

  20. [20]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  21. [21]

    Anyflow: Any-step video diffusion model with on-policy flow map distillation.arXiv preprint arXiv:2605.13724, 2026

    Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, and Mike Zheng Shou. Anyflow: Any-step video diffusion model with on-policy flow map distillation.arXiv preprint arXiv:2605.13724, 2026

  22. [22]

    Milo: Mesh-in-the-loop gaussian splatting for detailed and efficient surface reconstruction.ACM Transactions on Graphics (TOG), 44(6):1–15, 2025

    Antoine Guédon, Diego Gomez, Nissim Maruani, Bingchen Gong, George Drettakis, and Maks Ovsjanikov. Milo: Mesh-in-the-loop gaussian splatting for detailed and efficient surface reconstruction.ACM Transactions on Graphics (TOG), 44(6):1–15, 2025

  23. [23]

    Ges: Generalized exponential splatting for efficient radiance field rendering

    Abdullah Hamdi, Luke Melas-Kyriazi, Jinjie Mai, Guocheng Qian, Ruoshi Liu, Carl V ondrick, Bernard Ghanem, and Andrea Vedaldi. Ges: Generalized exponential splatting for efficient radiance field rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19812–19822, 2024

  24. [24]

    Meshsplatting: Differentiable rendering with opaque meshes.arXiv preprint arXiv:2512.06818, 2025

    Jan Held, Sanghyun Son, Renaud Vandeghen, Daniel Rebain, Matheus Gadelha, Yi Zhou, An- thony Cioppa, Ming C Lin, Marc Van Droogenbroeck, and Andrea Tagliasacchi. Meshsplatting: Differentiable rendering with opaque meshes.arXiv preprint arXiv:2512.06818, 2025. 11

  25. [25]

    Triangle splatting for real-time radiance field rendering

    Jan Held, Renaud Vandeghen, Adrien Deliege, Abdullah Hamdi, Daniel Rebain, Silvio Giancola, Anthony Cioppa, Andrea Vedaldi, Bernard Ghanem, Andrea Tagliasacchi, et al. Triangle splatting for real-time radiance field rendering. InThirteenth International Conference on 3D Vision, 2025

  26. [26]

    3d convex splatting: Radiance field rendering with 3d smooth convexes

    Jan Held, Renaud Vandeghen, Abdullah Hamdi, Adrien Deliege, Anthony Cioppa, Silvio Giancola, Andrea Vedaldi, Bernard Ghanem, and Marc Van Droogenbroeck. 3d convex splatting: Radiance field rendering with 3d smooth convexes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21360–21369, 2025

  27. [27]

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  28. [28]

    2d gaussian splatting for geometrically accurate radiance fields

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024

  29. [29]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion.Advances in Neural Information Processing Systems, 38:167283–167308, 2026

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.Advances in Neural Information Processing Systems, 38:167283–167308, 2026

  30. [30]

    Deformable radial kernel splatting

    Yi-Hua Huang, Ming-Xian Lin, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Deformable radial kernel splatting. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21513–21523, 2025

  31. [31]

    Lvsm: A large view synthesis model with minimal 3d inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242, 2024

  32. [32]

    Mapanything: Universal feed-forward metric 3d reconstruction.arXiv preprint arXiv:2509.13414, 2025

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction.arXiv preprint arXiv:2509.13414, 2025

  33. [33]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

  34. [34]

    Neuralfield-ldm: Scene generation with hierarchical latent diffusion models

    Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, and Sanja Fidler. Neuralfield-ldm: Scene generation with hierarchical latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8496–8506, 2023

  35. [35]

    Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  36. [36]

    Epipolar geometry improves video generation models.arXiv preprint arXiv:2510.21615, 2025

    Orest Kupyn, Fabian Manhardt, Federico Tombari, and Christian Rupprecht. Epipolar geometry improves video generation models.arXiv preprint arXiv:2510.21615, 2025

  37. [37]

    S3od: Towards generalizable salient object detection with synthetic data

    Orest Kupyn, Hirokatsu Kataoka, and Christian Rupprecht. S3od: Towards generalizable salient object detection with synthetic data. InInternational Conference on Learning Representations (ICLR), 2026

  38. [38]

    Wonderland: Navigating 3d scenes from a single image

    Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3d scenes from a single image. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 798–810, 2025

  39. [39]

    Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025. 12

  40. [40]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  41. [41]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023

  42. [42]

    Soft rasterizer: A differentiable renderer for image-based 3d reasoning

    Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. InProceedings of the IEEE/CVF international conference on computer vision, pages 7708–7717, 2019

  43. [43]

    Dreamdrive: Generative 4d scene modeling from street view images

    Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, and Yue Wang. Dreamdrive: Generative 4d scene modeling from street view images. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 367–374. IEEE, 2025

  44. [44]

    Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, and Kaipeng Zhang. Yume1. 5: A text-controlled interactive world generation model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7752–7761, 2026

  45. [45]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoor- thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  46. [46]

    Video generation models as world simulators, 2024

    OpenAI. Video generation models as world simulators, 2024. URL https://openai.com/ index/video-generation-models-as-world-simulators/. Accessed: 2024

  47. [47]

    Movie gen: A cast of media foundation models, 2025

    Adam Polyak et al. Movie gen: A cast of media foundation models, 2025. URL https: //arxiv.org/abs/2410.13720

  48. [48]

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

  49. [49]

    Generative gaussian splatting: Generating 3d scenes with video diffusion priors

    Katja Schwarz, Norman Mueller, and Peter Kontschieder. Generative gaussian splatting: Generating 3d scenes with video diffusion priors. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27510–27520, 2025

  50. [50]

    Lyra 2.0: Explorable generative 3d worlds.arXiv preprint arXiv:2604.13036, 2026

    Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, and Xuanchi Ren. Lyra 2.0: Explorable generative 3d worlds.arXiv preprint arXiv:2604.13036, 2026

  51. [51]

    Mvdream: Multi- view diffusion for 3d generation.arXiv preprint arXiv:2308.16512, 2023

    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi- view diffusion for 3d generation.arXiv preprint arXiv:2308.16512, 2023

  52. [52]

    Splatter image: Ultra-fast single-view 3d reconstruction

    Stanislaw Szymanowicz, Chrisitian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10208–10217, 2024

  53. [53]

    Bolt3d: Generating 3d scenes in seconds

    Stanislaw Szymanowicz, Jason Y Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Alek- sander Holynski, Ricardo Martin-Brualla, Jonathan T Barron, and Philipp Henzler. Bolt3d: Generating 3d scenes in seconds. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24846–24857, 2025

  54. [54]

    3d gaussian flats: Hybrid 2d/3d photometric scene reconstruction.arXiv preprint arXiv:2509.16423, 2025

    Maria Taktasheva, Lily Goli, Alessandro Fiorini, Zhen Li, Daniel Rebain, and Andrea Tagliasac- chi. 3d gaussian flats: Hybrid 2d/3d photometric scene reconstruction.arXiv preprint arXiv:2509.16423, 2025

  55. [55]

    Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 13

  56. [56]

    Ati: Any trajectory instruction for controllable video generation.arXiv preprint arXiv:2505.22944, 2025

    Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, and Chongyang Ma. Ati: Any trajectory instruction for controllable video generation.arXiv preprint arXiv:2505.22944, 2025

  57. [57]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  58. [58]

    Imagedream: Image-prompt multi-view diffusion for 3d generation

    Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint arXiv:2312.02201, 2023

  59. [59]

    Video models are zero-shot learners and reasoners,

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners,

  60. [60]

    URLhttps://arxiv.org/abs/2509.20328

  61. [61]

    Gs2mesh: Surface reconstruction from gaussian splatting via novel stereo views

    Yaniv Wolf, Amit Bracha, and Ron Kimmel. Gs2mesh: Surface reconstruction from gaussian splatting via novel stereo views. InEuropean Conference on Computer Vision, pages 207–224. Springer, 2024

  62. [62]

    Depthsplat: Connecting gaussian splatting and depth

    Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16453–16463, 2025

  63. [63]

    Sketch2scene: Automatic generation of interactive 3d game scenes from user’s casual sketches.arXiv preprint arXiv:2408.04567, 2024

    Yongzhi Xu, Yonhon Ng, Yifu Wang, Inkyu Sa, Yunfei Duan, Zhenhong Sun, Yang Li, Pan Ji, and Hongdong Li. Sketch2scene: Automatic generation of interactive 3d game scenes from user’s casual sketches.arXiv preprint arXiv:2408.04567, 2024

  64. [64]

    Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

    Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

  65. [65]

    X-scene: Large- scale driving scene generation with high fidelity and flexible controllability.arXiv preprint arXiv:2506.13558, 2025

    Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, and Gim Hee Lee. X-scene: Large- scale driving scene generation with high fidelity and flexible controllability.arXiv preprint arXiv:2506.13558, 2025

  66. [66]

    Holodeck: Language guided generation of 3d embodied ai environments

    Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

  67. [67]

    Yonosplat: You only need one model for feedforward 3d gaussian splatting.arXiv preprint arXiv:2511.07321, 2025

    Botao Ye, Boqi Chen, Haofei Xu, Daniel Barath, and Marc Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting.arXiv preprint arXiv:2511.07321, 2025

  68. [68]

    Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

  69. [69]

    Immersegen: Agent-guided immersive world generation with alpha-textured proxies.IEEE Transactions on Visualization and Computer Graphics, 2026

    Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, and Yuewen Ma. Immersegen: Agent-guided immersive world generation with alpha-textured proxies.IEEE Transactions on Visualization and Computer Graphics, 2026

  70. [70]

    Helios: Real real-time long video generation model.arXiv preprint arXiv:2603.04379, 2026

    Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model.arXiv preprint arXiv:2603.04379, 2026

  71. [71]

    Gs-lrm: Large reconstruction model for 3d gaussian splatting

    Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2024

  72. [72]

    The unrea- sonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 14

  73. [73]

    Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views

    Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21936–21947, 2025

  74. [74]

    Worldstereo: Bridging camera-guided video generation and scene reconstruction via 3d geometric memories.arXiv preprint arXiv:2603.02049, 2026

    Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu, Jianke Zhu, and Chunchao Guo. Worldstereo: Bridging camera-guided video generation and scene reconstruction via 3d geometric memories.arXiv preprint arXiv:2603.02049, 2026

  75. [75]

    Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements.arXiv preprint arXiv:2504.08212, 2025

    Guangcong Zheng, Teng Li, Xianpan Zhou, and Xi Li. Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements.arXiv preprint arXiv:2504.08212, 2025

  76. [76]

    Stereo magnifi- cation: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnifi- cation: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

  77. [77]

    Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

    Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4349–4359, 2025. 15 A Pipeline Flexibility A useful property of FLAT is that it generates scene pa...