FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Christian Rupprecht; Fabian Manhardt; Federico Tombari; Goutam Bhat; Orest Kupyn; Philipp Henzler

arxiv: 2606.24876 · v1 · pith:KVBSA65Unew · submitted 2026-06-23 · 💻 cs.CV

FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Orest Kupyn , Goutam Bhat , Philipp Henzler , Fabian Manhardt , Christian Rupprecht , Federico Tombari This is my paper

Pith reviewed 2026-06-26 00:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D scene generationtriangle splattingvideo diffusion modelsfeedforward decodinggeometric accuracysurface primitivesdifferentiable rendering

0 comments

The pith

Video diffusion latents can be decoded directly into oriented triangle splats for 3D scenes with improved geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video diffusion models implicitly capture multi-view geometry in compressed latents, yet most feedforward decoders still output volumetric Gaussians that lack defined surfaces. The paper demonstrates that these latents can instead be mapped in one pass to explicit triangle splats, which are surface-aligned and closer to usable 3D assets. Two components make the regression stable: a ray-centered rotation parameterization that reduces orientation sensitivity and a product window function that strengthens gradients through differentiable rendering. On benchmarks this produces markedly better geometric accuracy than Gaussian baselines while visual quality stays competitive. A short test-time step then converts the output into opaque, real-time renderable meshes for game engines.

Core claim

Triangle splats can be decoded directly from video diffusion latents for the first time. A ray-centered rotation parameterization handles the high sensitivity of flat primitive orientations, while a novel product window function improves gradient flow during differentiable triangle rendering. When trained under the same conditions as 3DGS and 2DGS variants, the resulting scenes show significantly better geometric accuracy with competitive visual quality, and a lightweight refinement converts the triangle soup into fully opaque assets.

What carries the argument

Ray-centered rotation parameterization for triangle orientation regression, combined with a product window function for stable differentiable rendering of flat primitives.

If this is right

Feedforward single-image scene generation produces outputs with measurably higher geometric accuracy than Gaussian splatting baselines.
Surface-aligned triangle primitives integrate more directly into standard graphics pipelines and simulation tools than volumetric representations.
A lightweight test-time refinement converts the predicted triangles into opaque meshes that support real-time rendering.
Systematic comparison under identical training reveals concrete tradeoffs among 3DGS, 2DGS, and triangle splatting representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to other latent diffusion backbones if their latents prove similarly rich in geometric cues.
Surface primitives may simplify downstream tasks such as collision detection or texture baking that currently require extra post-processing of Gaussian outputs.
Testing whether the same parameterization works when the input is a short video clip rather than a single image would clarify how much multi-view information the latents must contain.

Load-bearing premise

The compressed latents of existing video diffusion models already encode sufficient explicit multi-view geometric structure to support direct regression of oriented triangle primitives without additional geometric supervision or iterative refinement during training.

What would settle it

An ablation that trains the identical decoder architecture on the same latents but outputs 3D Gaussians instead of triangles and measures no geometric accuracy improvement, or removal of the rotation parameterization that causes training to fail to converge on flat primitives.

Figures

Figures reproduced from arXiv: 2606.24876 by Christian Rupprecht, Fabian Manhardt, Federico Tombari, Goutam Bhat, Orest Kupyn, Philipp Henzler.

**Figure 2.** Figure 2: Pipeline: Starting from a single image, we construct a point-cloud-based control video by rendering along the target camera trajectory. The control video and camera embeddings condition a frozen video diffusion model [7]. The scene decoder then fuses denoised video latent with the camera latent and decodes triangle-splat scene representation for novel-view synthesis. optimization. However, extending feedfo… view at source ↗

**Figure 3.** Figure 3: Window Function: Comparison of sigmoid-based window function [26, 14], max edge distance is used in [25] and ours. FLAT function extends the influence outside the triangle boundary and improves gradient flow by routing to all three vertices. require additional constraints. Thus, feedforward training is particularly sensitive to the choice of parameterization. We predict each triangle relative to a ray-cent… view at source ↗

**Figure 4.** Figure 4: Geometric Quality: The latent triangle model generates finer, more accurate geometry compared to Gaussian representations that are optimized for visual quality, while still maintaining high rendering fidelity [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Pipeline Flexibility: FLAT replaces the standard RGB decoder with a latent scene decoder. Because multiple Wan variants share the same latent space, our scene decoder can be attached to any of these, including image-to-video, text-to-video, video-to-video, interactive, and world-consistent pipelines [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Text-to-3D Scene: Examples obtained by attaching FLAT to a Wan-2.1 text-to-video pipeline. For each scene, we show rendered views together with the corresponding predicted normal map. The examples demonstrate that the same latent scene decoder can convert text-to-video model latents into explicit geometry. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: , where optimization improves both rendered appearance and predicted normals [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Metric Limitations: Gaussians are optimized for PSNR directly due to their inherent smoothness. The triangle model often generates sharper details while achieving lower PSNR [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative Results: More qualitative results covering indoor, outdoor, and object-centric scenes, focusing on surface and visual quality. Each sample consists of input image, novel view and novel view normal map. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Failure Cases: Thin, elongated surfaces, tiny details and reflections remain challenging to model with triangles [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: Converted Mesh: Top Row: semi opaque triangles predicted by the model. Bottow Row: opaque game engine compatible mesh generated by lightweight conversion step. The scene render remains high as strong semi-opaque geometrically accurate initial predictions simplify conversion process. E Scene Decoder Architecture Scene decoder matches Wan-2.1 VAE architecture. It utilizes 3D causal convolution (CausalConv3D… view at source ↗

**Figure 12.** Figure 12: Cross-Platform Rendering: Rendering raw output without any postprocessing or mesh cleanup. The converted solid triangles can be rasterized by any rendering engine across various platforms, supporting high-resolution and high-fps efficient rendering across devices. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 11.** Figure 11: The resulting mesh can be effectively rendered on any platform with high efficiency. We [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

read the original abstract

Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, limiting their use in simulation or standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and, for the first time, show that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging due to high sensitivity to primitive orientations, oftentimes leading to poor gradient flow. FLAT solves with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy while maintaining competitive visual quality compared to state-of-the-art feedforward baselines. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at https://flat-splat.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FLAT is the first direct regression of oriented triangle splats from video diffusion latents, using ray-centered rotation and a product window to fix gradient issues, but the geometric gains rest on an untested assumption about what the latents already contain.

read the letter

The paper's core move is decoding triangle splats straight from the latents of existing video diffusion models instead of Gaussians. They introduce a ray-centered rotation parameterization and a product window function to stabilize training on flat primitives, which are sensitive to orientation. They also run the same training setup across 3DGS, 2DGS, and their triangle version for a direct comparison, and add a lightweight test-time step to turn the output into an opaque mesh for real-time engines.

This is useful because surface primitives fit better into standard graphics and simulation pipelines than volumetric Gaussians. The systematic representation comparison is the part that stands out as concrete and reproducible.

The main uncertainty is whether the compressed latents already hold enough explicit multi-view geometry for single-pass regression. The abstract notes the orientation sensitivity problem but solves it only with decoder changes; there is no probe or ablation shown that measures how much geometric signal is actually present in the latents before those changes. If that signal is weak, the claimed accuracy improvements could shrink. The abstract also states the geometric gains without visible numbers or error breakdowns, so the size of the advance is still unclear.

This is for groups already working on feedforward scene generation from single images who need surface outputs rather than point clouds or volumes. It is worth sending to peer review because the representation shift and the controlled comparison are concrete enough to check in detail, even if the latent-geometry assumption needs stronger evidence.

Referee Report

2 major / 1 minor

Summary. The paper introduces FLAT, a feedforward decoder that maps compressed latents from video diffusion models directly to oriented triangle splats for single-image 3D scene generation. It proposes a ray-centered rotation parameterization and a product window function to mitigate poor gradient flow when regressing flat primitives, claims significantly improved geometric accuracy over 3D/2D Gaussian baselines under identical training, and adds a lightweight test-time refinement step to produce opaque, real-time renderable assets. A systematic comparison of representation tradeoffs (3DGS, 2DGS, triangle splatting) is also presented.

Significance. If the geometric accuracy gains are substantiated by the experiments, the work would advance feedforward scene generation by shifting from volumetric to surface-aligned primitives that are closer to explicit assets usable in simulation and graphics pipelines. The identical-training-setup comparison of representations is a clear strength that enables fair tradeoff analysis.

major comments (2)

[Abstract] Abstract and § on method: The claim that existing video diffusion latents already embed sufficient explicit multi-view geometric structure for single-pass oriented-triangle regression is load-bearing for the 'direct decoding' premise, yet the provided text gives no direct measurement (e.g., latent-to-geometry probe accuracy or ablation that removes geometric cues from the diffusion prior). Decoder-side fixes alone cannot supply absent information.
[Abstract] Abstract: The statement that triangle splats achieve 'significantly better geometric accuracy' while remaining competitive in visual quality requires the quantitative tables and error analysis that are referenced but not visible in the supplied text; without those numbers the central empirical claim cannot be evaluated.

minor comments (1)

The project page URL is given but the abstract contains no numerical results, benchmark names, or ablation tables, making it difficult to assess the scale of the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, with planned revisions where the concerns are valid.

read point-by-point responses

Referee: [Abstract] Abstract and § on method: The claim that existing video diffusion latents already embed sufficient explicit multi-view geometric structure for single-pass oriented-triangle regression is load-bearing for the 'direct decoding' premise, yet the provided text gives no direct measurement (e.g., latent-to-geometry probe accuracy or ablation that removes geometric cues from the diffusion prior). Decoder-side fixes alone cannot supply absent information.

Authors: We agree that a direct probe or ablation isolating the geometric content of the latents would provide stronger support for the premise. The current evidence is indirect via the decoder's superior geometric metrics relative to Gaussian baselines under identical training. We will add a linear probe experiment on the latents for depth/normal prediction and a control ablation using a non-geometric encoder in the revised Section 4 to quantify the prior's contribution. revision: yes
Referee: [Abstract] Abstract: The statement that triangle splats achieve 'significantly better geometric accuracy' while remaining competitive in visual quality requires the quantitative tables and error analysis that are referenced but not visible in the supplied text; without those numbers the central empirical claim cannot be evaluated.

Authors: The supporting numbers appear in the full manuscript's Tables 1–3 and Figures 4–6 (Section 4), which report depth error, normal consistency, PSNR/SSIM/LPIPS, and error distributions for all representations under matched training. We will revise the abstract to include one or two key quantitative deltas and ensure explicit table references appear in the method and results sections. revision: partial

Circularity Check

0 steps flagged

No circularity: novel decoder components and empirical comparisons are independent of inputs

full rationale

The paper's core contribution consists of two explicitly introduced decoder innovations (ray-centered rotation parameterization and product window function) that address gradient issues in triangle regression from external video diffusion latents. These are not derived from or equivalent to the target outputs by construction, nor do any equations reduce fitted parameters to predictions. The evaluation compares representations under identical training setups using external priors, with no load-bearing self-citations or ansatzes imported from prior author work. The assumption that latents contain usable geometry is tested empirically rather than presupposed definitionally, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that diffusion latents contain usable geometric structure.

pith-pipeline@v0.9.1-grok · 5841 in / 1089 out tokens · 27029 ms · 2026-06-26T00:09:23.132397+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

77 extracted references · 19 linked inside Pith

[1]

Onestory: Coherent multi-shot video generation with adaptive memory

Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, et al. Onestory: Coherent multi-shot video generation with adaptive memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16173–16184, 2026

2026
[2]

Vggrpo: Towards world-consistent video generation with 4d latent reward.arXiv preprint arXiv:2603.26599, 2026

Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, and Marta Tintore Gazulla. Vggrpo: Towards world-consistent video generation with 4d latent reward.arXiv preprint arXiv:2603.26599, 2026

arXiv 2026
[3]

Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, and Xu- anchi Ren

Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, and Xu- anchi Ren. Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation. InInternational Conference on Learning Representations (ICLR), 2026

2026
[4]

Normalcrafter: Learning temporally consistent normals from video diffusion priors

Yanrui Bin, Wenbo Hu, Haoyuan Wang, Xinya Chen, and Bing Wang. Normalcrafter: Learning temporally consistent normals from video diffusion priors. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339, 2025

2025
[5]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

2024
[6]

Baking gaussian splatting into diffusion denoiser for fast and scalable single-stage image-to-3d generation and reconstruction

Yuanhao Cai, He Zhang, Kai Zhang, Yixun Liang, Mengwei Ren, Fujun Luan, Qing Liu, Soo Ye Kim, Jianming Zhang, Zhifei Zhang, et al. Baking gaussian splatting into diffusion denoiser for fast and scalable single-stage image-to-3d generation and reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25062–25072, 2025

2025
[7]

Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation

Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, and Yanwei Fu. Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation. InProceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–12, 2025

2025
[8]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024

2024
[9]

Beyond gaussians: Fast and high-fidelity 3d splatting with linear kernels

Haodong Chen, Runnan Chen, Qiang Qu, Zhaoqing Wang, Tongliang Liu, Xiaoming Chen, and Yuk Ying Chung. Beyond gaussians: Fast and high-fidelity 3d splatting with linear kernels. arXiv preprint arXiv:2411.12440, 2024. 10

arXiv 2024
[10]

Learning to predict 3d objects with an interpolation-based differentiable renderer

Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. Learning to predict 3d objects with an interpolation-based differentiable renderer. Advances in neural information processing systems, 32, 2019

2019
[11]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InEuropean conference on computer vision, pages 370–386. Springer, 2024

2024
[12]

Wan-move: Motion-controllable video generation via latent trajectory guidance.Advances in Neural Information Processing Systems, 38:404–432, 2026

Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Dingdong W ANG, Hong- wei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, et al. Wan-move: Motion-controllable video generation via latent trajectory guidance.Advances in Neural Information Processing Systems, 38:404–432, 2026

2026
[13]

Self-forcing++: Towards minute-scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025

Pith/arXiv arXiv 2025
[14]

Cvxnet: Learnable convex decomposition

Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. Cvxnet: Learnable convex decomposition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 31–44, 2020

2020
[15]

An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

Pith/arXiv arXiv 2010
[16]

Videogpa: Distilling geometry priors for 3d-consistent video generation.arXiv preprint arXiv:2601.23286, 2026

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, and Yue Wang. Videogpa: Distilling geometry priors for 3d-consistent video generation.arXiv preprint arXiv:2601.23286, 2026

Pith/arXiv arXiv 2026
[17]

Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

Pith/arXiv arXiv 2024
[18]

Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

Pith/arXiv arXiv 2026
[19]

Radiant foam: Real-time differentiable ray tracing

Shrisudhan Govindarajan, Daniel Rebain, Kwang Moo Yi, and Andrea Tagliasacchi. Radiant foam: Real-time differentiable ray tracing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4135–4145, 2025

2025
[20]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

Pith/arXiv arXiv 2023
[21]

Anyflow: Any-step video diffusion model with on-policy flow map distillation.arXiv preprint arXiv:2605.13724, 2026

Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, and Mike Zheng Shou. Anyflow: Any-step video diffusion model with on-policy flow map distillation.arXiv preprint arXiv:2605.13724, 2026

Pith/arXiv arXiv 2026
[22]

Milo: Mesh-in-the-loop gaussian splatting for detailed and efficient surface reconstruction.ACM Transactions on Graphics (TOG), 44(6):1–15, 2025

Antoine Guédon, Diego Gomez, Nissim Maruani, Bingchen Gong, George Drettakis, and Maks Ovsjanikov. Milo: Mesh-in-the-loop gaussian splatting for detailed and efficient surface reconstruction.ACM Transactions on Graphics (TOG), 44(6):1–15, 2025

2025
[23]

Ges: Generalized exponential splatting for efficient radiance field rendering

Abdullah Hamdi, Luke Melas-Kyriazi, Jinjie Mai, Guocheng Qian, Ruoshi Liu, Carl V ondrick, Bernard Ghanem, and Andrea Vedaldi. Ges: Generalized exponential splatting for efficient radiance field rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19812–19822, 2024

2024
[24]

Meshsplatting: Differentiable rendering with opaque meshes.arXiv preprint arXiv:2512.06818, 2025

Jan Held, Sanghyun Son, Renaud Vandeghen, Daniel Rebain, Matheus Gadelha, Yi Zhou, An- thony Cioppa, Ming C Lin, Marc Van Droogenbroeck, and Andrea Tagliasacchi. Meshsplatting: Differentiable rendering with opaque meshes.arXiv preprint arXiv:2512.06818, 2025. 11

arXiv 2025
[25]

Triangle splatting for real-time radiance field rendering

Jan Held, Renaud Vandeghen, Adrien Deliege, Abdullah Hamdi, Daniel Rebain, Silvio Giancola, Anthony Cioppa, Andrea Vedaldi, Bernard Ghanem, Andrea Tagliasacchi, et al. Triangle splatting for real-time radiance field rendering. InThirteenth International Conference on 3D Vision, 2025

2025
[26]

3d convex splatting: Radiance field rendering with 3d smooth convexes

Jan Held, Renaud Vandeghen, Abdullah Hamdi, Adrien Deliege, Anthony Cioppa, Silvio Giancola, Andrea Vedaldi, Bernard Ghanem, and Marc Van Droogenbroeck. 3d convex splatting: Radiance field rendering with 3d smooth convexes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21360–21369, 2025

2025
[27]

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024
[28]

2d gaussian splatting for geometrically accurate radiance fields

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024

2024
[29]

Self forcing: Bridging the train-test gap in autoregressive video diffusion.Advances in Neural Information Processing Systems, 38:167283–167308, 2026

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.Advances in Neural Information Processing Systems, 38:167283–167308, 2026

2026
[30]

Deformable radial kernel splatting

Yi-Hua Huang, Ming-Xian Lin, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Deformable radial kernel splatting. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21513–21523, 2025

2025
[31]

Lvsm: A large view synthesis model with minimal 3d inductive bias

Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242, 2024

arXiv 2024
[32]

Mapanything: Universal feed-forward metric 3d reconstruction.arXiv preprint arXiv:2509.13414, 2025

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction.arXiv preprint arXiv:2509.13414, 2025

Pith/arXiv arXiv 2025
[33]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

2023
[34]

Neuralfield-ldm: Scene generation with hierarchical latent diffusion models

Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, and Sanja Fidler. Neuralfield-ldm: Scene generation with hierarchical latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8496–8506, 2023

2023
[35]

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[36]

Epipolar geometry improves video generation models.arXiv preprint arXiv:2510.21615, 2025

Orest Kupyn, Fabian Manhardt, Federico Tombari, and Christian Rupprecht. Epipolar geometry improves video generation models.arXiv preprint arXiv:2510.21615, 2025

Pith/arXiv arXiv 2025
[37]

S3od: Towards generalizable salient object detection with synthetic data

Orest Kupyn, Hirokatsu Kataoka, and Christian Rupprecht. S3od: Towards generalizable salient object detection with synthetic data. InInternational Conference on Learning Representations (ICLR), 2026

2026
[38]

Wonderland: Navigating 3d scenes from a single image

Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3d scenes from a single image. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 798–810, 2025

2025
[39]

Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025. 12

Pith/arXiv arXiv 2025
[40]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

2024
[41]

Zero-1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023

2023
[42]

Soft rasterizer: A differentiable renderer for image-based 3d reasoning

Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. InProceedings of the IEEE/CVF international conference on computer vision, pages 7708–7717, 2019

2019
[43]

Dreamdrive: Generative 4d scene modeling from street view images

Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, and Yue Wang. Dreamdrive: Generative 4d scene modeling from street view images. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 367–374. IEEE, 2025

2025
[44]

Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, and Kaipeng Zhang. Yume1. 5: A text-controlled interactive world generation model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7752–7761, 2026

2026
[45]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoor- thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

2021
[46]

Video generation models as world simulators, 2024

OpenAI. Video generation models as world simulators, 2024. URL https://openai.com/ index/video-generation-models-as-world-simulators/. Accessed: 2024

2024
[47]

Movie gen: A cast of media foundation models, 2025

Adam Polyak et al. Movie gen: A cast of media foundation models, 2025. URL https: //arxiv.org/abs/2410.13720

Pith/arXiv arXiv 2025
[48]

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

2020
[49]

Generative gaussian splatting: Generating 3d scenes with video diffusion priors

Katja Schwarz, Norman Mueller, and Peter Kontschieder. Generative gaussian splatting: Generating 3d scenes with video diffusion priors. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27510–27520, 2025

2025
[50]

Lyra 2.0: Explorable generative 3d worlds.arXiv preprint arXiv:2604.13036, 2026

Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, and Xuanchi Ren. Lyra 2.0: Explorable generative 3d worlds.arXiv preprint arXiv:2604.13036, 2026

Pith/arXiv arXiv 2026
[51]

Mvdream: Multi- view diffusion for 3d generation.arXiv preprint arXiv:2308.16512, 2023

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi- view diffusion for 3d generation.arXiv preprint arXiv:2308.16512, 2023

Pith/arXiv arXiv 2023
[52]

Splatter image: Ultra-fast single-view 3d reconstruction

Stanislaw Szymanowicz, Chrisitian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10208–10217, 2024

2024
[53]

Bolt3d: Generating 3d scenes in seconds

Stanislaw Szymanowicz, Jason Y Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Alek- sander Holynski, Ricardo Martin-Brualla, Jonathan T Barron, and Philipp Henzler. Bolt3d: Generating 3d scenes in seconds. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24846–24857, 2025

2025
[54]

3d gaussian flats: Hybrid 2d/3d photometric scene reconstruction.arXiv preprint arXiv:2509.16423, 2025

Maria Taktasheva, Lily Goli, Alessandro Fiorini, Zhen Li, Daniel Rebain, and Andrea Tagliasac- chi. 3d gaussian flats: Hybrid 2d/3d photometric scene reconstruction.arXiv preprint arXiv:2509.16423, 2025

arXiv 2025
[55]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 13

Pith/arXiv arXiv 2025
[56]

Ati: Any trajectory instruction for controllable video generation.arXiv preprint arXiv:2505.22944, 2025

Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, and Chongyang Ma. Ati: Any trajectory instruction for controllable video generation.arXiv preprint arXiv:2505.22944, 2025

arXiv 2025
[57]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

2025
[58]

Imagedream: Image-prompt multi-view diffusion for 3d generation

Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint arXiv:2312.02201, 2023

arXiv 2023
[59]

Video models are zero-shot learners and reasoners,

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners,
[60]

URLhttps://arxiv.org/abs/2509.20328

Pith/arXiv arXiv
[61]

Gs2mesh: Surface reconstruction from gaussian splatting via novel stereo views

Yaniv Wolf, Amit Bracha, and Ron Kimmel. Gs2mesh: Surface reconstruction from gaussian splatting via novel stereo views. InEuropean Conference on Computer Vision, pages 207–224. Springer, 2024

2024
[62]

Depthsplat: Connecting gaussian splatting and depth

Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16453–16463, 2025

2025
[63]

Sketch2scene: Automatic generation of interactive 3d game scenes from user’s casual sketches.arXiv preprint arXiv:2408.04567, 2024

Yongzhi Xu, Yonhon Ng, Yifu Wang, Inkyu Sa, Yunfei Duan, Zhenhong Sun, Yang Li, Pan Ji, and Hongdong Li. Sketch2scene: Automatic generation of interactive 3d game scenes from user’s casual sketches.arXiv preprint arXiv:2408.04567, 2024

arXiv 2024
[64]

Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

Pith/arXiv arXiv 2025
[65]

X-scene: Large- scale driving scene generation with high fidelity and flexible controllability.arXiv preprint arXiv:2506.13558, 2025

Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, and Gim Hee Lee. X-scene: Large- scale driving scene generation with high fidelity and flexible controllability.arXiv preprint arXiv:2506.13558, 2025

arXiv 2025
[66]

Holodeck: Language guided generation of 3d embodied ai environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

2024
[67]

Yonosplat: You only need one model for feedforward 3d gaussian splatting.arXiv preprint arXiv:2511.07321, 2025

Botao Ye, Boqi Chen, Haofei Xu, Daniel Barath, and Marc Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting.arXiv preprint arXiv:2511.07321, 2025

arXiv 2025
[68]

Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Pith/arXiv arXiv 2024
[69]

Immersegen: Agent-guided immersive world generation with alpha-textured proxies.IEEE Transactions on Visualization and Computer Graphics, 2026

Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, and Yuewen Ma. Immersegen: Agent-guided immersive world generation with alpha-textured proxies.IEEE Transactions on Visualization and Computer Graphics, 2026

2026
[70]

Helios: Real real-time long video generation model.arXiv preprint arXiv:2603.04379, 2026

Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model.arXiv preprint arXiv:2603.04379, 2026

arXiv 2026
[71]

Gs-lrm: Large reconstruction model for 3d gaussian splatting

Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2024

2024
[72]

The unrea- sonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 14

2018
[73]

Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21936–21947, 2025

2025
[74]

Worldstereo: Bridging camera-guided video generation and scene reconstruction via 3d geometric memories.arXiv preprint arXiv:2603.02049, 2026

Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu, Jianke Zhu, and Chunchao Guo. Worldstereo: Bridging camera-guided video generation and scene reconstruction via 3d geometric memories.arXiv preprint arXiv:2603.02049, 2026

arXiv 2026
[75]

Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements.arXiv preprint arXiv:2504.08212, 2025

Guangcong Zheng, Teng Li, Xianpan Zhou, and Xi Li. Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements.arXiv preprint arXiv:2504.08212, 2025

arXiv 2025
[76]

Stereo magnifi- cation: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnifi- cation: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

Pith/arXiv arXiv 2018
[77]

Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4349–4359, 2025. 15 A Pipeline Flexibility A useful property of FLAT is that it generates scene pa...

2025

[1] [1]

Onestory: Coherent multi-shot video generation with adaptive memory

Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, et al. Onestory: Coherent multi-shot video generation with adaptive memory. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16173–16184, 2026

2026

[2] [2]

Vggrpo: Towards world-consistent video generation with 4d latent reward.arXiv preprint arXiv:2603.26599, 2026

Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, and Marta Tintore Gazulla. Vggrpo: Towards world-consistent video generation with 4d latent reward.arXiv preprint arXiv:2603.26599, 2026

arXiv 2026

[3] [3]

Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, and Xu- anchi Ren

Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, and Xu- anchi Ren. Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation. InInternational Conference on Learning Representations (ICLR), 2026

2026

[4] [4]

Normalcrafter: Learning temporally consistent normals from video diffusion priors

Yanrui Bin, Wenbo Hu, Haoyuan Wang, Xinya Chen, and Bing Wang. Normalcrafter: Learning temporally consistent normals from video diffusion priors. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339, 2025

2025

[5] [5]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 2024

2024

[6] [6]

Baking gaussian splatting into diffusion denoiser for fast and scalable single-stage image-to-3d generation and reconstruction

Yuanhao Cai, He Zhang, Kai Zhang, Yixun Liang, Mengwei Ren, Fujun Luan, Qing Liu, Soo Ye Kim, Jianming Zhang, Zhifei Zhang, et al. Baking gaussian splatting into diffusion denoiser for fast and scalable single-stage image-to-3d generation and reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25062–25072, 2025

2025

[7] [7]

Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation

Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, and Yanwei Fu. Uni3c: Unifying precisely 3d-enhanced camera and human motion controls for video generation. InProceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–12, 2025

2025

[8] [8]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024

2024

[9] [9]

Beyond gaussians: Fast and high-fidelity 3d splatting with linear kernels

Haodong Chen, Runnan Chen, Qiang Qu, Zhaoqing Wang, Tongliang Liu, Xiaoming Chen, and Yuk Ying Chung. Beyond gaussians: Fast and high-fidelity 3d splatting with linear kernels. arXiv preprint arXiv:2411.12440, 2024. 10

arXiv 2024

[10] [10]

Learning to predict 3d objects with an interpolation-based differentiable renderer

Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. Learning to predict 3d objects with an interpolation-based differentiable renderer. Advances in neural information processing systems, 32, 2019

2019

[11] [11]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InEuropean conference on computer vision, pages 370–386. Springer, 2024

2024

[12] [12]

Wan-move: Motion-controllable video generation via latent trajectory guidance.Advances in Neural Information Processing Systems, 38:404–432, 2026

Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Dingdong W ANG, Hong- wei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, et al. Wan-move: Motion-controllable video generation via latent trajectory guidance.Advances in Neural Information Processing Systems, 38:404–432, 2026

2026

[13] [13]

Self-forcing++: Towards minute-scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025

Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation.arXiv preprint arXiv:2510.02283, 2025

Pith/arXiv arXiv 2025

[14] [14]

Cvxnet: Learnable convex decomposition

Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. Cvxnet: Learnable convex decomposition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 31–44, 2020

2020

[15] [15]

An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020

Pith/arXiv arXiv 2010

[16] [16]

Videogpa: Distilling geometry priors for 3d-consistent video generation.arXiv preprint arXiv:2601.23286, 2026

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, and Yue Wang. Videogpa: Distilling geometry priors for 3d-consistent video generation.arXiv preprint arXiv:2601.23286, 2026

Pith/arXiv arXiv 2026

[17] [17]

Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024

Pith/arXiv arXiv 2024

[18] [18]

Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

Pith/arXiv arXiv 2026

[19] [19]

Radiant foam: Real-time differentiable ray tracing

Shrisudhan Govindarajan, Daniel Rebain, Kwang Moo Yi, and Andrea Tagliasacchi. Radiant foam: Real-time differentiable ray tracing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4135–4145, 2025

2025

[20] [20]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

Pith/arXiv arXiv 2023

[21] [21]

Anyflow: Any-step video diffusion model with on-policy flow map distillation.arXiv preprint arXiv:2605.13724, 2026

Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, and Mike Zheng Shou. Anyflow: Any-step video diffusion model with on-policy flow map distillation.arXiv preprint arXiv:2605.13724, 2026

Pith/arXiv arXiv 2026

[22] [22]

Milo: Mesh-in-the-loop gaussian splatting for detailed and efficient surface reconstruction.ACM Transactions on Graphics (TOG), 44(6):1–15, 2025

Antoine Guédon, Diego Gomez, Nissim Maruani, Bingchen Gong, George Drettakis, and Maks Ovsjanikov. Milo: Mesh-in-the-loop gaussian splatting for detailed and efficient surface reconstruction.ACM Transactions on Graphics (TOG), 44(6):1–15, 2025

2025

[23] [23]

Ges: Generalized exponential splatting for efficient radiance field rendering

Abdullah Hamdi, Luke Melas-Kyriazi, Jinjie Mai, Guocheng Qian, Ruoshi Liu, Carl V ondrick, Bernard Ghanem, and Andrea Vedaldi. Ges: Generalized exponential splatting for efficient radiance field rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19812–19822, 2024

2024

[24] [24]

Meshsplatting: Differentiable rendering with opaque meshes.arXiv preprint arXiv:2512.06818, 2025

Jan Held, Sanghyun Son, Renaud Vandeghen, Daniel Rebain, Matheus Gadelha, Yi Zhou, An- thony Cioppa, Ming C Lin, Marc Van Droogenbroeck, and Andrea Tagliasacchi. Meshsplatting: Differentiable rendering with opaque meshes.arXiv preprint arXiv:2512.06818, 2025. 11

arXiv 2025

[25] [25]

Triangle splatting for real-time radiance field rendering

Jan Held, Renaud Vandeghen, Adrien Deliege, Abdullah Hamdi, Daniel Rebain, Silvio Giancola, Anthony Cioppa, Andrea Vedaldi, Bernard Ghanem, Andrea Tagliasacchi, et al. Triangle splatting for real-time radiance field rendering. InThirteenth International Conference on 3D Vision, 2025

2025

[26] [26]

3d convex splatting: Radiance field rendering with 3d smooth convexes

Jan Held, Renaud Vandeghen, Abdullah Hamdi, Adrien Deliege, Anthony Cioppa, Silvio Giancola, Andrea Vedaldi, Bernard Ghanem, and Marc Van Droogenbroeck. 3d convex splatting: Radiance field rendering with 3d smooth convexes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21360–21369, 2025

2025

[27] [27]

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024

[28] [28]

2d gaussian splatting for geometrically accurate radiance fields

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024

2024

[29] [29]

Self forcing: Bridging the train-test gap in autoregressive video diffusion.Advances in Neural Information Processing Systems, 38:167283–167308, 2026

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.Advances in Neural Information Processing Systems, 38:167283–167308, 2026

2026

[30] [30]

Deformable radial kernel splatting

Yi-Hua Huang, Ming-Xian Lin, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Deformable radial kernel splatting. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21513–21523, 2025

2025

[31] [31]

Lvsm: A large view synthesis model with minimal 3d inductive bias

Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242, 2024

arXiv 2024

[32] [32]

Mapanything: Universal feed-forward metric 3d reconstruction.arXiv preprint arXiv:2509.13414, 2025

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction.arXiv preprint arXiv:2509.13414, 2025

Pith/arXiv arXiv 2025

[33] [33]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

2023

[34] [34]

Neuralfield-ldm: Scene generation with hierarchical latent diffusion models

Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, and Sanja Fidler. Neuralfield-ldm: Scene generation with hierarchical latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8496–8506, 2023

2023

[35] [35]

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024

[36] [36]

Epipolar geometry improves video generation models.arXiv preprint arXiv:2510.21615, 2025

Orest Kupyn, Fabian Manhardt, Federico Tombari, and Christian Rupprecht. Epipolar geometry improves video generation models.arXiv preprint arXiv:2510.21615, 2025

Pith/arXiv arXiv 2025

[37] [37]

S3od: Towards generalizable salient object detection with synthetic data

Orest Kupyn, Hirokatsu Kataoka, and Christian Rupprecht. S3od: Towards generalizable salient object detection with synthetic data. InInternational Conference on Learning Representations (ICLR), 2026

2026

[38] [38]

Wonderland: Navigating 3d scenes from a single image

Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3d scenes from a single image. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 798–810, 2025

2025

[39] [39]

Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025. 12

Pith/arXiv arXiv 2025

[40] [40]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

2024

[41] [41]

Zero-1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023

2023

[42] [42]

Soft rasterizer: A differentiable renderer for image-based 3d reasoning

Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. InProceedings of the IEEE/CVF international conference on computer vision, pages 7708–7717, 2019

2019

[43] [43]

Dreamdrive: Generative 4d scene modeling from street view images

Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, and Yue Wang. Dreamdrive: Generative 4d scene modeling from street view images. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 367–374. IEEE, 2025

2025

[44] [44]

Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, and Kaipeng Zhang. Yume1. 5: A text-controlled interactive world generation model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7752–7761, 2026

2026

[45] [45]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoor- thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

2021

[46] [46]

Video generation models as world simulators, 2024

OpenAI. Video generation models as world simulators, 2024. URL https://openai.com/ index/video-generation-models-as-world-simulators/. Accessed: 2024

2024

[47] [47]

Movie gen: A cast of media foundation models, 2025

Adam Polyak et al. Movie gen: A cast of media foundation models, 2025. URL https: //arxiv.org/abs/2410.13720

Pith/arXiv arXiv 2025

[48] [48]

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

2020

[49] [49]

Generative gaussian splatting: Generating 3d scenes with video diffusion priors

Katja Schwarz, Norman Mueller, and Peter Kontschieder. Generative gaussian splatting: Generating 3d scenes with video diffusion priors. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27510–27520, 2025

2025

[50] [50]

Lyra 2.0: Explorable generative 3d worlds.arXiv preprint arXiv:2604.13036, 2026

Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, and Xuanchi Ren. Lyra 2.0: Explorable generative 3d worlds.arXiv preprint arXiv:2604.13036, 2026

Pith/arXiv arXiv 2026

[51] [51]

Mvdream: Multi- view diffusion for 3d generation.arXiv preprint arXiv:2308.16512, 2023

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi- view diffusion for 3d generation.arXiv preprint arXiv:2308.16512, 2023

Pith/arXiv arXiv 2023

[52] [52]

Splatter image: Ultra-fast single-view 3d reconstruction

Stanislaw Szymanowicz, Chrisitian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10208–10217, 2024

2024

[53] [53]

Bolt3d: Generating 3d scenes in seconds

Stanislaw Szymanowicz, Jason Y Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Alek- sander Holynski, Ricardo Martin-Brualla, Jonathan T Barron, and Philipp Henzler. Bolt3d: Generating 3d scenes in seconds. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24846–24857, 2025

2025

[54] [54]

3d gaussian flats: Hybrid 2d/3d photometric scene reconstruction.arXiv preprint arXiv:2509.16423, 2025

Maria Taktasheva, Lily Goli, Alessandro Fiorini, Zhen Li, Daniel Rebain, and Andrea Tagliasac- chi. 3d gaussian flats: Hybrid 2d/3d photometric scene reconstruction.arXiv preprint arXiv:2509.16423, 2025

arXiv 2025

[55] [55]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 13

Pith/arXiv arXiv 2025

[56] [56]

Ati: Any trajectory instruction for controllable video generation.arXiv preprint arXiv:2505.22944, 2025

Angtian Wang, Haibin Huang, Jacob Zhiyuan Fang, Yiding Yang, and Chongyang Ma. Ati: Any trajectory instruction for controllable video generation.arXiv preprint arXiv:2505.22944, 2025

arXiv 2025

[57] [57]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

2025

[58] [58]

Imagedream: Image-prompt multi-view diffusion for 3d generation

Peng Wang and Yichun Shi. Imagedream: Image-prompt multi-view diffusion for 3d generation. arXiv preprint arXiv:2312.02201, 2023

arXiv 2023

[59] [59]

Video models are zero-shot learners and reasoners,

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners,

[60] [60]

URLhttps://arxiv.org/abs/2509.20328

Pith/arXiv arXiv

[61] [61]

Gs2mesh: Surface reconstruction from gaussian splatting via novel stereo views

Yaniv Wolf, Amit Bracha, and Ron Kimmel. Gs2mesh: Surface reconstruction from gaussian splatting via novel stereo views. InEuropean Conference on Computer Vision, pages 207–224. Springer, 2024

2024

[62] [62]

Depthsplat: Connecting gaussian splatting and depth

Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16453–16463, 2025

2025

[63] [63]

Sketch2scene: Automatic generation of interactive 3d game scenes from user’s casual sketches.arXiv preprint arXiv:2408.04567, 2024

Yongzhi Xu, Yonhon Ng, Yifu Wang, Inkyu Sa, Yunfei Duan, Zhenhong Sun, Yang Li, Pan Ji, and Hongdong Li. Sketch2scene: Automatic generation of interactive 3d game scenes from user’s casual sketches.arXiv preprint arXiv:2408.04567, 2024

arXiv 2024

[64] [64]

Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation.arXiv preprint arXiv:2509.22622, 2025

Pith/arXiv arXiv 2025

[65] [65]

X-scene: Large- scale driving scene generation with high fidelity and flexible controllability.arXiv preprint arXiv:2506.13558, 2025

Yu Yang, Alan Liang, Jianbiao Mei, Yukai Ma, Yong Liu, and Gim Hee Lee. X-scene: Large- scale driving scene generation with high fidelity and flexible controllability.arXiv preprint arXiv:2506.13558, 2025

arXiv 2025

[66] [66]

Holodeck: Language guided generation of 3d embodied ai environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

2024

[67] [67]

Yonosplat: You only need one model for feedforward 3d gaussian splatting.arXiv preprint arXiv:2511.07321, 2025

Botao Ye, Boqi Chen, Haofei Xu, Daniel Barath, and Marc Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting.arXiv preprint arXiv:2511.07321, 2025

arXiv 2025

[68] [68]

Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024

Pith/arXiv arXiv 2024

[69] [69]

Immersegen: Agent-guided immersive world generation with alpha-textured proxies.IEEE Transactions on Visualization and Computer Graphics, 2026

Jinyan Yuan, Bangbang Yang, Keke Wang, Panwang Pan, Lin Ma, Xuehai Zhang, Xiao Liu, Zhaopeng Cui, and Yuewen Ma. Immersegen: Agent-guided immersive world generation with alpha-textured proxies.IEEE Transactions on Visualization and Computer Graphics, 2026

2026

[70] [70]

Helios: Real real-time long video generation model.arXiv preprint arXiv:2603.04379, 2026

Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, and Li Yuan. Helios: Real real-time long video generation model.arXiv preprint arXiv:2603.04379, 2026

arXiv 2026

[71] [71]

Gs-lrm: Large reconstruction model for 3d gaussian splatting

Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2024

2024

[72] [72]

The unrea- sonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 14

2018

[73] [73]

Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21936–21947, 2025

2025

[74] [74]

Worldstereo: Bridging camera-guided video generation and scene reconstruction via 3d geometric memories.arXiv preprint arXiv:2603.02049, 2026

Yisu Zhang, Chenjie Cao, Tengfei Wang, Xuhui Zuo, Junta Wu, Jianke Zhu, and Chunchao Guo. Worldstereo: Bridging camera-guided video generation and scene reconstruction via 3d geometric memories.arXiv preprint arXiv:2603.02049, 2026

arXiv 2026

[75] [75]

Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements.arXiv preprint arXiv:2504.08212, 2025

Guangcong Zheng, Teng Li, Xianpan Zhou, and Xi Li. Realcam-vid: High-resolution video dataset with dynamic scenes and metric-scale camera movements.arXiv preprint arXiv:2504.08212, 2025

arXiv 2025

[76] [76]

Stereo magnifi- cation: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnifi- cation: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

Pith/arXiv arXiv 2018

[77] [77]

Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4349–4359, 2025. 15 A Pipeline Flexibility A useful property of FLAT is that it generates scene pa...

2025