Recognition: no theorem link
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
Pith reviewed 2026-05-10 16:25 UTC · model grok-4.3
The pith
3D scene generation moves into a compact implicit 3D latent space built from frozen 2D encoders, allowing fixed 1K-token representations that support consistent output from any viewpoint.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose the first method to perform 3D scene generation directly within an implicit 3D latent space. We repurpose frozen 2D representation encoders to construct a 3D Representation Autoencoder (3DRAE) that grounds view-coupled 2D semantic features into a view-decoupled 3D latent representation. This representation encodes any scene from arbitrary views at any resolution and aspect ratio using fixed complexity and rich semantics. We then introduce a 3D Diffusion Transformer (3DDiT) that performs diffusion modeling inside this latent space, achieving efficient and spatially consistent generation that supports diverse conditioning inputs and allows decoding to images and point maps along any
What carries the argument
The 3D Representation Autoencoder (3DRAE), which converts multi-view 2D semantic features into a single view-decoupled 3D latent representation with constant token count that preserves semantics and supports consistent generation.
If this is right
- Any generated 3D scene can be decoded into consistent images or point maps along arbitrary camera trajectories without running diffusion again for each new path.
- Representation complexity stays fixed at roughly 1K tokens regardless of the number of input views, output views, image resolution, or aspect ratio.
- Spatial consistency across all viewpoints is enforced at the latent level rather than through post-hoc alignment of 2D outputs.
- Diverse conditioning signals such as text, single images, or partial 3D data can be injected directly into the 3D diffusion process.
Where Pith is reading between the lines
- The fixed token budget could allow scaling to larger or more detailed scenes without the quadratic cost growth typical of view-based methods.
- Because the latent space is built from existing 2D pre-trained encoders, the same architecture might transfer to 3D reconstruction or editing tasks with minimal additional 3D supervision.
- Real-time applications such as interactive scene synthesis in VR could become feasible if the 3D latent diffusion step runs at interactive rates.
Load-bearing premise
That semantic features from frozen 2D encoders, when aggregated across views, contain enough geometric structure to form a truly view-independent 3D latent space without dedicated 3D training data or losses.
What would settle it
Generate a scene, then render it from a continuous sequence of novel camera poses; if the outputs show geometric distortions, depth inconsistencies, or view-dependent artifacts not correctable by simple rendering, the claim that the latent space is inherently 3D-consistent would be falsified.
Figures
read the original abstract
3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists in the form of multi-view images or videos, which are naturally compatible with 2D diffusion architectures. Typically, these 2D-based approaches degrade 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes. In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views--at any resolution and aspect ratio--with fixed complexity and rich semantics. Then we introduce 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations. Moreover, since our approach directly generates a 3D scene representation, it can be decoded to images and optional point maps along arbitrary camera trajectories without requiring per-trajectory diffusion sampling pass, which is common in 2D-based approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes performing 3D scene generation directly within an implicit 3D latent space. It introduces a 3D Representation Autoencoder (3DRAE) that repurposes frozen 2D representation encoders to convert view-coupled 2D semantics into a view-decoupled 3D latent representation capable of encoding scenes from arbitrary numbers of views, resolutions, and aspect ratios at fixed complexity with rich semantics. A 3D Diffusion Transformer (3DDiT) then performs diffusion modeling in this latent space to enable efficient, spatially consistent generation under diverse conditioning. The resulting 3D representation can be decoded to images and optional point maps along arbitrary camera trajectories without per-trajectory diffusion passes.
Significance. If the 3D latent and diffusion model deliver the claimed view-decoupling, semantic preservation, and spatial consistency, the work would be significant for 3D scene generation. It directly targets the redundancy and consistency limitations of 2D multi-view/video diffusion approaches by operating in a native 3D latent, potentially enabling more scalable generation with fixed token complexity and flexible decoding. The direct 3D output without repeated sampling is a practical advantage.
major comments (3)
- [Abstract] Abstract: The central claims of 'remarkably efficient and spatially consistent 3D scene generation' and effective view-decoupling via 3DRAE rest on unverified assumptions about the latent's properties; no quantitative metrics, ablations, or comparisons to 2D baselines are supplied to substantiate these, which is load-bearing for the contribution.
- [§3] §3 (3DRAE construction): The repurposing of frozen 2D encoders to produce a view-decoupled 3D latent with fixed complexity is outlined at a high level only; the absence of architectural details, any equations defining the latent mapping, or training objectives makes it impossible to evaluate whether semantics are preserved or redundancy is actually reduced.
- [§4] §4 (Experiments): No training details, loss functions, quantitative metrics (e.g., consistency scores, FID, or efficiency benchmarks), ablation studies, or error analysis are present to test whether 3DDiT sampling in the 3D latent yields the promised spatial consistency across trajectories.
minor comments (2)
- [Title/Abstract] The title references '1K Tokens' but the abstract and description do not specify how this exact token count is derived or enforced in the 3D latent representation.
- [§3] Notation for the 3D latent and conditioning mechanisms could be clarified earlier to aid readability, as the current high-level sketch leaves the mapping from 2D features to 3D tokens implicit.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas where the manuscript can be strengthened. We agree that the current version is high-level in several sections and will provide additional details, equations, and quantitative evaluations in the revised manuscript to better substantiate our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claims of 'remarkably efficient and spatially consistent 3D scene generation' and effective view-decoupling via 3DRAE rest on unverified assumptions about the latent's properties; no quantitative metrics, ablations, or comparisons to 2D baselines are supplied to substantiate these, which is load-bearing for the contribution.
Authors: We acknowledge that the abstract's claims regarding efficiency, spatial consistency, and view-decoupling require empirical support to be fully convincing. The 3DRAE and 3DDiT are designed to achieve these properties by operating in a fixed-complexity 3D latent that decouples views, but the current manuscript does not include the requested metrics or comparisons. In the revision, we will add quantitative results such as consistency scores across trajectories, FID metrics, efficiency benchmarks (e.g., token usage and sampling time versus 2D multi-view baselines), and ablations demonstrating the latent's view-decoupling and semantic preservation. revision: yes
-
Referee: [§3] §3 (3DRAE construction): The repurposing of frozen 2D encoders to produce a view-decoupled 3D latent with fixed complexity is outlined at a high level only; the absence of architectural details, any equations defining the latent mapping, or training objectives makes it impossible to evaluate whether semantics are preserved or redundancy is actually reduced.
Authors: The current §3 focuses on the conceptual framework of repurposing frozen 2D encoders to aggregate multi-view semantics into a fixed 1K-token 3D latent. We agree that architectural specifics, equations, and objectives are needed for reproducibility and evaluation. The revised manuscript will expand this section with: (i) detailed architecture diagrams and pseudocode, (ii) equations defining the latent mapping (e.g., how view-coupled 2D features are projected and aggregated into view-decoupled 3D tokens while preserving semantics), and (iii) the training objectives, including any reconstruction losses or regularization terms used to reduce redundancy and maintain semantic richness. revision: yes
-
Referee: [§4] §4 (Experiments): No training details, loss functions, quantitative metrics (e.g., consistency scores, FID, or efficiency benchmarks), ablation studies, or error analysis are present to test whether 3DDiT sampling in the 3D latent yields the promised spatial consistency across trajectories.
Authors: We agree that §4 currently lacks the experimental rigor needed to validate the spatial consistency and efficiency claims. The revised version will include: comprehensive training details (datasets, hyperparameters, optimizer settings), explicit loss functions for both 3DRAE and 3DDiT, quantitative metrics (consistency scores, FID, LPIPS, efficiency comparisons), ablation studies (e.g., varying token count, conditioning types, and 3D vs. 2D latent baselines), and error analysis (failure cases and trajectory consistency measurements). These additions will directly test the benefits of diffusion in the 3D latent space. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's core proposal introduces 3DRAE (repurposing frozen 2D encoders into a view-decoupled 3D latent) and 3DDiT (diffusion in that latent) as new architectural components. No equations, fitted parameters, or derivations are shown that reduce the claimed performance or consistency to quantities defined by the inputs themselves. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain is self-contained as an empirical architectural design rather than a closed mathematical reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Frozen 2D representation encoders can be repurposed to ground view-coupled 2D semantic features into a view-decoupled 3D latent representation with fixed complexity.
invented entities (2)
-
3D Representation Autoencoder (3DRAE)
no independent evidence
-
3D Diffusion Transformer (3DDiT)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
International Journal of Computer Vision120(2), 153–168 (2016) 32
Aanæs, H., Jensen, R.R., Vogiatzis, G., Tola, E., Dahl, A.B.: Large-scale data for multiple-view stereopsis. International Journal of Computer Vision120(2), 153–168 (2016) 32
2016
-
[2]
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao,C.,Ge,C.,etal.:Qwen3-vltechnicalreport.arXivpreprintarXiv:2511.21631 (2025) 34
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025) 34
2025
-
[4]
CVPR (2022) 32
Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-nerf 360: Unbounded anti-aliased neural radiance fields. CVPR (2022) 32
2022
-
[5]
arXiv preprint arXiv:2111.08897 (2021) 4, 8
Baruch, G., Chen, Z., Dehghan, A., Dimry, T., Feigin, Y., Fu, P., Gebauer, T., Joffe, B., Kurz, D., Schwartz, A., et al.: Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897 (2021) 10, 28
-
[6]
Advances in Neural Information Processing Systems35, 25102– 25116 (2022) 2
Bautista, M.A., Guo, P., Abnar, S., Talbott, W., Toshev, A., Chen, Z., Dinh, L., Zhai, S., Goh, H., Ulbricht, D., et al.: Gaudi: A neural architect for immersive 3d scene generation. Advances in Neural Information Processing Systems35, 25102– 25116 (2022) 2
2022
-
[7]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023) 2, 4, 5
work page internal anchor Pith review arXiv 2023
-
[8]
Cabon, Y., Murray, N., Humenberger, M.: Virtual kitti 2. arXiv preprint arXiv:2001.10773 (2020) 28
work page internal anchor Pith review arXiv 2001
-
[9]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3d generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16123–16133 (2022) 2
2022
-
[10]
In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition
Charatan, D., Li, S.L., Tagliasacchi, A., Sitzmann, V.: pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 19457–19467 (2024) 3, 4 16 F. Author et al
2024
-
[11]
In: Proceedings of theIEEE/CVFinternationalconferenceoncomputervision.pp.2416–2425(2023) 2
Chen, H., Gu, J., Chen, A., Tian, W., Tu, Z., Liu, L., Su, H.: Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. In: Proceedings of theIEEE/CVFinternationalconferenceoncomputervision.pp.2416–2425(2023) 2
2023
-
[12]
In: Forty-second International Conference on Machine Learning (2025) 5
Chen, H., Han, Y., Chen, F., Li, X., Wang, Y., Wang, J., Wang, Z., Liu, Z., Zou, D., Raj, B.: Masked autoencoders are effective tokenizers for diffusion models. In: Forty-second International Conference on Machine Learning (2025) 5
2025
-
[13]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., Shan, Y.: Videocrafter2: Overcoming data limitations for high-quality video diffusion mod- els. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7310–7320 (2024) 2
2024
-
[14]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Chen, J., Zou, D., He, W., Chen, J., Xie, E., Han, S., Cai, H.: Dc-ae 1.5: Acceler- ating diffusion model convergence with structured latent space. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19628–19637 (2025) 5
2025
-
[15]
arXiv preprint arXiv:2503.13265 (2025)
Chen, L., Zhou, Z., Zhao, M., Wang, Y., Zhang, G., Huang, W., Sun, H., Wen, J.R., Li, C.: Flexworld: Progressively expanding 3d scenes for flexiable-view syn- thesis. arXiv preprint arXiv:2503.13265 (2025) 2, 4, 34
-
[16]
In: European conference on computer vision
Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J.: Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In: European conference on computer vision. pp. 370–386. Springer (2024) 3, 4
2024
-
[17]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scan- net: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5828–5839 (2017) 10, 28, 29
2017
-
[18]
In: Forty-first international conference on machine learning (2024) 26
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz,D.,Sauer,A.,Boesel,F.,etal.:Scalingrectifiedflowtransformersforhigh- resolution image synthesis. In: Forty-first international conference on machine learning (2024) 26
2024
-
[19]
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Fan, Z., Zhang, J., Li, R., Zhang, J., Chen, R., Hu, H., Wang, K., Qu, H., Wang, D., Yan, Z., et al.: Vlm-3r: Vision-language models augmented with instruction- aligned 3d reconstruction. arXiv preprint arXiv:2505.20279 (2025) 34
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Cat3d: Create any- thing in 3d with multi-view diffusion models,
Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314 (2024) 2, 4
-
[21]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Go, H., Park, B., Jang, J., Kim, J.Y., Kwon, S., Kim, C.: Splatflow: Multi-view rectified flow model for 3d gaussian splatting synthesis. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21524–21536 (2025) 2, 4
2025
-
[22]
Hartley,R.,Zisserman,A.:Multipleviewgeometryincomputervision.Cambridge university press (2003) 29
2003
-
[23]
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalablevisionlearners.In:ProceedingsoftheIEEE/CVFconferenceoncomputer vision and pattern recognition. pp. 16000–16009 (2022) 5, 14
2022
-
[24]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Henzler, P., Mitra, N.J., Ritschel, T.: Escaping plato’s cave: 3d shape from ad- versarial rendering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9984–9993 (2019) 2
2019
-
[25]
Advances in neural information processing systems35, 8633– 8646 (2022) 2 3DRAE 17
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. Advances in neural information processing systems35, 8633– 8646 (2022) 2 3DRAE 17
2022
-
[26]
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 28
Huang, P.H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.B.: Deepmvs: Learn- ing multi-view stereopsis. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 28
2018
-
[27]
ACM Transactions on Graphics (TOG)44(6), 1–15 (2025) 2, 4
Huang, T., Zheng, W., Wang, T., Liu, Y., Wang, Z., Wu, J., Jiang, J., Li, H., Lau, R., Zuo, W., et al.: Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. ACM Transactions on Graphics (TOG)44(6), 1–15 (2025) 2, 4
2025
-
[28]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Jiang, H., Tan, H., Wang, P., Jin, H., Zhao, Y., Bi, S., Zhang, K., Luan, F., Sunkavalli, K., Huang, Q., et al.: Rayzer: A self-supervised large view synthesis model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4918–4929 (2025) 3, 4
2025
-
[29]
Jin, H., Jiang, H., Tan, H., Zhang, K., Bi, S., Zhang, T., Luan, F., Snavely, N., Xu, Z.: Lvsm: A large view synthesis model with minimal 3d inductive bias. arXiv preprint arXiv:2410.17242 (2024) 3, 4, 10, 11, 12
-
[30]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Karnewar, A., Vedaldi, A., Novotny, D., Mitra, N.J.: Holodiffusion: Training a 3d diffusion model using 2d images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18423–18433 (2023) 2
2023
-
[31]
ACM Trans
Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., et al.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph.42(4), 139–1 (2023) 3, 4
2023
-
[32]
Kim, S.W., Brown, B., Yin, K., Kreis, K., Schwarz, K., Li, D., Rombach, R., Torralba, A., Fidler, S.: Neuralfield-ldm: Scene generation with hierarchical latent diffusionmodels.In:ProceedingsoftheIEEE/CVFconferenceoncomputervision and pattern recognition. pp. 8496–8506 (2023) 2
2023
-
[33]
ACM Transactions on Graphics36(4) (2017) 32
Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: Benchmark- ing large-scale scene reconstruction. ACM Transactions on Graphics36(4) (2017) 32
2017
-
[34]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024) 2, 4, 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
In: International conference on ma- chine learning
Larsen, A.B.L., Sønderby, S.K., Larochelle, H., Winther, O.: Autoencoding be- yond pixels using a learned similarity metric. In: International conference on ma- chine learning. pp. 1558–1566. PMLR (2016) 7
2016
-
[36]
In: European conference on computer vision
Leroy, V., Cabon, Y., Revaud, J.: Grounding image matching in 3d with mast3r. In: European conference on computer vision. pp. 71–91. Springer (2024) 3
2024
-
[37]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Li, R., Torr, P., Vedaldi, A., Jakab, T.: Vmem: Consistent interactive video scene generation with surfel-indexed view memory. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 25690–25699 (2025) 2, 4, 34
2025
-
[38]
Advances in neural information processing systems37, 75125–75151 (2024) 2, 4
Li, X., Lai, Z., Xu, L., Qu, Y., Cao, L., Zhang, S., Dai, B., Ji, R.: Director3d: Real-world camera trajectory and 3d scene generation from text. Advances in neural information processing systems37, 75125–75151 (2024) 2, 4
2024
-
[39]
In: Computer Vision and Pattern Recognition (CVPR) (2018) 28
Li, Z., Snavely, N.: Megadepth: Learning single-view depth prediction from inter- net photos. In: Computer Vision and Pattern Recognition (CVPR) (2018) 28
2018
-
[40]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Liang, H., Cao, J., Goel, V., Qian, G., Korolev, S., Terzopoulos, D., Plataniotis, K.N., Tulyakov, S., Ren, J.: Wonderland: Navigating 3d scenes from a single im- age. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 798–810 (2025) 2, 4
2025
-
[41]
Depth Anything 3: Recovering the Visual Space from Any Views
Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025) 3, 9, 13, 23, 24 18 F. Author et al
work page internal anchor Pith review arXiv 2025
-
[42]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22160–22169 (2024) 10, 13, 14, 28, 29
2024
-
[43]
Flow Matching for Generative Modeling
Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022) 9
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[44]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022) 9
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[45]
Com- munications of the ACM65(1), 99–106 (2021) 3, 4
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Com- munications of the ACM65(1), 99–106 (2021) 3, 4
2021
-
[46]
In: Proceedings of the IEEE international conference on computer vision
Neuhold, G., Ollmann, T., Rota Bulo, S., Kontschieder, P.: The mapillary vistas dataset for semantic understanding of street scenes. In: Proceedings of the IEEE international conference on computer vision. pp. 4990–4999 (2017) 28
2017
-
[47]
In: Proceedings of the IEEE/CVF international conference on computer vision
Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., Yang, Y.L.: Hologan: Unsuper- vised learning of 3d representations from natural images. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 7588–7597 (2019) 2
2019
-
[48]
Advances in neural information processing systems33, 6767–6778 (2020) 2
Nguyen-Phuoc, T.H., Richardt, C., Mai, L., Yang, Y., Mitra, N.: Blockgan: Learn- ing 3d object-aware scene representations from unlabelled images. Advances in neural information processing systems33, 6767–6778 (2020) 2
2020
-
[49]
In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition
Niemeyer, M., Geiger, A.: Giraffe: Representing scenes as compositional genera- tive neural feature fields. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 11453–11464 (2021) 2
2021
-
[50]
DINOv2: Learning Robust Visual Features without Supervision
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,etal.:Dinov2:Learningrobust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 3, 5, 13, 23, 24, 25
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
In: Proceedings of the IEEE/CVF international conference on computer vision
Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 8
2023
-
[52]
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Ramakrishnan, S.K., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J., Undersander, E., Galuba, W., Westbury, A., Chang, A.X., et al.: Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for em- bodied ai. arXiv preprint arXiv:2109.08238 (2021) 28
work page internal anchor Pith review arXiv 2021
-
[53]
In: Proceedings of the IEEE/CVF international conference on computer vision
Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12179–12188 (2021) 7, 24
2021
-
[54]
In: Proceedings of the IEEE/CVF international conference on computer vision
Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.: Common objects in 3d: Large-scale learning and evaluation of real-life 3d cat- egory reconstruction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10901–10911 (2021) 10, 28, 29
2021
-
[55]
In: Proceedings of the IEEE/CVF international conference on computer vision
Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holis- tic indoor scene understanding. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10912–10922 (2021) 28
2021
-
[56]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2, 5 3DRAE 19
2022
-
[57]
International journal of computer vision115(3), 211–252 (2015) 8, 14, 24, 26, 28
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recog- nition challenge. International journal of computer vision115(3), 211–252 (2015) 8, 14, 24, 26, 28
2015
-
[58]
Advances in neural information processing systems 33, 20154–20166 (2020) 2
Schwarz, K., Liao, Y., Niemeyer, M., Geiger, A.: Graf: Generative radiance fields for 3d-aware image synthesis. Advances in neural information processing systems 33, 20154–20166 (2020) 2
2020
-
[59]
In: 2012 IEEE/RSJ international conference on intelligent robots and systems
Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: 2012 IEEE/RSJ international conference on intelligent robots and systems. pp. 573–580. IEEE (2012) 29
2012
-
[60]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2446–2454 (2020) 10, 28
2020
-
[61]
In: European Conference on Computer Vision
Tang, J., Chen, Z., Chen, X., Wang, T., Zeng, G., Liu, Z.: Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In: European Conference on Computer Vision. pp. 1–18. Springer (2024) 3, 4
2024
-
[62]
Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, lo- calization, and dense features. arXiv preprint arXiv:2502.14786 (2025) 3, 5, 13, 23, 24
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
Wan: Open and Advanced Large-Scale Video Generative Models
Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025) 2, 4, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025) 3, 7, 24, 29
2025
-
[65]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wang, Q., Wang, Z., Genova, K., Srinivasan, P.P., Zhou, H., Barron, J.T., Martin- Brualla, R., Snavely, N., Funkhouser, T.: Ibrnet: Learning multi-view image-based rendering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4690–4699 (2021) 3, 4
2021
-
[66]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Wang, Q., Zhang, Y., Holynski, A., Efros, A.A., Kanazawa, A.: Continuous 3d perception model with persistent state. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10510–10522 (2025) 3, 4, 28
2025
-
[67]
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
Wang, R., Xu, S., Dong, Y., Deng, Y., Xiang, J., Lv, Z., Sun, G., Tong, X., Yang, J.: Moge-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546 (2025) 9, 28
work page internal anchor Pith review arXiv 2025
-
[68]
DDT: Decoupled diffusion Transformer.arXiv:2504.05741, 2025
Wang, S., Tian, Z., Huang, W., Wang, L.: Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741 (2025) 5, 26, 27
-
[69]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20697–20709 (2024) 3
2024
-
[70]
Wang, W., Zhu, D., Wang, X., Hu, Y., Qiu, Y., Wang, C., Hu, Y., Kapoor, A., Scherer, S.: Tartanair: A dataset to push the limits of visual slam (2020) 28
2020
-
[71]
$\pi^3$: Permutation-Equivariant Visual Geometry Learning
Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T.: pi3: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347 (2025) 9
work page internal anchor Pith review arXiv 2025
-
[72]
In: ACM SIGGRAPH 2024 Conference Papers
Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024) 2, 4, 10, 11, 12, 34 20 F. Author et al
2024
-
[73]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Wei, D., Li, Z., Liu, P.: Omni-scene: Omni-gaussian representation for ego-centric sparse-view scene reconstruction. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22317–22327 (2025) 3, 4
2025
-
[74]
Wu, D., Liu, F., Hung, Y.H., Duan, Y.: Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747 (2025) 34
-
[75]
Wu, G., Zhang, S., Shi, R., Gao, S., Chen, Z., Wang, L., Chen, Z., Gao, H., Tang, Y., Yang, J., et al.: Representation entanglement for generation: Training diffu- sion transformers is much easier than you think. arXiv preprint arXiv:2507.01467 (2025) 5
-
[76]
Xia, H., Fu, Y., Liu, S., Wang, X.: Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos (2024) 10, 28
2024
-
[77]
Native and Compact Structured Latents for 3D Generation.arXiv preprint arXiv:2512.14692, 2025a
Xiang, J., Chen, X., Xu, S., Wang, R., Lv, Z., Deng, Y., Zhu, H., Dong, Y., Zhao, H., Yuan, N.J., et al.: Native and compact structured latents for 3d generation. arXiv preprint arXiv:2512.14692 (2025) 4
-
[78]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Xu,H.,Peng,S.,Wang,F.,Blum,H.,Barath,D.,Geiger,A.,Pollefeys,M.:Depth- splat: Connecting gaussian splatting and depth. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16453–16463 (2025) 3, 4, 10, 11, 12
2025
-
[79]
In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025) 3, 4
Xu, Q., Wei, D., Zhao, L., Li, W., Huang, Z., Ji, S., Liu, P.: Siu3r: Simulta- neous scene understanding and 3d reconstruction beyond feature alignment. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025) 3, 4
2025
-
[80]
arXiv preprint arXiv:2507.15856 (2025) 5
Yang, J., Li, T., Fan, L., Tian, Y., Wang, Y.: Latent denoising makes good visual tokenizers. arXiv preprint arXiv:2507.15856 (2025) 5
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.