pith. machine review for the scientific record.

arxiv: 2412.01506 · v3 · submitted 2024-12-02 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Structured 3D Latents for Scalable and Versatile 3D Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 15:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D generation · structured latents · SLAT · rectified flow transformers · multiview features · text-to-3D · image-to-3D · versatile 3D assets

The pith

A structured latent that merges sparse 3D grids with dense multiview features supports high-quality generation of 3D assets in multiple output formats from text or image input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a unified Structured LATent (SLAT) representation that encodes both geometry and appearance by combining a sparsely populated 3D grid with dense visual features from a vision foundation model. This single latent space can be decoded into radiance fields, 3D Gaussians, or meshes without retraining the generator. Rectified-flow transformers with up to two billion parameters, trained on 500K objects, produce results that exceed prior methods under text or image conditioning. The design also enables local 3D editing, a capability not available in earlier single-format generators.

Core claim

Integrating a sparsely populated 3D grid with dense multiview visual features from a vision foundation model produces a flexible latent representation that captures both structural geometry and textural appearance. A single generator trained on this latent can decode into radiance fields, 3D Gaussians, or meshes, and scales effectively to models of up to two billion parameters.

What carries the argument

The Structured LATent (SLAT) representation, formed by fusing a sparse 3D grid with dense multiview features extracted from a vision foundation model to jointly encode geometry and appearance while preserving decoding flexibility.
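As one concrete reading of that fusion, a minimal sketch of a SLAT-like container follows: a list of active voxel coordinates paired with features mean-pooled from the views that see each voxel. The names, shapes, and pooling rule are illustrative assumptions, not the paper's implementation.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class SLAT:
        coords: np.ndarray   # (N, 3) integer indices of active voxels in an L^3 grid
        feats: np.ndarray    # (N, C) dense features pooled from multiview images
        grid_size: int       # L, the resolution of the sparse grid

    def build_slat(voxel_coords, multiview_feats, visibility, grid_size=64):
        """Pool per-view foundation-model features onto each active voxel.

        multiview_feats: (V, N, C) features projected from V views onto N voxels.
        visibility:      (V, N) array, 1.0 where a voxel is visible in a view.
        """
        w = visibility[..., None]                                # (V, N, 1)
        pooled = (multiview_feats * w).sum(axis=0) / np.clip(w.sum(axis=0), 1e-6, None)
        return SLAT(coords=voxel_coords, feats=pooled, grid_size=grid_size)

The sparse coordinate list carries the geometry, the pooled feature vectors carry the appearance, and downstream decoders read both.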

If this is right

  • A single generator can output radiance fields, 3D Gaussians, or meshes on demand after training (sketched after this list).
  • Local editing of generated 3D assets becomes feasible without retraining the model.
  • Model scale up to two billion parameters remains stable when trained on a 500K-object dataset.
  • Performance surpasses prior methods at similar model sizes under both text and image conditioning.
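The format-flexibility bullet above, sketched minimally: one shared latent feeds several lightweight decoder heads, so a new output format means a new head rather than a new generator. The head names and output dimensions are illustrative assumptions, not the paper's decoders.

    import torch
    import torch.nn as nn

    class MultiFormatDecoder(nn.Module):
        """One SLAT feature tensor in, one of several 3D formats out."""
        def __init__(self, latent_dim):
            super().__init__()
            self.heads = nn.ModuleDict({
                "radiance":  nn.Linear(latent_dim, 4),   # e.g. density + RGB per voxel
                "gaussians": nn.Linear(latent_dim, 14),  # e.g. mean, scale, rotation, opacity, color
                "mesh":      nn.Linear(latent_dim, 1),   # e.g. signed distance for isosurfacing
            })

        def forward(self, slat_feats, fmt):
            # slat_feats: (N, latent_dim) features on active voxels
            return self.heads[fmt](slat_feats)

In practice each real decoder would be far heavier than a linear layer; the point is only that the generator upstream of the heads is shared.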

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The separation of sparse structural encoding from dense appearance features may simplify downstream tasks such as material editing or animation transfer.
  • Because the latent supports multiple decoders, the same generator could be paired with new output representations developed after training.
  • Scaling laws observed for the rectified-flow transformers on this latent may guide further increases in model size for even higher fidelity.

Load-bearing premise

Combining the sparse 3D grid with dense multiview features from a foundation model is sufficient to capture both geometry and appearance without restricting the range of possible output formats.

What would settle it

A controlled benchmark in which the SLAT model produces visibly lower-quality or less consistent 3D assets than recent comparable-scale baselines when evaluated on identical text-to-3D and image-to-3D prompts.

Original abstract

We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
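The "rectified flow transformers" in the abstract train with a flow-matching objective along straight noise-to-data paths. A minimal sketch of that training step over latent feature tensors, assuming a generic model(x, t, cond) signature rather than the paper's architecture:

    import torch
    import torch.nn.functional as F

    def rectified_flow_loss(model, x1, cond):
        """x1: clean latent features (B, N, C); cond: a text or image embedding."""
        x0 = torch.randn_like(x1)                     # noise endpoint of the path
        t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
        xt = (1.0 - t) * x0 + t * x1                  # straight-line interpolant
        v_target = x1 - x0                            # constant velocity along the path
        v_pred = model(xt, t.squeeze(), cond)         # network predicts the velocity
        return F.mse_loss(v_pred, v_target)

Sampling then integrates the learned velocity from noise to data, for example with a handful of Euler steps.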

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Structured LATent (SLAT), a unified 3D representation formed by integrating a sparsely-populated 3D grid with dense multiview visual features from a vision foundation model. This latent supports decoding into multiple output formats including Radiance Fields, 3D Gaussians, and meshes. The authors train rectified flow transformers (up to 2B parameters) on a 500K-object dataset and claim that the resulting models produce high-quality 3D assets from text or image conditions, significantly outperforming prior methods at comparable scales while also enabling flexible format selection and local editing.

Significance. If the empirical claims are substantiated, the work would represent a meaningful advance in scalable 3D generation by offering a single latent that preserves both geometry and appearance while supporting multiple downstream decoders. The scale of training (2B parameters on 500K assets) and the promised public release of code, models, and data would further increase its utility for the community.

major comments (2)
  1. [Abstract] The central claim that the model 'significantly surpass[es] existing methods, including recent ones at similar scales' is presented without any quantitative metrics (e.g., FID, PSNR, Chamfer distance; the standard Chamfer definition is recalled after this report), ablation tables, or comparative results. This absence makes the superiority assertion impossible to evaluate, even though the claim is load-bearing for the paper's main contribution.
  2. [Abstract] The assertion that the SLAT representation 'comprehensively captur[es] both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding' lacks any supporting analysis, reconstruction-error bounds, or ablation showing that the sparse-grid + multiview-feature combination is information-complete for all three target decoders. If the sparse grid under-samples fine geometry or the vision features misalign with 3D structure, the claimed decoding versatility cannot hold.
minor comments (1)
  1. The acronym SLAT is defined on first use, but subsequent references to 'SLAT' would benefit from a brief reminder of its components when the architecture is first detailed.
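For reference, the geometry metric named in major comment 1 is conventionally the symmetric Chamfer distance between point sets P and Q sampled from the two surfaces; the paper may use a squared or single-directional variant, so take this as the generic form:

    d_{\mathrm{CD}}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \lVert p - q \rVert_2^2
                          + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \lVert q - p \rVert_2^2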

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where the abstract could more clearly substantiate our claims. We address each major comment below with references to the full manuscript and commit to targeted revisions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the model 'significantly surpass[es] existing methods, including recent ones at similar scales' is presented without any quantitative metrics (e.g., FID, PSNR, Chamfer distance), ablation tables, or comparative results. This absence makes the superiority assertion impossible to evaluate, even though the claim is load-bearing for the paper's main contribution.

    Authors: We agree the abstract would be stronger with explicit metrics. Section 4 of the manuscript reports quantitative comparisons on standard benchmarks, including FID for generation quality, PSNR for novel-view rendering, and Chamfer distance for geometry accuracy, showing consistent gains over prior methods at comparable model scales (e.g., 1-2B parameters). We will revise the abstract to include concise references to these key metrics and point readers to the corresponding tables and figures. revision: yes

  2. Referee: [Abstract] The assertion that the SLAT representation 'comprehensively captur[es] both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding' lacks any supporting analysis, reconstruction-error bounds, or ablation showing that the sparse-grid + multiview-feature combination is information-complete for all three target decoders. If the sparse grid under-samples fine geometry or the vision features misalign with 3D structure, the claimed decoding versatility cannot hold.

    Authors: Sections 3.2 and 5.1 present reconstruction experiments and ablations that quantify geometry and appearance fidelity when decoding SLAT to Radiance Fields, 3D Gaussians, and meshes. These include per-decoder error metrics and ablation studies on grid sparsity and multiview feature alignment, demonstrating that the hybrid representation preserves the necessary information for high-quality outputs across formats. Formal information-theoretic bounds are not derived, as they are intractable for this learned hybrid latent; we will add a brief discussion of this point and of potential edge cases in sparse sampling. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical representation and training on external data

Full rationale

The paper introduces SLAT as a novel latent representation defined by the explicit integration of a sparse 3D grid and dense multiview features from an external vision foundation model. This is presented as an architectural choice trained end-to-end on a 500K-object dataset, with performance claims resting on empirical results rather than on any derivation that reduces to fitted parameters, self-referential definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that collapse the central claims back to the inputs by construction. The approach is validated against external benchmarks and does not rely on load-bearing self-citations for its core premise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract-only view provides no explicit free parameters or background axioms; the primary addition is the SLAT representation itself.

invented entities (1)
  • Structured LATent (SLAT) · no independent evidence
    purpose: Unified representation combining sparse 3D grid and dense multiview features for decoding to multiple 3D formats
    Described as the cornerstone of the method that captures structural and textural information while allowing flexible output.

pith-pipeline@v0.9.0 · 5495 in / 1223 out tokens · 43177 ms · 2026-05-16T15:05:40.004428+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing alexander_duality_circle_linking · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats... by integrating a sparsely-populated 3D grid with dense multiview visual features... comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.

  2. Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

    cs.CV 2026-04 unverdicted novelty 7.0

    A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.

  3. SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.

  4. ATATA: One Algorithm to Align Them All

    cs.CV 2026-01 unverdicted novelty 7.0

    ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.

  5. Affostruction: 3D Affordance Grounding with Generative Reconstruction

    cs.CV 2026-01 unverdicted novelty 7.0

    Affostruction reconstructs full 3D object geometry from partial RGBD views and grounds text-based affordances on both visible and unobserved surfaces, reporting large gains over prior methods.

  6. PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.

  7. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV 2026-05 unverdicted novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  8. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  9. REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.

  10. Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Pair2Scene generates complex 3D scenes beyond training data by training a network on local object-pair placement rules and applying them recursively with collision-aware sampling.

  11. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  12. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  13. UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.

  14. MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    MV-SAM3D adds multi-view fusion via multi-diffusion with attention-entropy and visibility weighting plus physics-aware optimization to improve fidelity and physical plausibility in layout-aware 3D generation.

  15. Depth Anything 3: Recovering the Visual Space from Any Views

    cs.CV 2025-11 unverdicted novelty 6.0

    DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

  16. Syn4D: A Multiview Synthetic 4D Dataset

    cs.CV 2026-05 unverdicted novelty 5.0

    Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

  17. Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...

  18. Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details

    cs.CV 2025-06 unverdicted novelty 4.0

    Hunyuan3D 2.5's LATTICE model with 10B parameters generates detailed 3D shapes from images and uses multi-view PBR for textures, outperforming prior methods in fidelity and mesh quality.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · cited by 18 Pith papers · 6 internal anchors
